CloudFormation in Production: What Breaks and How to Fix It


    Moving past YAML templates to failure handling, security, and real tradeoffs





    Before we start

    This is a follow-up to Infrastructure as Code with AWS CloudFormation: From Fundamentals to Production Patterns.


    That article covered templates, stacks, nested stacks, CI/CD, and production best practices.


    This article covers what happens when those best practices aren't enough. When things break in ways the documentation doesn't warn you about. When you're reading CloudFormation error messages at midnight and need answers.





    Part 1: Stack deployment failures

    Failure 1: "Resource handler returned message: 'Role does not exist'"

    Symptoms:
    • IAM role creates successfully (status: CREATE_COMPLETE)
    • Lambda or EC2 resource fails immediately after
    • Error: "The role named 'xxx' does not exist or is not authorized"


    Root cause:

    IAM is eventually consistent. CloudFormation marks the role as complete as soon as the API call returns, but the role can take several seconds to propagate to every AWS region and service endpoint.


    Fix:






    LambdaFunction:
      Type: AWS::Lambda::Function
      DependsOn: LambdaExecutionRole
      Properties:
        Role: !GetAtt LambdaExecutionRole.Arn







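    DependsOn makes the ordering explicit: CloudFormation will not start creating the Lambda function until the role reaches CREATE_COMPLETE. (The !GetAtt reference already implies the same ordering.) Neither mechanism waits for IAM propagation itself, so if the error still appears, retry the deployment.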


    Prevention:

    Always add DependsOn when a resource consumes an IAM role created in the same stack.
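The rule above can be checked mechanically. The sketch below is one possible way to do it, assuming the template has already been loaded into a Python dict in long form (`Fn::GetAtt` / `Ref` rather than the `!GetAtt` short tags). The function names are illustrative, and only top-level properties are inspected.

```python
def referenced_role(value, role_ids):
    """Return the logical ID of a same-stack IAM role referenced by value, if any."""
    if not isinstance(value, dict):
        return None
    target = None
    if "Fn::GetAtt" in value:
        att = value["Fn::GetAtt"]
        # Fn::GetAtt may be ["Role", "Arn"] or the string "Role.Arn".
        target = att[0] if isinstance(att, list) else att.split(".")[0]
    elif "Ref" in value:
        target = value["Ref"]
    return target if target in role_ids else None


def missing_depends_on(template):
    """List (resource_id, role_id) pairs where DependsOn does not name the role."""
    resources = template.get("Resources", {})
    role_ids = {
        name for name, res in resources.items()
        if res.get("Type") == "AWS::IAM::Role"
    }
    findings = []
    for name, res in resources.items():
        depends = res.get("DependsOn", [])
        if isinstance(depends, str):
            depends = [depends]
        for prop in res.get("Properties", {}).values():
            role = referenced_role(prop, role_ids)
            if role and role not in depends:
                findings.append((name, role))
    return findings
```

Running `missing_depends_on` in CI and failing the build on a non-empty result would enforce the convention without relying on review.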



    Failure 2: Stack timeout without clear cause

    Symptoms:
    • Stack creation or update times out after the configured limit
    • No obvious error in event log
    • Some resources show CREATE_IN_PROGRESS for hours


    Root cause:

    Resources with CreationPolicy or WaitCondition are waiting for signals that never arrive. Common causes:
    • EC2 instance user data script fails silently
    • Custom resource Lambda times out
    • Application code never calls cfn-signal


    Diagnosis:






    # List resources still in CREATE_IN_PROGRESS (likely waiting on a signal)
    aws cloudformation describe-stack-resources --stack-name prod-stack \
        --query "StackResources[?ResourceStatus=='CREATE_IN_PROGRESS']"

    # For EC2, check user data logs on the instance
    cat /var/log/cloud-init-output.log







    Fix:


    For EC2 with user data:






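    #!/bin/bash
    # Do your setup here

    # Signal success or failure. $? holds only the exit status of the last
    # setup command, and ${AWS::StackName} / ${AWS::Region} are filled in
    # when the user data is wrapped in Fn::Sub.
    /opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} \
        --resource WebServerInstance --region ${AWS::Region}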







    For custom resources, implement timeout handling:






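    def handler(event, context):
        # send_response is a helper (not shown here) that PUTs the result
        # to the pre-signed URL in event["ResponseURL"].
        try:
            # Do work
            send_response(event, context, "SUCCESS")
        except Exception as e:
            # CRITICAL: Always send a response
            send_response(event, context, "FAILED", reason=str(e))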
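The handler above depends on a `send_response` helper and does not yet cover the case where the Lambda itself runs out of time. The following is a minimal sketch of both pieces using only the standard library. `build_response`, `do_work`, and the two-second safety margin are assumptions made for illustration; AWS also ships a `cfnresponse` module that covers the response part for inline functions.

```python
import json
import threading
import urllib.request


def build_response(event, context, status, reason=""):
    """Build the JSON body CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or f"See log stream {context.log_stream_name}",
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }


def send_response(event, context, status, reason=""):
    """PUT the response to the pre-signed URL supplied in the event."""
    body = json.dumps(build_response(event, context, status, reason)).encode()
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        headers={"Content-Type": "", "Content-Length": str(len(body))},
    )
    urllib.request.urlopen(request, timeout=10)


def do_work(event):
    """Placeholder for the real provisioning logic (hypothetical)."""


def handler(event, context):
    # Send FAILED shortly before Lambda would kill the function, so the
    # stack fails fast instead of waiting for the response to time out.
    seconds_left = context.get_remaining_time_in_millis() / 1000 - 2
    watchdog = threading.Timer(
        max(seconds_left, 0),
        send_response,
        args=(event, context, "FAILED", "Lambda about to time out"),
    )
    watchdog.start()
    try:
        do_work(event)
        send_response(event, context, "SUCCESS")
    except Exception as exc:
        send_response(event, context, "FAILED", reason=str(exc))
    finally:
        watchdog.cancel()
```

A production version would also guard against the watchdog and the main path both sending a response in the same invocation.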







    Prevention:

    Always test CreationPolicy paths with --disable-rollback first so you can inspect failed resources without automatic cleanup.



    Failure 3: Nested stack update fails, root cause invisible

    Symptoms:
    • Parent stack update fails
    • Error message: "Nested stack failed to update"
    • No details about why the nested stack failed


    Root cause:

    CloudFormation does not bubble up nested stack failure details to the parent. You have to check each nested stack individually.


    Diagnosis:






    # List nested stacks from the parent
    aws cloudformation list-stack-resources --stack-name parent-stack \
        --query "StackResourceSummaries[?ResourceType=='AWS::CloudFormation::Stack'].[PhysicalResourceId]"

    # Check each nested stack's events
    aws cloudformation describe-stack-events --stack-name nested-stack-1







    Fix:


    Add explicit validation before updating parents:






    # Validate nested template before updating parent
    aws cloudformation validate-template --template-body file://nested.yaml

    # Check nested stack for drift
    aws cloudformation detect-stack-drift --stack-name nested-stack-1







    Prevention:

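    Minimize nested stack depth (2 levels maximum). For complex dependencies, split the template into separate stacks that share values through exported outputs. StackSets are the better fit when the same stack has to be deployed to many accounts or regions.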
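Checking each nested stack by hand gets tedious once there are more than a few. The sketch below shows one way the walk could be scripted. It assumes a boto3 CloudFormation client is passed in as `cfn`, reads only the first page of events and resources per stack, and uses a function name chosen for illustration.

```python
def find_failures(cfn, stack_name):
    """Collect failed events for a stack and, recursively, its nested stacks."""
    failures = []
    events = cfn.describe_stack_events(StackName=stack_name)["StackEvents"]
    for event in events:
        if event["ResourceStatus"].endswith("_FAILED"):
            failures.append(
                {
                    "stack": stack_name,
                    "resource": event["LogicalResourceId"],
                    "status": event["ResourceStatus"],
                    "reason": event.get("ResourceStatusReason", ""),
                }
            )
    resources = cfn.describe_stack_resources(StackName=stack_name)["StackResources"]
    for res in resources:
        is_nested = res["ResourceType"] == "AWS::CloudFormation::Stack"
        if is_nested and res.get("PhysicalResourceId"):
            # The physical ID of a nested stack is its stack ARN.
            failures.extend(find_failures(cfn, res["PhysicalResourceId"]))
    return failures
```

With boto3 installed, a call such as `find_failures(boto3.client("cloudformation"), "parent-stack")` returns the parent's generic failure together with the underlying reasons reported by the nested stacks.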



    Part 2: Drift and configuration mismatch

    Failure 4: Production resource changed outside CloudFormation

    Symptoms:
    • Security group rule allows unexpected traffic
    • S3 bucket becomes public
    • RDS backup retention period changes
    • No corresponding change in Git history


    Root cause:

    Someone modified a resource directly in the AWS console or via CLI, bypassing CloudFormation.


    Diagnosis:






    # Detect drift on a stack (returns a StackDriftDetectionId)
    aws cloudformation detect-stack-drift --stack-name prod-web

    # Check whether detection has finished and the overall drift status
    aws cloudformation describe-stack-drift-detection-status \
        --stack-drift-detection-id <id>

    # List drifted resources
    aws cloudformation describe-stack-resource-drifts --stack-name prod-web \
        --stack-resource-drift-status-filters MODIFIED DELETED







    Fix — manual:






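    # Re-import the resource with a change set of type IMPORT. The resource must
    # first be removed from the stack with DeletionPolicy: Retain, and
    # template.yaml must describe its current configuration.
    aws cloudformation create-change-set --stack-name prod-web \
        --change-set-name import-data-bucket --change-set-type IMPORT \
        --template-body file://template.yaml \
        --resources-to-import '[{"ResourceType":"AWS::S3::Bucket","LogicalResourceId":"DataBucket","ResourceIdentifier":{"BucketName":"<bucket-name>"}}]'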







    Fix — automated:






    # EventBridge (CloudWatch Events) rule to run drift detection weekly
    DriftDetectionRule:
      Type: AWS::Events::Rule
      Properties:
        ScheduleExpression: "cron(0 12 ? * MON *)" # Every Monday at 12:00 UTC
        Targets:
          - Arn: !GetAtt DriftLambda.Arn
            Id: DriftLambdaTarget
            Input: '{"stackName": "prod-web"}'







    Prevention:
    • Enforce IAM policies that prevent resource modification outside CloudFormation
    • Enable drift detection on all production stacks
    • Review drift reports weekly
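The scheduled rule in the automated fix needs a function behind it. Here is a sketch of what that function might look like, assuming the event carries the `stackName` key and that boto3 is available in the Lambda runtime. `check_drift` and its polling defaults are illustrative, and alerting on the result is left out.

```python
import time


def check_drift(cfn, stack_name, poll_seconds=5, max_polls=60):
    """Run drift detection and return the logical IDs of drifted resources."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Drift detection did not finish for {stack_name}")
    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )["StackResourceDrifts"]
    return [drift["LogicalResourceId"] for drift in drifts]


def handler(event, context):
    import boto3  # imported lazily so check_drift can be tested without it

    stack_name = event["stackName"]
    drifted = check_drift(boto3.client("cloudformation"), stack_name)
    return {"stackName": stack_name, "drifted": drifted}
```

The function's role needs permission for the three CloudFormation calls plus read access to the resource types in the stack, since drift detection reads each resource's live configuration.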





    Failure 5: Stack drift causes deletion protection to block cleanup

    Symptoms:
    • Trying to delete a stack
    • Error: "Cannot delete stack because resource X has deletion protection"
    • That resource was not supposed to have deletion protection


    Root cause:

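    Someone enabled deletion protection directly on an RDS database, so CloudFormation doesn't know about it. S3 buckets fail in a similar way for a different reason: they have no deletion protection setting, but a bucket that still contains objects cannot be deleted.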


    Diagnosis:






    # Find which resource is blocking deletion
    aws cloudformation describe-stack-resources --stack-name prod-stack \
    --query "StackResources[?ResourceStatus=='DELETE_FAILED']"







    Fix:






    # Remove deletion protection from the resource directly
    aws rds modify-db-instance --db-instance-identifier mydb \
    --no-deletion-protection

    # Or for S3: there is no deletion protection flag; empty the bucket instead
    # (versioned buckets also need old versions and delete markers removed)
    aws s3 rm s3://mybucket --recursive

    # Retry stack deletion
    aws cloudformation delete-stack --stack-name prod-stack







    Prevention:

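    Declare protection in the template rather than in the console. Use DeletionPolicy: Retain for stateful resources, and if you also want deletion protection, set it through the resource's own property (for example DeletionProtection on AWS::RDS::DBInstance). CloudFormation understands settings declared in the template; it cannot see protection toggled outside it.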



    Part 3: Rollback failures

    Failure 6: Rollback fails because resource won't delete

    Symptoms:
    • Stack update fails
    • Rollback starts
    • Rollback fails
    • Stack stuck in UPDATE_ROLLBACK_FAILED (ROLLBACK_FAILED if the failure happened during stack creation)


    Root cause:

    A resource created during the failed update cannot be deleted. Common reasons:
    • S3 bucket has versioning enabled and contains objects
    • RDS has deletion protection enabled
    • Network interface is still attached
    • Custom resource performed external actions


    Diagnosis:






    # Find which resource caused rollback failure
    aws cloudformation describe-stack-events --stack-name prod-stack \
    --query "StackEvents[?ResourceStatus=='DELETE_FAILED']"







    Fix - for S3:






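    # Empty the bucket first (this removes current object versions only)
    aws s3 rm s3://bucket-name --recursive

    # Suspend versioning so no new versions are created. Existing versions and
    # delete markers must still be removed before the bucket can be deleted.
    aws s3api put-bucket-versioning --bucket bucket-name \
        --versioning-configuration Status=Suspended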

    # Retry stack deletion
    aws cloudformation delete-stack --stack-name prod-stack







    Fix - for RDS:






    # Disable deletion protection
    aws rds modify-db-instance --db-instance-identifier mydb \
    --no-deletion-protection

    # Skip final snapshot if you want fast cleanup
    aws rds delete-db-instance --db-instance-identifier mydb \
    --skip-final-snapshot







    Prevention:

    Design stateful resources with DeletionPolicy: Retain in production. Accept that you will clean them manually. Do not let stateful resources block automated rollbacks.
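For the S3 case, a versioned bucket stays undeletable until every object version and delete marker is gone. A short sketch of how that purge could be done follows, assuming a boto3 S3 client is passed in as `s3`. `purge_bucket` is an illustrative name, and the operation is irreversible, so it belongs in cleanup tooling rather than in a pipeline.

```python
def purge_bucket(s3, bucket):
    """Delete every object version and delete marker; return how many were removed."""
    removed = 0
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        keys = [{"Key": e["Key"], "VersionId": e["VersionId"]} for e in entries]
        if keys:
            # Each page holds at most 1000 entries, which matches the
            # per-request limit of delete_objects.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys, "Quiet": True})
            removed += len(keys)
    return removed
```

After the purge, retrying the stack deletion or rollback should no longer fail on the bucket.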



    Failure 7: Rollback takes too long, extending downtime

    Symptoms:
    • Stack update fails at minute 15
    • Rollback takes another 20 minutes
    • Total downtime: 35+ minutes


    Root cause:

    Resources with DeletionPolicy: Snapshot take time to create snapshots during rollback. RDS snapshots can take 10-20 minutes. EBS snapshots add minutes per volume.


    Diagnosis:






    # Check which resource is taking time during rollback
    aws cloudformation describe-stack-events --stack-name prod-stack \
    --query "StackEvents[?contains(ResourceStatus, 'DELETE')]"







    Fix during incident:

    You have limited options once rollback starts. The fastest path is often to let it finish, even if slow.


    Prevention:

    Separate stateful resources (databases, buckets) into their own stack. This stack changes rarely. Application stacks change frequently but contain no stateful resources.






    # Stack 1: Data (deploys monthly, rollback takes time but happens rarely)
    DatabaseStack:
      Type: AWS::RDS::DBInstance
      DeletionPolicy: Snapshot

    # Stack 2: Application (deploys daily, rollback is fast)
    AppStack:
      Type: AWS::AutoScaling::AutoScalingGroup
      DeletionPolicy: Delete # No snapshot, instant deletion







    When AppStack fails, its rollback does not wait on database snapshots, and the database is untouched.





    Part 4: IAM and permission failures

    Failure 8: "User is not authorized to perform cloudformation:CreateStack"

    Symptoms:
    • CI/CD pipeline fails
    • Error message about missing permission
    • Same permissions worked yesterday


    Root cause:

    IAM policies changed. A condition was added. A permission was removed. The role used by CI/CD no longer has required access.


    Diagnosis:






    # Confirm which principal the pipeline is actually running as
    aws sts get-caller-identity

    # Check effective permissions for the role
    aws iam simulate-principal-policy \
    --policy-source-arn arn:aws:iam::123456789012:role/ci-cd-role \
    --action-names cloudformation:CreateStack \
    --resource-arns arn:aws:cloudformation:us-east-1:123456789012:stack/*







    Fix:

    Add the missing permission to the CI/CD role:






    {
      "Effect": "Allow",
      "Action": "cloudformation:CreateStack",
      "Resource": "arn:aws:cloudformation:region:account:stack/*"
    }







    Prevention:

    Use IAM permissions boundaries and permission guardrails. Test CI/CD role permissions in a staging account before deploying to production.
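Permission regressions like this one are easier to catch before a deploy than during it. The sketch below shows one way to wrap the policy simulator in a pre-deploy check. It assumes a boto3 IAM client is passed in as `iam`, the helper name is illustrative, and it does not follow pagination for very long action lists.

```python
def denied_actions(iam, role_arn, actions, resource_arn="*"):
    """Return the actions the role is not allowed to perform on resource_arn."""
    response = iam.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=list(actions),
        ResourceArns=[resource_arn],
    )
    return [
        result["EvalActionName"]
        for result in response["EvaluationResults"]
        # EvalDecision is "allowed", "explicitDeny", or "implicitDeny".
        if result["EvalDecision"] != "allowed"
    ]
```

A pipeline step could call this with the actions the deployment needs (for example `cloudformation:CreateStack` and `cloudformation:UpdateStack`) and stop when the returned list is not empty.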



    Failure 9: Cross-account stack operations fail

    Symptoms:
    • Stack in Account A tries to create a resource in Account B
    • Error: "Access denied" or "Role does not exist"


    Root cause:

    CloudFormation does not natively support cross-account resource creation. You need IAM roles in both accounts with trust relationships.


    Fix — setup cross-account role in target account:






    # In Account B (target)
    CrossAccountRole:
      Type: AWS::IAM::Role
      Properties:
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Principal:
                AWS: arn:aws:iam::AccountA:root
              Action: sts:AssumeRole
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/AdministratorAccess # Scope down in production







    Fix — assume role from source account:






    # In Account A (source)
    CustomResource:
      Type: Custom::CrossAccount
      Properties:
        ServiceToken: !GetAtt CrossAccountLambda.Arn
        TargetRoleArn: arn:aws:iam::AccountB:role/CrossAccountRole







    Prevention:

    Design stacks to be account-specific. Use AWS Organizations and StackSets for multi-account deployments instead of cross-account resource references.
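If you do go the custom resource route, the Lambda behind it has to assume the target role before it can touch Account B. A minimal sketch of that step is shown here, assuming boto3 is available in the Lambda runtime. The helper names and the session name are illustrative, and the custom resource response handling is omitted.

```python
def assume_target_role(sts, role_arn, session_name="cfn-cross-account"):
    """Assume the role and return credentials as boto3.Session keyword arguments."""
    creds = sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def session_for_target(event):
    """Build a boto3 session in the target account from the custom resource event."""
    import boto3  # imported lazily so assume_target_role can be tested without it

    role_arn = event["ResourceProperties"]["TargetRoleArn"]
    return boto3.Session(**assume_target_role(boto3.client("sts"), role_arn))
```

Inside the handler, `session_for_target(event).client("s3")` (or any other service) would then operate in Account B with the permissions of CrossAccountRole.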



    Part 5: Template validation failures that only appear at deploy time

    Failure 10: Template validates but deployment fails

    Symptoms:






    aws cloudformation validate-template --template-body file://template.yaml
    # Succeeds: prints the template's parameters and description, no error







    But deployment fails with: "Encountered unsupported property" or "Resource handler returned invalid request"


    Root cause:

    validate-template checks only that the template is syntactically valid JSON or YAML and structurally well formed. It does not check:
    • Property values and combinations that a resource rejects (e.g., certain combinations of SourceSecurityGroupId and CidrIp)
    • Region-specific limitations (some resources are not available in all regions)
    • Service limits (e.g., requesting 2000 IOPS when the limit is 1000)


    Diagnosis:


    Deploy with --disable-rollback to keep failed resources for inspection:






    aws cloudformation create-stack --stack-name test-stack \
    --template-body file://template.yaml \
    --disable-rollback







    Then examine the failed resource's status reason:






    aws cloudformation describe-stack-resources --stack-name test-stack \
    --query "StackResources[?ResourceStatus=='CREATE_FAILED']"







    Fix:

    Correct the specific property combination. Check region availability. Request service limit increases before deployment.


    Prevention:

    Test in a staging region first. Use cfn-lint in CI/CD — it catches property combination errors that validate-template misses.






    # Install cfn-lint
    pip install cfn-lint

    # Run locally before commit
    cfn-lint template.yaml










    Part 6: Change set failures

    Failure 11: Change set shows replacement when you expected modification

    Symptoms:
    • Change set indicates "Replacement" for a production resource
    • You expected an in-place modification
    • Replacement means downtime


    Root cause:

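    Certain property changes force replacement, and for some properties the outcome depends on the old and new values or on how the resource is configured. The "Update requires" entry for each property in the CloudFormation resource reference states which behavior applies.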


    Diagnosis:


    Check which property triggered replacement:






    aws cloudformation describe-change-set --stack-name prod-stack \
        --change-set-name my-change-set \
        --query "Changes[?ResourceChange.Replacement=='True']"







    Common properties that force replacement:


    Resource type            Properties that force replacement
    AWS::RDS::DBInstance     DBInstanceIdentifier, DBSubnetGroupName
    AWS::EC2::Instance       ImageId, SubnetId, InstanceType (instance store-backed only)
    AWS::S3::Bucket          BucketName
    AWS::Lambda::Function    FunctionName


    Fix:
    • Accept the replacement and plan for downtime
    • Use blue/green deployment for zero-downtime replacement
    • Modify the resource directly in AWS console (not recommended for IaC)


    Prevention:

    Always review change sets in staging before production. Know which properties cause replacement for your critical resources.
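Reviewing change sets by eye works until someone skims past a replacement. The sketch below shows how a pipeline gate might flag them automatically. It assumes the input is the dict returned by boto3's `describe_change_set`, and the function name is illustrative.

```python
def risky_changes(change_set):
    """List resources a change set will, or may, replace."""
    risky = []
    for change in change_set.get("Changes", []):
        rc = change.get("ResourceChange", {})
        # Replacement is the string "True", "False", or "Conditional".
        if rc.get("Replacement") in ("True", "Conditional"):
            properties = sorted(
                {
                    detail["Target"]["Name"]
                    for detail in rc.get("Details", [])
                    if detail.get("Target", {}).get("Name")
                }
            )
            risky.append(
                (rc["LogicalResourceId"], rc["ResourceType"], rc["Replacement"], properties)
            )
    return risky
```

A gate that requires manual approval whenever `risky_changes` returns anything for a production stack keeps surprise replacements out of automated deploys.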



    Failure 12: Change set execution fails because of update conflicts

    Symptoms:
    • Change set creates successfully
    • execute-change-set fails
    • Error: "Cannot update stack because another update is in progress"


    Root cause:

    Another process (CI/CD pipeline, another engineer, scheduled automation) started a stack update while your change set was waiting for execution.


    Diagnosis:






    # Check current stack status
    aws cloudformation describe-stacks --stack-name prod-stack \
    --query "Stacks[0].StackStatus"

    # Status like UPDATE_IN_PROGRESS or ROLLBACK_IN_PROGRESS means locked







    Fix:

    Wait for the other update to complete. Then create a new change set based on the latest stack state. Do not execute the old change set — it's now out of date.






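    # Delete old change set
    aws cloudformation delete-change-set --stack-name prod-stack \
        --change-set-name old-change-set

    # Create new change set against current stack
    aws cloudformation create-change-set --stack-name prod-stack \
        --change-set-name new-change-set --template-body file://template.yaml

    # Execute fresh change set once it reaches CREATE_COMPLETE
    aws cloudformation wait change-set-create-complete --stack-name prod-stack \
        --change-set-name new-change-set
    aws cloudformation execute-change-set --stack-name prod-stack \
        --change-set-name new-change-set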







    Prevention:
    • Implement stack-level locking via S3 condition keys or custom resources
    • Coordinate CI/CD pipelines to never deploy simultaneously to the same stack
    • Use separate stacks for separate environments
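Short of real locking, a pipeline can at least refuse to start while the stack is busy. The following is a sketch of such a pre-flight check, assuming a boto3 CloudFormation client is passed in as `cfn`. The names and polling defaults are illustrative, and a race between the check and the deploy is still possible, so this reduces conflicts rather than eliminating them.

```python
import time


def is_busy(status):
    """True while CloudFormation is still working on the stack."""
    return status.endswith("_IN_PROGRESS")


def wait_until_idle(cfn, stack_name, poll_seconds=15, max_polls=120):
    """Poll the stack until no operation is in progress and return its status."""
    for _ in range(max_polls):
        status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
        if not is_busy(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{stack_name} is still busy after {max_polls} polls")
```

Creating the change set only after `wait_until_idle` returns means it is built against the stack's latest state.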





    Part 7: Performance and quota failures

    Failure 13: Stack deployment times out due to API rate limiting

    Symptoms:
    • Stack deployment slows dramatically after hundreds of resources
    • Error: "Rate exceeded" for various AWS APIs
    • Some resources take 5-10 retries before succeeding


    Root cause:

    CloudFormation makes many API calls to create resources. AWS APIs have rate limits. Large stacks hit these limits.


    Diagnosis:






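    # Check CloudTrail for throttle errors (the error code sits inside the
    # event body, so filter the results client-side)
    aws cloudtrail lookup-events --max-items 1000 \
        --query "Events[?contains(CloudTrailEvent, 'ThrottlingException')].[EventTime,EventSource,EventName]"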







    Fix — immediate:

    Split the stack. CloudFormation allows up to 500 resources per stack, but large stacks deploy slowly and hit throttling more often; keeping each stack to around 200 resources is a reasonable working target.






    # List resources by type to see distribution
    aws cloudformation list-stack-resources --stack-name large-stack \
        --query "StackResourceSummaries[*].[ResourceType]" --output text | sort | uniq -c







    Fix — long term:

    Design modular stacks:






    network-stack.yaml (VPC, subnets, route tables)
    data-stack.yaml (RDS, ElastiCache, S3)
    compute-stack.yaml (ASG, launch templates)
    app-stack.yaml (Lambda, API Gateway)







    Prevention:

    Monitor stack creation time. If it exceeds 15 minutes for non-stateful resources, split the stack.



    Failure 14: Service quota exceeded during deployment

    Symptoms:
    • Deployment fails
    • Error: "You have reached your limit of X resources"


    Root cause:

    AWS account has default service limits. You're trying to create more resources than allowed.


    Common default quotas (most are adjustable; confirm current values in Service Quotas):
    • VPCs per region: 5
    • VPC security groups per region: 2,500
    • RDS instances per region: 40
    • Lambda concurrent executions: 1000


    Diagnosis:






    # Look up the applied quota value (L-12345678 is a placeholder quota code)
    aws service-quotas get-service-quota \
        --service-code ec2 --quota-code L-12345678

    # List all quotas for a service, including their quota codes
    aws service-quotas list-service-quotas --service-code rds







    Fix — immediate:

    Request quota increase from AWS Support or via Service Quotas API:






    aws service-quotas request-service-quota-increase \
    --service-code ec2 --quota-code L-12345678 \
    --desired-value 100







    Fix — tactical:

    Reduce resource count in the current deployment. Use smaller instance sizes. Share resources across stacks.


    Prevention:

    Include quota checks in your CI/CD pipeline before deployment:






    # Script to check quotas before deploying
    python scripts/check_quotas.py --template template.yaml
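The article does not show what `check_quotas.py` contains, so the following is only a sketch of the core comparison such a script might perform. It assumes the template is already parsed into a dict, that current usage and limits are supplied by the caller (for example from the Service Quotas API), and that quotas are tracked per CloudFormation resource type, which is a simplification.

```python
from collections import Counter


def resource_counts(template):
    """Count the resources in a parsed template by CloudFormation type."""
    return Counter(res["Type"] for res in template.get("Resources", {}).values())


def quota_violations(template, existing, limits):
    """Return (type, projected_total, limit) for every quota the deploy would exceed."""
    counts = resource_counts(template)
    problems = []
    for resource_type, limit in limits.items():
        projected = existing.get(resource_type, 0) + counts.get(resource_type, 0)
        if projected > limit:
            problems.append((resource_type, projected, limit))
    return problems
```

For example, a template that adds two VPCs to a region already holding four would be reported against the default limit of five, before CloudFormation ever attempts the create.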










    Part 8: Troubleshooting workflow - where to start

    When a CloudFormation deployment fails, follow this workflow:


    Step 1: Get the raw error





    aws cloudformation describe-stack-events --stack-name prod-stack \
    --max-items 20 --query "StackEvents[?ResourceStatus=='CREATE_FAILED' || ResourceStatus=='UPDATE_FAILED']"







    Look for the ResourceStatusReason field. This is your primary clue.
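Once you have the ResourceStatusReason text, a small lookup can point you at the matching failure pattern from this article. The sketch below uses simple case-insensitive substring matching. The phrases and section names come from this article, real messages vary in wording, and the function name is illustrative, so treat a miss as "unknown" rather than "no problem".

```python
KNOWN_PATTERNS = [
    ("role does not exist", "IAM eventual consistency", "Part 1, Failure 1"),
    ("rate exceeded", "API throttling", "Part 7, Failure 13"),
    ("limit exceeded", "Service quota", "Part 7, Failure 14"),
    ("deletion protection", "Rollback blocked", "Part 3, Failure 6"),
    ("another update", "Concurrent update", "Part 6, Failure 12"),
]


def classify(reason):
    """Map a ResourceStatusReason to (cause, section), or None if unrecognized."""
    text = reason.lower()
    for phrase, cause, section in KNOWN_PATTERNS:
        if phrase in text:
            return cause, section
    return None
```

Feeding every failed event's reason through `classify` in a runbook script gives on-call engineers a starting section instead of a raw error string.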


    Step 2: Identify the failed resource

    The error message tells you which logical resource failed. Find its type and properties in your template.


    Step 3: Check if it's a known failure pattern

    Error message                   Cause                       Section
    "Role does not exist"           IAM eventual consistency    Part 1, Failure 1
    "Rate exceeded"                 API throttling              Part 7, Failure 13
    "Limit exceeded"                Service quota               Part 7, Failure 14
    "Deletion protection"           Rollback blocked            Part 3, Failure 6
    "Another update in progress"    Concurrent update           Part 6, Failure 12


    Step 4: Deploy with --disable-rollback for debugging





    aws cloudformation create-stack --stack-name debug-stack \
    --template-body file://template.yaml \
    --disable-rollback







    Failed resources remain so you can inspect them directly.


    Step 5: Inspect the failed resource directly

    For EC2:






    aws ec2 describe-instances --instance-ids i-12345
    ssh ec2-user@instance-ip # Check logs







    For Lambda:






    aws logs describe-log-groups --log-group-name-prefix /aws/lambda/my-function
    # Read the most recently active log stream
    aws logs get-log-events --log-group-name /aws/lambda/my-function \
        --log-stream-name "$(aws logs describe-log-streams --log-group-name /aws/lambda/my-function --order-by LastEventTime --descending --limit 1 --query "logStreams[0].logStreamName" --output text)"







    For RDS:






    aws rds describe-db-instances --db-instance-identifier mydb
    aws rds describe-events --source-identifier mydb --source-type db-instance







    Step 6: Fix, then continue

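    If the stack is stuck in a failed rollback, you have two options. A stack in ROLLBACK_FAILED (failed create) can only be deleted; a stack in UPDATE_ROLLBACK_FAILED (failed update) can also continue its rollback.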


    Option A: Delete the failed stack and recreate






    aws cloudformation delete-stack --stack-name prod-stack
    # Wait for deletion
    aws cloudformation wait stack-delete-complete --stack-name prod-stack
    aws cloudformation create-stack --stack-name prod-stack --template-body file://template.yaml







    Option B: Continue rolling back after fixing the blocker






    # Fix the blocking resource (empty S3 bucket, disable deletion protection)
    # Then retry the rollback (applies to stacks in UPDATE_ROLLBACK_FAILED)
    aws cloudformation continue-update-rollback --stack-name prod-stack










    Production CloudFormation checklist

    Before deploying to production, verify:


    Drift detection
    • [ ] Enabled on all production stacks
    • [ ] Weekly automated drift check configured
    • [ ] Alerts configured for drift findings


    Rollback strategy
    • [ ] Stateful resources have DeletionPolicy: Retain or Snapshot
    • [ ] Stateless resources have DeletionPolicy: Delete
    • [ ] Stateful and stateless resources in separate stacks


    IAM and security
    • [ ] No "Action": "*" in policies
    • [ ] Secrets use {{resolve:secretsmanager:...}} not parameters
    • [ ] CI/CD role has minimal required permissions
    • [ ] cfn-guard or cfn-lint running in CI


    Failure handling
    • [ ] CreationPolicy includes timeout and signal handling
    • [ ] Custom resources always send SUCCESS or FAILED responses
    • [ ] Nested stack depth ≤ 2


    Performance
    • [ ] No stack exceeds 200 resources
    • [ ] No stack consistently deploys longer than 15 minutes
    • [ ] Service quotas checked before deployment


    Troubleshooting readiness
    • [ ] describe-stack-events command documented in runbook
    • [ ] Access to failed resource logs (EC2, Lambda, RDS) available
    • [ ] --disable-rollback used in staging deployments





    Written by Onyedikachi Obidiegwu | Cloud Security Engineer



