Infrastructure Archaeology: Diagnosing Multi-Layer CI/CD Failures

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5168

    #1


    The Pattern


    Modern cloud infrastructure often evolves through incremental additions. A team starts with basic CI/CD, adds Terraform for IaC, integrates security scanning, and sets up monitoring. Each piece works in isolation, but the system as a whole becomes fragile.


    Here's a failure pattern I've observed across multiple production GCP environments: what appears to be "a few broken configs" is actually a multi-layer architectural problem spanning Docker, Terraform, GitHub Actions, and cloud-native security tooling.


    Let's dissect it.


    DISCLAIMER: All code examples, project names, domains, and configurations in this article are sanitized examples for educational purposes. No real client data or proprietary information is exposed. This analysis is based on publicly available documentation and common infrastructure patterns.




    The Symptom List


    In this pattern, teams typically surface a cluster of related failures:


    Build & Container Issues:

    1. Docker multi-stage build misconfigurations — CI/CD pipelines reference non-existent stage names in Dockerfiles
    2. Duplicate or conflicting CMD instructions — containers exhibit unpredictable startup behavior
    3. Image scanning pipeline breaks — security tools block pushes but jobs still succeed


    Infrastructure-as-Code Failures:

    1. Terraform module reference errors — output files reference modules that don't exist in the configuration
    2. Variable interface mismatches — calling code passes variables that modules don't accept
    3. Wrong execution context — CI runs IaC commands in incorrect directories
    4. Provider version drift — different environments use incompatible provider versions


    CI/CD Architecture Gaps:

    1. Missing deployment automation — builds succeed but nothing triggers actual deployments
    2. No quality gates — tests and builds run in parallel; failures don't block progression
    3. Hardcoded deployment paths — only specific branches trigger deploys; others require manual intervention
    4. Configuration drift — production URLs and domains missing from automation config


    Security Tooling Integration Conflicts:

    1. Overlapping vulnerability detection — Trivy, GCP Container Analysis, and Security Command Center all scan the same images
    2. Runtime security false positives — Falco rules trigger on legitimate Cloud Run startup syscalls
    3. Fragmented security reporting — findings appear in multiple systems with no single source of truth
    4. Policy enforcement gaps — security scans run but don't actually block deployments


    Tech stack representative of this pattern: GitHub Actions, GCP Cloud Run, Artifact Registry, Terraform, Firebase Hosting, containerized microservices with pnpm/npm monorepo structure.


    Seems like a lot of small fixes, right? The reality is more complex.




    What I Actually Found: The 3-Layer Problem


    These aren't isolated bugs. They're symptoms of failures at three distinct levels.


    Layer 1: The Obvious (Syntax & Configuration Errors)


    These are the errors you see immediately when you run the tools:


    Docker Target Mismatch:






    # Dockerfile declares:
    FROM node:20-alpine AS runner

    # GitHub Action requests:
    with:
      target: app # ❌ Stage "app" doesn't exist







    Terraform Module Reference:






    # outputs.tf tries to reference:
    output "api_url" {
      value = module.cloud_run_api.service_url # ❌ Module doesn't exist
    }

    # main.tf actually has:
    module "api_service" { # Different name!
      source = "../../modules/cloud-run"
    }







    Variable Name Mismatch:






    # envs/prod/main.tf sends:
    module "api" {
      service_name = "api-prod" # ❌ Module doesn't accept this
    }

    # modules/cloud-run/variables.tf expects:
    variable "name" { # Different variable!
      type = string
    }







    These are language and consistency errors. Terraform requires that any resource or module referenced in output files be explicitly declared in the active configuration. When you refactor and change module names in main.tf but forget to update outputs.tf, you get this.


    The fix? Run terraform validate in an initialized directory (terraform init -backend=false is enough). It catches these immediately, without connecting to the cloud or touching remote state.


    Layer 2: Platform Changes (Hidden Causes)


    This is where it gets interesting. Some failures aren't in the code — they're in how GCP's platform has evolved.


    GCP Service Account Permission Changes:


    GCP has changed the default service account behavior for Cloud Build. What used to work automatically can now fail because the build service account no longer has default permissions to write logs or read from Artifact Registry.


    The missing piece: the iam.serviceAccounts.actAs permission (typically granted through roles/iam.serviceAccountUser), which the deploying identity needs in order to attach the runtime service account to a service.


    Organization Policy Restrictions:


    That "Firebase region conflict" isn't a typo in your Terraform. It's a collision with constraints/gcp.resourceLocations — an organization policy that blocks deployments to certain regions, even if your Terraform syntax is perfect.


    VPC Service Controls:


    If the project sits inside a VPC Service Controls perimeter, Cloud Run deployments can fail with confusing 403/404 errors that never mention the perimeter. The perimeter blocks communication between Google services, such as the Cloud Run service agent trying to read images from Artifact Registry.


    Security Tooling Conflicts:


    When security tools are added incrementally, each solving a specific problem in isolation, they create overlapping responsibilities and contradictory enforcement policies.


    A typical pattern:
    • Trivy is added to CI to scan container images before push
    • Falco is added to monitor runtime behavior in Cloud Run
    • GCP Container Analysis API scans images automatically on push
      to Artifact Registry
    • Security Command Center aggregates findings across the project


    Each tool works. The integration doesn't.


    The failure cascade:

    1. Trivy finds a CVE and is configured to block the push
    2. The GitHub Action reports success anyway (exit code not wired correctly)
    3. Image gets pushed to Artifact Registry
    4. Container Analysis API scans the same image 10 minutes later
    5. Falco triggers alerts on normal Cloud Run startup syscalls
      (false positive)
    6. Security Command Center reports the same CVE 3 hours later
    7. Three different alerting systems fire
    8. No one knows which finding to trust or act on first


    Root cause: No centralized security policy. Each tool was added without defining ownership, enforcement boundaries, or a single source of truth for findings.


    The hidden cost: Security tools that don't actually gate deployments give a false sense of protection. The pipeline feels secure. It isn't.
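
    One way to move toward a single source of truth is to normalize findings from every scanner into one record keyed by vulnerability and image. The sketch below only illustrates that idea; it is not any tool's export format. The field names (cve, image_digest, source), the source labels, and the placeholder CVE IDs and digest are all assumptions made for the example.

```python
from collections import defaultdict


def deduplicate(findings):
    """Merge findings that describe the same CVE on the same image.

    Returns {(cve, image_digest): sorted list of sources that reported it}.
    """
    merged = defaultdict(set)
    for finding in findings:
        key = (finding["cve"], finding["image_digest"])
        merged[key].add(finding["source"])
    return {key: sorted(sources) for key, sources in merged.items()}


# Hypothetical findings: one CVE reported by three systems, plus one unique finding.
FINDINGS = [
    {"cve": "CVE-0000-0001", "image_digest": "sha256:aaa", "source": "trivy"},
    {"cve": "CVE-0000-0001", "image_digest": "sha256:aaa", "source": "container-analysis"},
    {"cve": "CVE-0000-0001", "image_digest": "sha256:aaa", "source": "scc"},
    {"cve": "CVE-0000-0002", "image_digest": "sha256:aaa", "source": "trivy"},
]

unique = deduplicate(FINDINGS)
for (cve, digest), sources in unique.items():
    print(cve, digest, "reported by", ", ".join(sources))
```

    Four raw alerts collapse into two actionable findings, and the list of sources shows which scanners overlap.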


    GCP Resource Name Limits:


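    Many GCP resource names are limited to 63 characters, and some resource types are stricter. If your Terraform generates names that exceed the limit (long prefixes like baseInstanceName), the name is either rejected or truncated somewhere along the way, and truncated names can collide with each other, causing duplicate name conflicts and deployment failures.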
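
    A pre-flight check for this is cheap to write. The sketch below assumes a 63-character limit and plain prefix truncation; both are simplifications, and the resource names are hypothetical. Check the limit for the specific resource type you generate names for.

```python
MAX_NAME_LENGTH = 63  # assumed limit; some resource types are stricter


def over_limit(names, limit=MAX_NAME_LENGTH):
    """Names that exceed the length limit."""
    return [name for name in names if len(name) > limit]


def truncation_collisions(names, limit=MAX_NAME_LENGTH):
    """Group names that become identical once cut to `limit` characters."""
    by_prefix = {}
    for name in names:
        by_prefix.setdefault(name[:limit], []).append(name)
    return {prefix: group for prefix, group in by_prefix.items() if len(group) > 1}


# Hypothetical generated names sharing a long prefix.
PREFIX = "acme-platform-production-us-central1-customer-facing-api-gateway"
NAMES = [PREFIX + "-blue", PREFIX + "-green", "api-staging"]

print(over_limit(NAMES))
print(truncation_collisions(NAMES))
```

    The distinguishing suffixes (-blue, -green) sit past the limit, so both names reduce to the same string.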


    These aren't bugs in your code. They're platform governance and technical constraints that interact badly with naive configurations.


    Layer 3: Architectural Debt (The Root Problem)


    The deepest layer isn't about syntax or permissions — it's about missing architecture.


    No CI/CD Gates:


    The build and CI workflows are decoupled. Tests can fail, but images still get built and pushed. There's no needs: dependency chain enforcing that tests pass before builds run.






    # What's happening:
    jobs:
      test:
        runs-on: ubuntu-latest
      build:
        runs-on: ubuntu-latest # ❌ Runs in parallel, doesn't wait for tests







    Wrong Directory Context:


    GitHub Actions runs terraform plan in the repository root instead of envs/staging/. Terraform is directory-dependent — without the right context, it validates an empty or incomplete configuration.


    Hardcoded Feature Branch:


    Only one deployment path works: a specific feature branch → staging. There's no development → staging automation, no main → production workflow. Everything else is manual.


    Missing Environment Variables:


    Production URLs and domains aren't defined anywhere in the automation. Cloud Run services deploy without knowing their actual domain mappings, leaving SSL certificates stuck in provisioning or external access failing with 404/502.


    This is a lifecycle orchestration failure. Someone built pieces that "worked" in isolation but never architected how they fit together.








    Why Fixing Order Matters


    You can't just "fix what's broken." Here's why sequence matters:


    Fix production Terraform first → Staging still broken, can't test changes


    Wire up CI gates first → Builds still fail, nothing to gate


    Add domain configs first → Deployments fail before they even reach the domain mapping step


    The order that works: fix build errors → then CI validation → then deployment automation → then configuration gaps


    Think of it like renovating a house: you can't install the roof if the foundation is cracked. You can't paint the walls if the plumbing leaks.


    The Remediation Strategy:


    Day 1-2: Fix blocking issues (foundation)


    Day 3-4: Wire up automation (plumbing)


    Day 5: Clean up medium issues (finishing touches)


    This bottom-up approach ensures each layer is stable before building on top of it.




    How to Actually Fix This


    Issue #1: Docker Target Mismatch


    Quick diagnosis:






    grep "AS " apps/api/Dockerfile # See what stage names actually exist
    grep "target:" .github/workflows/*.yml # See what CI requests







    The fix:






    # Option A: Fix the composite action (recommended)
    # .github/actions/build-push/action.yml
    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        target: runner # ✅ Match Dockerfile stage name











    # Option B: Fix the Dockerfile
    FROM node:20-alpine AS app # ✅ Match action target







    Why it works: Docker multi-stage builds use FROM ... AS to label stages. The --target flag tells Docker which stage to stop at. Mismatched names = build failure.
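
    The same comparison can run as a pre-commit or CI check. This is a minimal sketch that scans the text with regular expressions rather than parsing the Dockerfile or YAML properly, so treat it as a smoke test. The sample file contents (including the extra deps stage) are hypothetical.

```python
import re


def dockerfile_stages(dockerfile_text):
    """Stage names declared with `FROM <image> AS <name>` (flags like --platform allowed)."""
    pattern = re.compile(
        r"^\s*FROM\s+(?:--\S+\s+)*\S+\s+AS\s+(\S+)", re.IGNORECASE | re.MULTILINE
    )
    return set(pattern.findall(dockerfile_text))


def requested_targets(workflow_text):
    """Values of `target:` keys found in a workflow file (naive text scan)."""
    pattern = re.compile(r"^\s*target:\s*['\"]?([\w.-]+)", re.MULTILINE)
    return set(pattern.findall(workflow_text))


def missing_targets(dockerfile_text, workflow_text):
    """Targets the workflow asks for that the Dockerfile never declares."""
    return sorted(requested_targets(workflow_text) - dockerfile_stages(dockerfile_text))


DOCKERFILE = """\
FROM node:20-alpine AS deps
FROM node:20-alpine AS runner
"""
WORKFLOW = """\
with:
  target: app
"""

print(missing_targets(DOCKERFILE, WORKFLOW))  # ['app']
```

    An empty list means every requested target exists; anything else is a build that will fail at the --target step.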





    Issue #2: Staging Terraform Undefined Module


    Quick diagnosis:






    cd envs/staging
    grep -n "module\." outputs.tf # Find all module references
    grep -n 'module "' main.tf # Find all module declarations
    # Names must match exactly







    The fix:






    # outputs.tf (BEFORE)
    output "api_url" {
      value = module.cloud_run_api.service_url # ❌
    }

    # outputs.tf (AFTER)
    output "api_url" {
      value = module.api_service.service_url # ✅ Match actual module name
    }







    Validation:






    terraform init
    terraform validate # Must pass
    terraform plan # Should show changes, not errors







    Why it works: Terraform's output system requires module references to exist in the configuration. This is caught during the validation phase, which checks internal consistency without cloud access.
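
    For a quick look before terraform is even installed, the two grep commands can be folded into one comparison. This sketch is a regex scan and not an HCL parser (it would also match module references inside comments), so terraform validate remains the authoritative check. The sample strings mirror the snippets above.

```python
import re


def declared_modules(main_tf):
    """Module names declared as `module "<name>" {`."""
    return set(re.findall(r'^\s*module\s+"([^"]+)"', main_tf, re.MULTILINE))


def referenced_modules(outputs_tf):
    """Module names used in expressions like `module.<name>.<output>`."""
    return set(re.findall(r"\bmodule\.([A-Za-z_][\w-]*)", outputs_tf))


def undeclared_modules(main_tf, outputs_tf):
    """Modules that outputs refer to but the configuration never declares."""
    return sorted(referenced_modules(outputs_tf) - declared_modules(main_tf))


MAIN_TF = '''
module "api_service" {
  source = "../../modules/cloud-run"
}
'''
OUTPUTS_TF = '''
output "api_url" {
  value = module.cloud_run_api.service_url
}
'''

print(undeclared_modules(MAIN_TF, OUTPUTS_TF))  # ['cloud_run_api']
```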





    Issue #3: Production Variable Mismatch


    Quick diagnosis:






    # Check what the module expects
    cat modules/cloud-run/variables.tf

    # Check what production sends
    grep -A 10 'module "api"' envs/prod/main.tf







    The fix:






    # envs/prod/main.tf (BEFORE)
    module "api" {
      source         = "../../modules/cloud-run"
      service_name   = "api-prod" # ❌ Module doesn't have this variable
      container_port = 8080       # ❌
    }

    # envs/prod/main.tf (AFTER)
    module "api" {
      source = "../../modules/cloud-run"
      name   = "api-prod" # ✅ Match module's variables.tf
      port   = 8080       # ✅
    }







    Why it works: Terraform modules define a contract through variables.tf. The calling code must provide values that match these declared variables. Interface mismatches halt plan generation.
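
    The contract can be checked mechanically by comparing the arguments a module call passes against the variables the module declares. The sketch below assumes a flat module block (no nested maps or dynamic blocks) and reads it with regular expressions; the port variable is taken from the AFTER snippet above, not from a real variables.tf.

```python
import re

# Arguments Terraform itself interprets on a module block, not module variables.
META_ARGUMENTS = {"source", "version", "count", "for_each", "providers", "depends_on"}


def declared_variables(variables_tf):
    """Variable names declared as `variable "<name>" {`."""
    return set(re.findall(r'^\s*variable\s+"([^"]+)"', variables_tf, re.MULTILINE))


def passed_arguments(module_body):
    """Top-level `name = value` arguments inside a module block body."""
    names = re.findall(r"^\s*([A-Za-z_][\w-]*)\s*=", module_body, re.MULTILINE)
    return set(names) - META_ARGUMENTS


def unsupported_arguments(module_body, variables_tf):
    """Arguments the caller passes that the module does not declare."""
    return sorted(passed_arguments(module_body) - declared_variables(variables_tf))


VARIABLES_TF = '''
variable "name" {
  type = string
}
variable "port" {
  type = number
}
'''
MODULE_BODY = '''
  source         = "../../modules/cloud-run"
  service_name   = "api-prod"
  container_port = 8080
'''

print(unsupported_arguments(MODULE_BODY, VARIABLES_TF))  # ['container_port', 'service_name']
```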





    Issue #4: Wrong Directory in CI


    Quick diagnosis:






    # Check if workflow sets working directory
    grep -A 5 "defaults:" .github/workflows/terraform-ci.yml







    The fix:






    # .github/workflows/terraform-ci.yml (BEFORE)
    jobs:
      validate:
        runs-on: ubuntu-latest
        steps:
          - run: terraform init # ❌ Runs in repo root

    # .github/workflows/terraform-ci.yml (AFTER)
    jobs:
      validate:
        runs-on: ubuntu-latest
        defaults:
          run:
            working-directory: ./envs/staging # ✅ Set context
        steps:
          - run: terraform init # Now runs in correct directory







    Why it works: Terraform is context-dependent. Without explicit directory specification, commands run in $GITHUB_WORKSPACE (repo root), where no .tf files exist for the specific environment.
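
    A guard step can make the wrong-directory case fail loudly instead of validating nothing. The sketch below builds a throwaway directory tree to show the difference; the helper name and the layout are illustrative, and the check is deliberately non-recursive because Terraform only reads .tf files in the directory it runs in.

```python
import tempfile
from pathlib import Path


def has_terraform_config(directory):
    """True if the directory itself (not its subdirectories) contains a .tf file."""
    return any(Path(directory).glob("*.tf"))


# Recreate the layout from the article in a temporary directory.
with tempfile.TemporaryDirectory() as repo_root:
    env_dir = Path(repo_root) / "envs" / "staging"
    env_dir.mkdir(parents=True)
    (env_dir / "main.tf").write_text("terraform {}\n")

    root_ok = has_terraform_config(repo_root)
    env_ok = has_terraform_config(env_dir)

print("repo root has config:", root_ok)
print("envs/staging has config:", env_ok)
```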





    Issue #5-6: Missing Deployment Automation


    Create: .github/workflows/deploy-staging.yml






    name: Deploy to Staging

    on:
      push:
        branches:
          - development
        paths:
          - 'apps/**'
          - 'packages/**'

    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          # pnpm must be installed before setup-node can use cache: 'pnpm'.
          # The pnpm version is read from the packageManager field in package.json.
          - uses: pnpm/action-setup@v4
          - name: Setup Node
            uses: actions/setup-node@v4
            with:
              node-version: '20'
              cache: 'pnpm'
          - run: pnpm install
          - run: pnpm test
          - run: pnpm lint

      build:
        needs: test # ✅ Only runs if tests pass
        runs-on: ubuntu-latest
        permissions:
          contents: read
          id-token: write # Required for Workload Identity Federation
        steps:
          - uses: actions/checkout@v4
          - name: Auth to GCP
            uses: google-github-actions/auth@v2
            with:
              workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
              service_account: ${{ secrets.GCP_SA_EMAIL }}
          - name: Build API
            uses: ./.github/actions/build-push
            with:
              dockerfile: apps/api/Dockerfile
              image: us-central1-docker.pkg.dev/${{ secrets.GCP_PROJECT }}/images/api
              tag: staging-${{ github.sha }}
              build-target: runner # ✅ Fix for issue #1

      deploy:
        needs: build # ✅ Only runs if build succeeds
        runs-on: ubuntu-latest
        permissions:
          contents: read
          id-token: write # Required for Workload Identity Federation
        defaults:
          run:
            working-directory: ./envs/staging
        steps:
          - uses: actions/checkout@v4
          - name: Auth to GCP
            uses: google-github-actions/auth@v2
            with:
              workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
              service_account: ${{ secrets.GCP_SA_EMAIL }}
          - name: Setup Terraform
            uses: hashicorp/setup-terraform@v3
          - run: terraform init
          - run: terraform plan -var="image_tag=staging-${{ github.sha }}" -out=tfplan
          - run: terraform apply -auto-approve tfplan







    Why it works: The needs: keyword creates job dependencies. GitHub Actions won't run build until test succeeds, won't run deploy until build succeeds. This is the "gating" that was missing.





    Issue #7: CI Doesn't Gate Deployments


    Already solved in Issue #5-6. The key is the needs: chain:






    test → build → deploy







    Each step must complete successfully before the next begins.
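
    To confirm that a workflow really is gated, the needs: graph can be walked programmatically. The sketch below works on a plain dictionary shaped like the jobs: section of a workflow (loading the YAML is left out to keep it dependency-free), and the function name is my own. It reports every job that can start without the gate job succeeding first, directly or through a chain of needs:.

```python
def ungated_jobs(jobs, gate):
    """Return jobs that can start without `gate` succeeding first.

    `jobs` maps job name -> job definition, where `needs` may be a string,
    a list of job names, or absent (as in GitHub Actions workflows).
    """
    def depends_on_gate(job, seen=()):
        needs = jobs.get(job, {}).get("needs", [])
        if isinstance(needs, str):
            needs = [needs]
        for upstream in needs:
            if upstream == gate:
                return True
            # `seen` guards against cycles in a malformed workflow.
            if upstream not in seen and depends_on_gate(upstream, seen + (job,)):
                return True
        return False

    return sorted(name for name in jobs if name != gate and not depends_on_gate(name))


# The parallel layout from Layer 3 versus the chained layout from Issue #5-6.
BEFORE = {"test": {}, "build": {}}
AFTER = {"test": {}, "build": {"needs": "test"}, "deploy": {"needs": ["build"]}}

print(ungated_jobs(BEFORE, gate="test"))  # ['build']
print(ungated_jobs(AFTER, gate="test"))   # []
```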





    Issue #8: URL Configuration Gaps


    Create centralized config:






    # envs/staging/terraform.tfvars
    project_id = "myproject-staging"
    region     = "us-central1"

    domains = {
      api = "api-staging.myapp.com"
      web = "staging.myapp.com"
    }







    Use in module:






    # modules/cloud-run/main.tf
    resource "google_cloud_run_service" "service" {
      name     = var.name
      location = var.region

      template {
        spec {
          containers {
            image = var.image

            env {
              name  = "API_URL"
              value = "https://${var.api_domain}"
            }
            env {
              name  = "WEB_URL"
              value = "https://${var.web_domain}"
            }
          }
        }
      }
    }

    resource "google_cloud_run_domain_mapping" "domain" {
      location = var.region
      name     = var.custom_domain

      # The domain mapping resource requires a metadata block with the project as namespace
      metadata {
        namespace = var.project_id
      }

      spec {
        route_name = google_cloud_run_service.service.name
      }
    }







    Update GitHub Secrets:






    gh secret set STAGING_API_URL --body "https://api-staging.myapp.com"
    gh secret set STAGING_WEB_URL --body "https://staging.myapp.com"







    Why it works: Cloud Run requires domain validation and DNS configuration. Without these URLs in Terraform, the platform can't set up SSL certificates or route external traffic correctly.


    Issues #12-15: Security Tooling Integration Conflicts


    Quick diagnosis:






    # Check if Trivy actually fails the job on findings
    grep -A 10 "trivy" .github/workflows/*.yml
    # Look for: exit-code: '1' and severity threshold

    # Check for duplicate scanning
    grep -r "scan\|trivy\|falco\|vulnerability" .github/workflows/*.yml

    # Check Falco rules for Cloud Run compatibility
    grep -i "container\|syscall" falco-rules/custom-rules.yaml

    # Check if Container Analysis is enabled
    gcloud services list --enabled | grep containeranalysis







    The fix — Option A: GCP Native (simpler):


    Consolidate enforcement on GCP's built-in tooling (Binary Authorization) and remove redundant third-party tools, keeping a single CI scanner as the gate:






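    # .github/workflows/deploy-staging.yml
    jobs:
      security-scan:
        needs: build
        runs-on: ubuntu-latest
        permissions:
          contents: read
          security-events: write # Needed to upload SARIF results
        steps:
          - name: Scan image with Trivy
            uses: aquasecurity/trivy-action@master # Pin to a release tag or commit SHA in practice
            with:
              image-ref: ${{ env.IMAGE_TAG }} # Assumes IMAGE_TAG is defined at workflow level
              format: 'sarif'
              exit-code: '1' # ✅ Actually fails the job
              severity: 'CRITICAL,HIGH'
              output: 'trivy-results.sarif'

          # upload-sarif publishes to GitHub code scanning, not Security Command Center
          - name: Upload results to GitHub code scanning
            if: always() # Upload findings even when the scan step fails
            uses: github/codeql-action/upload-sarif@v3
            with:
              sarif_file: 'trivy-results.sarif'

      deploy:
        needs: security-scan # ✅ Deploy only if scan passes
        runs-on: ubuntu-latest
        steps: [...]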











    # modules/cloud-run/main.tf
    # Use GCP Binary Authorization instead of Falco for deploy-time enforcement
    resource "google_binary_authorization_policy" "policy" {
      project = var.project_id

      default_admission_rule {
        evaluation_mode  = "REQUIRE_ATTESTATION"
        enforcement_mode = "ENFORCED_BLOCK_AND_AUDIT_LOG"

        require_attestations_by = [
          google_binary_authorization_attestor.trivy_passed.name
        ]
      }
    }







    The fix — Option B: Trivy + Falco (more control):


    Keep both tools but define clear ownership boundaries:






    # Trivy owns: pre-deploy image scanning (CI gate)
    # Falco owns: runtime anomaly detection (post-deploy monitoring)
    # Security Command Center owns: compliance reporting (audit trail)
    # Container Analysis: disabled (redundant with Trivy)

    # .github/workflows/deploy-staging.yml
    jobs:
      scan:
        needs: build
        runs-on: ubuntu-latest
        steps:
          - name: Trivy scan
            uses: aquasecurity/trivy-action@master
            with:
              image-ref: ${{ env.IMAGE_TAG }}
              exit-code: '1' # ✅ Hard gate
              severity: 'CRITICAL'
              ignore-unfixed: true # Reduce noise

      deploy:
        needs: scan # ✅ Trivy must pass
        runs-on: ubuntu-latest
        steps: [...]











    # falco-rules/cloud-run-rules.yaml
    # Tune Falco to ignore Cloud Run startup behavior.
    # Lists and macros are defined before the rule that uses them.
    - list: cloud_run_allowed_binaries
      items: [node, python, java, nginx, sh, bash]

    # A "curl metadata" command line is never allowed (blocks SSRF attempts).
    # The comment sits outside the folded scalar so it is not parsed as part of the condition.
    - macro: cloud_run_allowed_process
      condition: >
        proc.name in (cloud_run_allowed_binaries)
        and not proc.cmdline contains "curl metadata"

    - rule: Unexpected process in container
      desc: Detect processes outside the Cloud Run allow-list at runtime
      condition: >
        spawned_process and container
        and not cloud_run_allowed_process
        and not container.image.repository contains "gcr.io/cloudrun"
      output: "Unexpected process %proc.name in %container.name"
      priority: WARNING







    Fix for Security Command Center duplicate findings:






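    # Stop automatic on-push scanning if Trivy is the chosen scanner (avoid duplicates).
    # On-push scanning is enabled through the Container Scanning API;
    # Container Analysis is the metadata store behind it.
    gcloud services disable containerscanning.googleapis.com

    # OR: Configure SCC to deduplicate findings
    # (verify these flags against your gcloud version before relying on them)
    gcloud scc settings update \
      --organization=YOUR_ORG_ID \
      --enable-asset-discovery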







    Why it works: Each security tool has a defined role with clear enforcement boundaries. Trivy gates at build time. Falco monitors at runtime. Security Command Center handles compliance reporting. No overlaps, no gaps, no false sense of security.


    The architectural principle:

    Security tools should be additive in coverage, not redundant in scope.




    Common Gotchas During Remediation


    🚩 "I fixed the Dockerfile but CI still fails"


    → Check whether the composite action hardcodes the old target name or carries it as a default input. Update the action's default, and confirm the workflow actually uses the updated action.


    🚩 "Terraform validate passes but plan fails"


    → You're probably in the wrong directory. Check pwd in your CI logs and verify working-directory is set.


    🚩 "Images build but Cloud Run deployment fails"


    → Service account permissions (Layer 2). Run:






    gcloud projects get-iam-policy YOUR_PROJECT \
      --flatten="bindings[].members" \
      --filter="bindings.members:serviceAccount:*@cloudbuild.gserviceaccount.com"







    🚩 "Firebase deployment fails with region conflict"


    → Check org policy:






    gcloud resource-manager org-policies describe \
    constraints/gcp.resourceLocations \
    --project=YOUR_PROJECT







    🚩 "Variables are undefined in running container"


    → Don't put them in the Dockerfile. Inject via Terraform's env blocks in the Cloud Run service definition.


    🚩 "Trivy scan passes but vulnerable images still get deployed"

    → Check exit-code configuration. Trivy reports findings by default but doesn't fail the job unless exit-code: '1' is explicitly set with a severity threshold.
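
    To make the gate concrete, here is roughly what exit-code: '1' with a severity threshold amounts to, written as a standalone check over a JSON report. The report shape (Results, Vulnerabilities, VulnerabilityID, Severity) follows Trivy's JSON output as I understand it, so verify it against your Trivy version; the sample report and CVE IDs are hypothetical.

```python
import json


def blocking_findings(report, severities=("CRITICAL", "HIGH")):
    """Return (vulnerability_id, severity) pairs that match the gate."""
    hits = []
    for result in report.get("Results") or []:
        # Targets without findings may omit the key or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities:
                hits.append((vuln.get("VulnerabilityID"), vuln.get("Severity")))
    return hits


def gate(report_json, severities=("CRITICAL", "HIGH")):
    """Exit code a gating step should return: 1 if blocked, otherwise 0."""
    return 1 if blocking_findings(json.loads(report_json), severities) else 0


# Hypothetical report trimmed to the fields this sketch reads.
SAMPLE = json.dumps({
    "Results": [
        {"Target": "api:staging", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
        ]},
        {"Target": "package-lock.json"},  # a target with no findings
    ]
})

exit_code = gate(SAMPLE)
print("exit code:", exit_code)
# In CI, end the step with `raise SystemExit(exit_code)` so the job status reflects the scan.
```

    The point of the gotcha is the last comment: a scanner that prints findings but exits 0 is a report, not a gate.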


    🚩 "Falco generates hundreds of alerts on Cloud Run startup"

    → Cloud Run has a specific startup sequence that triggers generic Falco rules. Add Cloud Run-specific macros to your custom rules to filter legitimate startup behavior.


    🚩 "Security Command Center shows the same CVE from 3 different sources"

    → You have overlapping scanners. Decide on a single source of truth (Trivy OR Container Analysis, not both) and disable the redundant one.


    🚩 "Binary Authorization blocks deployment after security scan passes"

    → The attestor isn't linked to your Trivy results. The attestation step must explicitly create a Binary Authorization attestation after a successful scan.





    What This Analysis Doesn't Cover


    If this were real infrastructure, you would also need to check the following:
    • Terraform state drift (manual changes in GCP)
    • Networking/DNS configuration details
    • Secret management implementation
    • The full history of how the system reached this state


    That said, for the issues listed here, the root causes described above match what the official Terraform, Docker, GitHub Actions, and GCP documentation describes.


    Think of this as: symptoms → probable diagnosis. The real fix needs hands on the actual system.





    [Figure: The 3-Layer Problem. Fix bottom-up, not top-down.]





    Conclusion


    Infrastructure failures rarely have a single cause. What looks like "broken Terraform" is usually a combination of:
    • Configuration errors (Layer 1)
    • Platform evolution you didn't track (Layer 2)
    • Missing architectural decisions (Layer 3)


    The fix isn't just correcting syntax — it's understanding how these layers interact and building a system that's resilient to change.


    Key takeaways:

    1. Diagnose in layers. Don't stop at the obvious errors.
    2. Fix in order. Foundation before plumbing before paint.
    3. Build in gates. Make it impossible for broken code to reach production.
    4. Document decisions. Future you (or the next developer) needs context.
    5. Scope honestly. Complex infrastructure work takes time. Price accordingly.


    The goal isn't just to fix what's broken today — it's to build a system that won't break the same way tomorrow.



