Your Deployment Pipeline Is Clever. That's the Problem.
Reliable delivery pipelines optimize for repeatability and recoverability, not cleverness
The cleverly designed pipeline is the one that fails in ways nobody expected.
The pipeline with five nested conditionals, dynamic environment selection, and a custom deployment script that "handles all the edge cases" — that's the pipeline that breaks at 11pm on a Friday when you need to push a hotfix and nobody remembers why the conditional is there or what it's supposed to protect against.
Deployment pipelines are infrastructure. Infrastructure should be boring. Boring infrastructure is infrastructure that works exactly the same way every time, fails loudly when something goes wrong, and can be debugged by someone who wasn't the original author.
Repeatability. Recoverability. Observability. Not cleverness.
This post covers what a reliable delivery pipeline for a containerized API looks like, and the specific failure modes that come from choosing complexity over simplicity.
I – Artifact Immutability
The foundation of a repeatable deployment: every artifact is immutable.
An immutable artifact is one that can be deployed, rolled back, and re-deployed without rebuilding it. The artifact deployed to production today is bit-for-bit identical to the artifact that was tested in CI yesterday.
Containers enable this naturally. An image tagged with the commit SHA is immutable by convention: the tag is pushed once and never overwritten. Deploying registry/service:a3f1b2c3 means deploying exactly that image, regardless of when you deploy it. (Pinning by digest, registry/service@sha256:..., turns that convention into a guarantee the registry enforces.)
The anti-pattern: deploying :latest. The :latest tag is mutable. Today's :latest is not tomorrow's :latest. Two deployments of :latest can deploy different code. If you need to roll back to "what was deployed yesterday," :latest can't help you — you don't know what yesterday's :latest contained.
Tag every image with a content-derived identifier:
# GitHub Actions
- name: Build and tag
  run: |
    SHA="${GITHUB_SHA::8}"
    docker build -t registry/service:${SHA} .
    docker push registry/service:${SHA}
    echo "IMAGE_TAG=${SHA}" >> $GITHUB_ENV
The short SHA is human-readable, unique per commit, and stable. A deployment record that says "deployed registry/service:a3f1b2c3 at 14:22:31 UTC" is auditable. A deployment record that says "deployed :latest" is not.
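Tags, even SHA-derived ones, are still mutable pointers at the registry level; the digest is the true content address. A small sketch, assuming the image has already been pulled locally, that resolves a tag to its digest so the deployment record can pin the exact bytes:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Resolve a tag to its immutable registry digest. Recording the digest in
# the deployment log guards against the tag ever being force-pushed.
resolve_digest() {
  docker inspect --format '{{index .RepoDigests 0}}' "$1"
}
```

Running `resolve_digest registry/service:a3f1b2c3` prints something like registry/service@sha256:..., which can be deployed directly in place of the tag.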
II – Container Provenance
For production systems, know what's inside every image you deploy.
Provenance means: the image was built from a specific commit, by a specific CI run, with specific dependencies, and that lineage is verifiable.
Labels on every image:
LABEL org.opencontainers.image.source="https://github.com/org/repo"
LABEL org.opencontainers.image.revision="${GIT_SHA}"
LABEL org.opencontainers.image.created="${BUILD_DATE}"
LABEL org.opencontainers.image.version="${VERSION}"
These labels are baked into the image at build time. When you pull an image and run docker inspect, the labels tell you exactly where the image came from.
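For example, a small helper (assuming the image is present locally) that reads those labels back at audit time:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the OCI provenance labels baked into an image at build time, so an
# operator can answer "where did this image come from?" from the image alone.
image_labels() {
  docker inspect --format '{{json .Config.Labels}}' "$1"
}
```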
Supply-chain controls: sign images with cosign. The signature proves the image was built by your CI system and hasn't been tampered with. This is particularly important if your registry is shared or if images are pulled from untrusted environments.
cosign sign --key cosign.key registry/service:${SHA}
Verify on deploy:
cosign verify --key cosign.pub registry/service:${SHA}
If verification fails, the deployment should abort. An unsigned image means the image was not built by your CI system. That's worth investigating before deploying.
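The abort-on-failure gate can be a small wrapper in the deploy script; a sketch:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Gate the deploy on signature verification: an image that fails
# verification was not produced by CI and must not reach production.
verify_or_abort() {
  local image="$1"
  if ! cosign verify --key cosign.pub "$image" > /dev/null 2>&1; then
    echo "signature verification failed for ${image}; aborting" >&2
    return 1
  fi
  echo "verified ${image}"
}
```

Called at the top of the deploy script, a non-zero return stops everything that follows.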
III – Migration Execution in Deployment Flow
Database migrations run before the application starts. Not after. Not during. Before.
The reason is simple: the new application code expects the new schema. If the application starts before the migration runs, it will fail in unpredictable ways when it encounters the old schema.
The deployment sequence:
1. Push image to registry
2. SSH/SCP to deployment target
3. Pull new image
4. Run migrations (in a temporary container using the new image)
5. If migrations fail: abort, do not restart the application
6. Stop old container
7. Start new container with new image
8. Wait for health check to pass
9. Clean up old container
Step 5 is critical. If a migration fails, the old application is still running against the old schema. That's fine — the old code is compatible with the old schema. Stop here, investigate, fix the migration, and try again.
If you skip step 5 and start the new application before verifying migration success, you may start an application that expects a schema that doesn't exist yet. The application crashes. Now you need to roll back the application and the migration simultaneously, under pressure.
Make migrations fail early. Let them stop the deploy.
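The sequence above can be sketched as a single deploy function. The container name, the ./migrate entrypoint, and the health-check helper are placeholders, not a real implementation; the point is that the migration gates the restart:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder: poll the health endpoint until it answers (see section V).
wait_healthy() { curl -sf http://localhost:8080/health > /dev/null; }

deploy() {
  local tag="$1"
  docker pull "registry/service:${tag}"                    # step 3

  # Steps 4-5: run migrations in a throwaway container from the NEW image.
  # If this fails, the old container is still serving the old schema.
  if ! docker run --rm "registry/service:${tag}" ./migrate; then
    echo "migration failed; aborting before touching the running service" >&2
    return 1
  fi

  docker stop service && docker rm service                 # steps 6, 9 (old container exists)
  docker run -d --name service "registry/service:${tag}"   # step 7
  wait_healthy                                             # step 8
}
```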
IV – Secrets Management in CI
Secrets in CI pipelines are a persistent security concern.
The wrong approach: secrets in environment variables at the repository level, available to all workflows. A malicious pull request from a fork can exfiltrate these secrets.
The right approach: secrets scoped to environments with protection rules.
jobs:
  deploy:
    environment: production  # Protected environment
    steps:
      - name: Deploy
        env:
          DEPLOY_KEY: ${{ secrets.DEPLOY_SSH_KEY }}
          REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
        run: ./scripts/deploy.sh
A "protected environment" in GitHub Actions requires reviewer approval before a workflow can access it. This prevents a compromised workflow from automatically accessing production secrets.
For extremely sensitive secrets (production database credentials, signing keys), prefer short-lived secrets with time-bounded access over long-lived static values. Platforms like HashiCorp Vault, AWS Secrets Manager, or GitHub's OpenID Connect integration with cloud providers can generate time-bounded credentials that are useless after the deployment completes.
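As one sketch of the dynamic-credential pattern, using Vault's AWS secrets engine (the deploy-role name is an assumption for illustration; jq parses the response):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mint short-lived AWS credentials at deploy time. Vault issues a lease
# scoped to this deploy; once the TTL expires, the keys stop working on
# their own, so a leaked value ages out without rotation work.
fetch_deploy_creds() {
  local creds
  creds=$(vault read -format=json aws/creds/deploy-role)
  AWS_ACCESS_KEY_ID=$(jq -r '.data.access_key' <<< "$creds")
  AWS_SECRET_ACCESS_KEY=$(jq -r '.data.secret_key' <<< "$creds")
  export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
}
```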
V – Health Verification and Rollout Gates
Deploying successfully does not mean the service is healthy.
A deployment that passes the build, passes migrations, and starts the container is not necessarily a deployment that is serving requests correctly. The new code might have a runtime error that only appears when handling real traffic. The new migration might have created a performance regression that doesn't show up until queries start executing.
Health verification is the gate between "deployed" and "done":
# Health check loop
MAX_ATTEMPTS=30
INTERVAL=5
for i in $(seq 1 $MAX_ATTEMPTS); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
  if [ "$STATUS" = "200" ]; then
    echo "Service healthy after $((i * INTERVAL))s"
    exit 0
  fi
  echo "Attempt $i: status $STATUS, waiting..."
  sleep $INTERVAL
done
echo "Service failed health check after $((MAX_ATTEMPTS * INTERVAL))s"
exit 1
If health checks fail, the deployment is not complete. The old container has already been stopped, so you're in a failed state. The on-call alert fires.
This is better than silently deploying a broken service and discovering it through user complaints. The deployment failure is visible, attributed to the deployment, and actionable immediately.
VI – Rollback Procedures
A rollback is a deployment of the previous image.
# Rollback to the previous image (assumes the last deploy was built from
# HEAD; otherwise, read the known-good tag from your deployment log)
PREVIOUS_SHA=$(git rev-parse HEAD~1 | head -c 8)
./scripts/deploy.sh --image-tag="${PREVIOUS_SHA}"
Because images are immutable and tagged with SHAs, rollback is the same operation as deployment. The deploy script doesn't know or care whether the image tag is "new" or "previous." It deploys the specified image.
Rollback does not roll back migrations. Migrations are forward-only. This is why every migration must be backward-compatible with the previous application version. If you need to remove a column, do it in three steps: add the new column, migrate the application to not use the old column, then remove the old column in a separate migration. Never deploy code changes and destructive schema changes together.
The rollback decision tree:
Application is unhealthy after deploy?
→ Does the previous version still work with the new schema?
   YES → Roll back application. Leave migration.
   NO  → You have a problem. See "forward-fix" below.

Forward-fix: deploy a new version that fixes the bug
→ Faster than coordinating a schema rollback
→ Schema rollbacks require their own migration
→ Preferred for all but the most severe schema bugs
VII – What Breaks First
Build/deploy mismatch by tag drift. The CI build produces service:a3f1b2c3. The deployment script is configured to deploy service:latest. They diverge. The deployment pulls whatever :latest was last pushed, which may be from a different branch or a different feature. Fix: pass the image tag from the build step to the deploy step as an explicit parameter. Never use :latest in a deployment.
Migration partial failures. A migration fails on the 5th of 10 SQL statements. The first 4 have already executed. The database is in a partially-migrated state. The deployment script doesn't abort — it continues to start the application. The application fails on the half-migrated schema. Fix: every migration must be wrapped in a transaction. If any statement fails, the entire migration rolls back. The database returns to the pre-migration state. The deployment aborts.
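With PostgreSQL, assuming migrations are plain SQL files, psql's --single-transaction flag gives that all-or-nothing behavior directly:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Apply one migration file atomically: ON_ERROR_STOP makes psql exit
# non-zero on the first failing statement, and --single-transaction wraps
# the whole file so that failure rolls everything back.
run_migration() {
  psql --single-transaction \
       --set ON_ERROR_STOP=1 \
       --file "$1" \
       "$DATABASE_URL"
}
```

One caveat: statements that cannot run inside a transaction (such as CREATE INDEX CONCURRENTLY) need their own non-transactional migration.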
Runtime failures after successful build. The Docker build succeeds. The container starts. The health check passes (it checks /ping which returns 200 regardless of application state). But the database connection string in the environment is wrong, and every real request fails with a 500. Fix: health checks must verify actual application function, not just process liveness. A health check that connects to the database and executes a test query catches this.
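One way to close the /ping gap is a post-deploy smoke test against a real endpoint (the /api/items path and expected body below are placeholders for whatever your service actually serves):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Exercise a database-backed endpoint after deploy. A 200 from /ping only
# proves the process is up; a real response body proves the service works.
smoke_test() {
  local body
  body=$(curl -sf "http://localhost:8080/api/items") || return 1
  # The endpoint should return real data, not an error page.
  grep -q '"items"' <<< "$body"
}
```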
Deployment Checklist
- Image tagged with git SHA, not :latest
- Image signed with cosign
- Migrations run and succeed before application starts
- Deployment aborts if migrations fail
- Health check verifies application function, not just liveness
- Deployment verified healthy before marking complete
- Rollback procedure tested (not just documented)
- Deployment notification sent to monitoring channel
Rollback Runbook
- Identify the last known-good image tag from deployment logs
- Run ./scripts/deploy.sh --image-tag=${PREVIOUS_SHA}
- Monitor health check output
- Verify rollback succeeded: curl /health returns 200
- Identify root cause of failed deployment
- Fix forward — do not re-attempt the failed deployment without fixing the root cause
- Deploy the fix as a new commit with a new image tag
The boring pipeline is the reliable pipeline. Every bit of cleverness you remove is an incident you won't have.