backend · architecture · systems · reliability

Your Database Is Not Cleaning Itself

Why lifecycle integrity in stateful systems depends on explicit, observable maintenance jobs

Somewhere in your database, there are records that should not be there.

Expired content that was never cleaned up. Orphaned artifacts that no longer have a parent record. Stale verification states that haven't been updated in months. Soft-deleted records that were meant to be purged weeks ago.

They're not causing visible problems. Yet. Your queries still work. Your storage costs are slightly higher than they should be, but nothing is alerting. The system appears healthy.

This is the silent failure mode of stateful systems. Not dramatic incidents. Just gradual degradation. State that was supposed to be temporary becomes permanent. Cleanup that was supposed to happen didn't. The gap between what the system thinks it contains and what it actually contains grows slowly, until one day a query returns stale data, a retrieval surfaces expired content, or storage costs spike without warning.

None of this happens if you design lifecycle maintenance as a first-class system.

I – Why Cleanup Belongs in Background Jobs

The instinct is to handle cleanup inline. When a user deletes a project, cascade-delete all its records. When content expires, mark it at query time if the expiry date has passed.

Both approaches fail at scale.

Inline cascade deletion turns a user action into a potentially unbounded database operation. A project with 50,000 records takes seconds to delete inline. The user's request times out. You've now created a partial deletion state that's harder to clean up than the original problem.

Query-time expiry checks mean your queries must always filter by expires_at. Every query carries this predicate. If you forget it once — in a new endpoint, in a migration script, in a manual query — you return expired content. The invariant is not enforced by the system; it's enforced by discipline.

Background jobs solve both problems. Cleanup is an async, bounded operation that doesn't block user-facing requests. Expired records are marked or deleted by the job, not by the query. The query can then use a simple status = 'active' filter without needing to reason about expiry timestamps.

The job is the enforcement mechanism. The query becomes simpler because the job is doing the hard work.

II – The Job Catalog

A stateful AI memory system needs at minimum four maintenance jobs:

Expiry job. Selects records where expires_at < NOW() AND status = 'active', transitions them to status = 'expired'. Runs every 15 minutes. This job never deletes — it only transitions state.

Orphan cleanup job. Selects artifact records where the parent memory record has been soft-deleted. Marks artifacts for deletion, then deletes the corresponding objects from object storage. Runs every hour. This job can delete but only after verifying the parent record status.

Verification staleness job. Selects records where last_verified_at < NOW() - INTERVAL '7 days' and the record is still active. Transitions their verification status to stale. Sends a signal to any webhook or notification system. Does not delete. Runs daily.

Hard delete job. Selects records where deleted_at < NOW() - INTERVAL '30 days'. Permanently removes them from the database. This is the only job that hard-deletes. It runs after a 30-day soft-delete window to allow recovery. Runs daily.

Four jobs. Each with a single, well-defined responsibility. Each with clear criteria for which records it processes.
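The catalog above can also be written down as data rather than prose, so the scheduler and the alerting read from one source of truth. A minimal sketch, with hypothetical names (`JobSpec`, `catalog`):

```go
package main

import (
	"fmt"
	"time"
)

// JobSpec captures one catalog entry. HardDeletes means the job removes rows
// from the database; only hard delete does. Orphan cleanup deletes objects
// from storage, never rows.
type JobSpec struct {
	Name        string
	Every       time.Duration
	BatchSize   int
	HardDeletes bool
}

var catalog = []JobSpec{
	{Name: "expiry", Every: 15 * time.Minute, BatchSize: 500, HardDeletes: false},
	{Name: "orphan-cleanup", Every: time.Hour, BatchSize: 100, HardDeletes: false},
	{Name: "verification-staleness", Every: 24 * time.Hour, BatchSize: 1000, HardDeletes: false},
	{Name: "hard-delete", Every: 24 * time.Hour, BatchSize: 250, HardDeletes: true},
}

func main() {
	for _, j := range catalog {
		fmt.Printf("%-24s every %-6s batch %d\n", j.Name, j.Every, j.BatchSize)
	}
}
```

Keeping the invariant "exactly one job hard-deletes" in data makes it checkable, not just a convention.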

III – Idempotency Is Not Optional

A background job that is not idempotent will eventually cause a double-processing incident.

Clock drift, deployment overlaps, retry logic, manual re-runs after failures — all of these can cause a job to run twice in overlapping windows. If the job is not idempotent, records get double-processed. State transitions get applied twice. Artifacts get deleted from storage twice, causing errors on the second delete.

Idempotency for transition jobs is straightforward. Check the current state before transitioning:

UPDATE memories
SET status = 'expired', expired_at = NOW()
WHERE status = 'active'
  AND expires_at < NOW()
  AND expires_at IS NOT NULL

This query is naturally idempotent — it only transitions records that are still in active state. Running it twice has no effect the second time.

Idempotency for delete jobs requires tracking. Use a deletion_queued_at column. The job sets this on its first pass. On retry, it only processes records where deletion_queued_at IS NOT NULL AND deletion_queued_at < NOW() - INTERVAL '5 minutes' — meaning the queue flag was set at least 5 minutes ago, not in this same run.

IV – Concurrency Limits and Lock Strategies

Background jobs that process large datasets should not run unconstrained. A job that tries to process 100,000 records simultaneously creates database lock contention, spikes memory usage, and interferes with user-facing queries.

Process in bounded batches:

import (
	"context"
	"time"
)

// expireBatch transitions at most batchSize rows and reports how many it
// changed; emit sends structured telemetry. Both are defined elsewhere.
func runExpiryJob(ctx context.Context) error {
	batchSize := 500
	processed := 0

	for {
		n, err := expireBatch(ctx, batchSize)
		if err != nil {
			return err
		}
		processed += n

		if n < batchSize {
			break // last batch
		}

		// Yield between batches, and stop promptly if the job is cancelled.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}

	emit("expiry_job_complete", fields{"processed": processed})
	return nil
}

The sleep between batches yields time to user-facing queries. It also makes the job's database impact predictable — you can measure queries per second and tune the sleep interval to stay within safe bounds.

For jobs that must not overlap with themselves, use a distributed lock. Postgres advisory locks work well for single-region deployments:

SELECT pg_try_advisory_lock(12345);  -- returns false if already locked

Acquire the lock at job start. Release on completion or error. If the lock cannot be acquired, the previous run is still in progress; skip this run. Session-level advisory locks are also released automatically if the connection drops, so a crashed run cannot wedge the lock forever.
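The acquire-or-skip pattern looks like this in process-local form (a `sync.Mutex` standing in for the advisory lock; in production the lock must live in Postgres so it covers every instance):

```go
package main

import (
	"fmt"
	"sync"
)

// jobLocks is a single-process stand-in for pg_try_advisory_lock: the same
// acquire-or-skip shape, minus the database.
var jobLocks sync.Map // job name -> *sync.Mutex

// tryRun executes fn only if no other run of the same job holds the lock.
// It returns false when the run was skipped.
func tryRun(job string, fn func()) bool {
	m, _ := jobLocks.LoadOrStore(job, &sync.Mutex{})
	mu := m.(*sync.Mutex)
	if !mu.TryLock() { // previous run still in progress: skip
		return false
	}
	defer mu.Unlock()
	fn()
	return true
}

func main() {
	ran := tryRun("expiry", func() {
		// while this run holds the lock, a second run must be skipped
		fmt.Println(tryRun("expiry", func() {})) // false
	})
	fmt.Println(ran) // true
}
```

`Mutex.TryLock` requires Go 1.18+. The defer mirrors "release on completion or error": the lock is freed even if fn panics up the stack.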

V – Job-Level Telemetry

Silent failures are the worst kind.

A background job that errors silently appears healthy. The cleanup doesn't happen. Data accumulates. The failure is discovered weeks later during an incident, not during routine monitoring.

Every job run should emit structured telemetry:

{
  "event": "expiry_job_run",
  "job": "expiry",
  "started_at": "2024-03-01T02:00:00Z",
  "completed_at": "2024-03-01T02:00:12Z",
  "records_processed": 1247,
  "records_transitioned": 34,
  "errors": 0,
  "status": "success"
}

Alert on:

  • Any job that has not run successfully in 2x its scheduled interval
  • Any job run with errors > 0
  • Any job that processed 0 records when the previous 7 runs processed > 0 records (might indicate a job regression)

The zero-records alert is undervalued. A job that transitions 0 records might mean everything is healthy, or it might mean the job's query has a bug and it can't find any records to process. The trend over time distinguishes the two cases.
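That trend check is a few lines of code. A sketch with illustrative thresholds (`zeroRunSuspicious` is a hypothetical name):

```go
package main

import "fmt"

// zeroRunSuspicious flags a run that processed 0 records when the last 7
// runs all found work; that pattern suggests a query regression, not health.
func zeroRunSuspicious(current int, history []int) bool {
	if current != 0 || len(history) < 7 {
		return false
	}
	for _, n := range history[len(history)-7:] {
		if n == 0 {
			return false // occasional zeros are normal for this job
		}
	}
	return true
}

func main() {
	fmt.Println(zeroRunSuspicious(0, []int{40, 35, 52, 41, 38, 47, 33})) // true
	fmt.Println(zeroRunSuspicious(0, []int{0, 0, 1, 0, 2, 0, 0}))        // false
}
```

A job that historically idles (zeros mixed into its history) never trips the alert, which is exactly the distinction the prose above describes.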

VI – Manual Override and Backfill Tools

Scheduled jobs operate on their normal cadence. Sometimes you need to force a run.

After fixing a bug that prevented records from expiring correctly, you need to backfill every record that should have been expired but wasn't. The scheduled job will eventually catch them, but you want it done now.

Build override tooling for every job:

ctx-admin jobs run expiry --project-id=proj_abc --dry-run
ctx-admin jobs run expiry --project-id=proj_abc
ctx-admin jobs run hard-delete --before=2024-01-01 --dry-run

--dry-run returns what the job would process without making changes. This is essential — you want to verify the scope before running a bulk operation.

--project-id scopes the backfill to a single project. This is safer than running against the full dataset.

This is operations tooling. It isn't glamorous. It will save you hours during incidents.
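The essential property of --dry-run is that selection and mutation share one code path, with only the mutation gated. A sketch with hypothetical names:

```go
package main

import "fmt"

// runHardDelete sketches the --dry-run contract. The candidate selection
// (here just the ids slice; in a real job, the job's SELECT) is identical in
// both modes, so a dry run reports exactly what a real run would touch.
func runHardDelete(ids []string, dryRun bool, deleteFn func(string)) []string {
	candidates := ids
	if dryRun {
		return candidates // report scope, mutate nothing
	}
	for _, id := range candidates {
		deleteFn(id)
	}
	return candidates
}

func main() {
	deleted := 0
	del := func(string) { deleted++ }

	scope := runHardDelete([]string{"mem_1", "mem_2"}, true, del)
	fmt.Println(len(scope), deleted) // 2 0: dry run reports but never mutates

	runHardDelete(scope, false, del)
	fmt.Println(deleted) // 2
}
```

If dry run and real run select candidates through separate queries, they drift, and the dry run's answer stops being trustworthy.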

VII – What Breaks First

Duplicate processing. Two instances of the expiry job run simultaneously (deployment overlap, scheduler bug). Both select the same 500 records. Both transition them to expired. The second run hits records already in expired state, does nothing, finishes cleanly. If idempotency is correct, this is fine. If it's not — if the job uses INSERT instead of UPDATE and doesn't check current state — you get duplicate audit events, doubled metrics, and confused state.

Orphaned storage growth. Artifact records are soft-deleted. The hard delete job eventually removes the database rows, and with them the only reference to the corresponding S3 keys. If orphan cleanup looks for artifacts whose parent row is already gone, it finds nothing to act on, and the S3 objects are never deleted. They accumulate. After 6 months, your storage costs have doubled and nobody knows why. Fix: orphan cleanup must select artifacts while they are still soft-deleted, before hard delete destroys the rows, and the schedule must run orphan cleanup ahead of hard delete.

Silent job failures. The expiry job encounters a Postgres connection error on its 4th batch. It logs the error. It exits. The scheduler sees a completed run and schedules the next one for tomorrow. Nobody gets alerted. 10,000 records that should have been expired aren't. The job has been silently failing for two weeks. Fix: structured telemetry with error alerting. A job that errors is not a completed job.

Job Catalog Template

Job                    | Trigger      | Batch Size | Idempotent               | Lock
Expiry                 | Every 15 min | 500        | Yes (status check)       | Advisory
Orphan cleanup         | Every 1 hour | 100        | Yes (deletion_queued_at) | Advisory
Verification staleness | Daily 02:00  | 1000       | Yes (last_verified_at)   | Advisory
Hard delete            | Daily 03:00  | 250        | Yes (deleted_at window)  | Advisory

Idempotency Checklist

  • Job queries filter by current state, not target state
  • Transitions only apply to records in the expected source state
  • No INSERTs without conflict handling
  • DELETE operations verify soft-delete status before executing
  • Running the job twice produces the same final state as running it once

The database will not clean itself. Build the jobs. Monitor the jobs. Run them explicitly when you need to. Treat them as production systems, not afterthoughts.
