Your Retrieval Algorithm Doesn't Matter If Your Schema Is Wrong
Schema quality determines long-term correctness more than any algorithm choice
Teams building AI memory systems spend enormous energy on retrieval. Vector search strategies. Re-ranking models. Embedding quality. Fallback chains.
And then they ship a schema with ambiguous status fields, missing indexes, and no lifecycle columns — and spend the next year debugging behaviors that don't make sense, losing data they thought they were keeping, and migrating production tables under live load.
The schema is the foundation. Get it wrong and every retrieval optimization you add is building on sand.
This post covers what a production schema for a stateful AI system actually looks like, how to migrate it safely, and what the failure modes are when you skip the discipline.
I – Domain Decomposition Before Tables
Most schema mistakes happen before a single column is typed. They happen in the mental model.
A stateful AI system has six distinct domains. They need to be separated, not merged.
Memory (or context items). The actual stored content. Text, metadata, importance signals, status. This is the read-heavy core of the system.
Sources. Where the memory came from. A URL, a file, a conversation, a manual entry. Source is separate from content because one source can produce multiple memory items.
Artifacts. Binary or large-object storage references. A pointer to S3, not the blob itself. The blob lives in object storage. The pointer lives in Postgres.
Audit. Immutable event log. Who did what, when, from which identity. Never updated. Append-only. Separate table, separate retention policy.
Keys. API credentials. Lifecycle, scope, hashes. Sensitive enough to warrant its own table with restricted access patterns.
Usage. Metering and quota data. High-write, separate from the hot read path.
If any two of these live in the same table, you will pay for it. Audit mixed with memory means you can't delete memory without losing audit history. Usage mixed with keys means quota queries slow down authentication. The separation is not aesthetic — it's operational.
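The separation can be sketched as DDL. This is illustrative only: table and column names here (`memories`, `sources`, `audit_events`) are hypothetical, not a prescription.

```sql
-- Sketch: one table per domain, never merged.
CREATE TABLE sources (
    id   UUID PRIMARY KEY,
    kind TEXT NOT NULL,  -- 'url', 'file', 'conversation', 'manual'
    ref  TEXT NOT NULL   -- the URL, file path, or conversation id
);

CREATE TABLE memories (
    id         UUID PRIMARY KEY,
    project_id UUID NOT NULL,
    source_id  UUID REFERENCES sources (id),  -- one source, many memories
    content    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'active',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Audit is append-only: no updated_at, no deleted_at,
-- its own retention policy.
CREATE TABLE audit_events (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    actor_id   UUID NOT NULL,
    action     TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

Deleting a memory row touches only `memories`; the audit trail for that row survives in `audit_events`.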
II – Lifecycle Columns That Actually Work
Every mutable entity needs lifecycle columns. Not some of them. All of them.
The minimum set:
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
expires_at TIMESTAMPTZ
deleted_at TIMESTAMPTZ -- soft delete
status TEXT NOT NULL DEFAULT 'active'
created_at and updated_at are non-negotiable. You will need them for debugging, for sorting, for migrations, for support tickets. Add them to every table from day one.
expires_at enables TTL-based cleanup driven by time rather than by bespoke application logic. A background job selects rows WHERE expires_at < NOW() AND deleted_at IS NULL and transitions them to an expired status. Simple. Predictable. No extra state-machine complexity.
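The sweep described above can be written as a single statement. A sketch, assuming the lifecycle columns shown earlier on a hypothetical memories table:

```sql
-- Mark expired rows; run periodically from a background job.
UPDATE memories
SET    status     = 'expired',
       updated_at = NOW()
WHERE  expires_at < NOW()
  AND  deleted_at IS NULL
  AND  status     = 'active';
```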
deleted_at enables soft deletes. Soft deletes are important for audit integrity — you can't hard-delete a memory item and still answer "why did this recall fail?" Hard deletes destroy forensic capability.
status is where most schemas go wrong. The failure mode: status becomes a grab-bag string that means different things to different parts of the code.
A status field is only valid if:
- The set of valid values is enumerated (not free-text)
- The transitions between values are documented
- Every status value means exactly one thing
Use a database enum or a CHECK constraint. If you stay with plain text, ship a migration that updates the constraint the moment a new status value is introduced. Undocumented status fields are where bugs live indefinitely.
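One way to enforce the enumeration, as a sketch (the status values and table name are illustrative):

```sql
-- Valid transitions, documented where the constraint lives:
--   active -> expired | archived; soft delete is deleted_at, not a status.
ALTER TABLE memories
    ADD CONSTRAINT memories_status_check
    CHECK (status IN ('active', 'expired', 'archived'));
```

Adding a new status now forces a migration, which forces a review of the transitions.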
III – Constraints and Indexes That Match Your Query Paths
An index is not a performance optimization. It's a query contract. It says: this query path is supported, efficient, and will remain so under load.
The indexes you need are determined by your query patterns, not by gut feel.
For a memory system, the hot query paths are:
- Fetch all active memories for a project, ordered by importance and recency
- Vector similarity search filtered by project, status, and type
- Fetch a specific memory by ID and verify ownership
- Audit: fetch all events for a given actor in a time range
Each of these needs an index. The index must match the query's WHERE clause and ORDER BY clause, with column order that mirrors the filters, or the query planner may not use it at all.
-- Path 1
CREATE INDEX idx_memories_project_status_created
ON memories (project_id, status, created_at DESC)
WHERE deleted_at IS NULL;
-- Path 3
CREATE INDEX idx_memories_project_id
ON memories (project_id, id)
WHERE status = 'active';
-- Path 4
CREATE INDEX idx_audit_actor_created
ON audit_events (actor_id, created_at DESC);
Partial indexes (the WHERE deleted_at IS NULL clause) are especially valuable. They exclude soft-deleted records from the index entirely, keeping the index small and fast.
The test: after adding each index, run EXPLAIN ANALYZE on the real query. If it's not using the index, either the index is wrong or the query needs to change. Never assume.
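A quick check against Path 1, for example (the project id is a placeholder):

```sql
EXPLAIN ANALYZE
SELECT id, content
FROM   memories
WHERE  project_id = '00000000-0000-0000-0000-000000000000'
  AND  status     = 'active'
  AND  deleted_at IS NULL
ORDER  BY created_at DESC
LIMIT  50;
-- Look for "Index Scan using idx_memories_project_status_created"
-- in the plan. A "Seq Scan" means the index is not being used.
```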
IV – Migrations That Don't Break Production
A migration is a contract change. Once applied to production, it cannot be undone — only evolved forward.
Three rules for safe production migrations:
Additive before subtractive. Add columns, then migrate data, then drop old columns in a separate migration. Never add and drop in the same migration. Never rename columns in a single step (rename = drop + add from the application's perspective).
Non-blocking DDL. Adding a column with a default value in Postgres before version 11 rewrites the entire table. Adding a NOT NULL column without a default blocks reads and writes. Know your database version. Know which operations are blocking. Use ADD COLUMN ... DEFAULT NULL first, backfill, then tighten the constraint.
Idempotent execution. Every migration script must be safely re-runnable. CREATE INDEX IF NOT EXISTS. ALTER TABLE ... ADD COLUMN IF NOT EXISTS. If a CI runner applies the migration twice (it will, eventually), nothing should break.
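In Postgres terms, an idempotent migration step might look like the following sketch (column and index names are illustrative):

```sql
-- Safe to re-run: both statements are no-ops on the second pass.
ALTER TABLE memories
    ADD COLUMN IF NOT EXISTS importance REAL;

CREATE INDEX IF NOT EXISTS idx_memories_importance
    ON memories (project_id, importance DESC)
    WHERE deleted_at IS NULL;
```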
The migration tooling must enforce ordering. Numbered migrations (001, 002, 003...) in a linear sequence. No branching. No "this can run in any order." The order of schema changes is part of the schema.
V – Online Migrations for Live Systems
The dangerous migration: you need to backfill a new column on a table with 10 million rows while the table is taking writes.
The wrong approach: a single UPDATE memories SET new_column = compute(old_column). It locks every row it touches, holds one long transaction, and takes 45 minutes while replication lag climbs. Migrations like this have ruined on-call shifts.
The right approach is a staged online migration:
- Add the column, nullable, no default. No data written yet.
- Dual-write. Application writes to both old and new column on every insert/update. Read from old column.
- Backfill in batches. A background job processes rows in chunks of 1,000-5,000: WHERE new_column IS NULL LIMIT 5000, sleeping ~100ms between batches. It takes hours instead of minutes. That is fine.
- Cutover. Once the backfill is complete and dual-write has run long enough, switch reads to the new column.
- Cleanup. Drop the old column in a follow-up migration after the code change is fully deployed.
This is more steps. It's also the only way to do it without an outage.
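The backfill step above can be sketched in SQL. Here new_column, old_column, and compute() are placeholders for the real migration logic:

```sql
-- One batch; the background job repeats this until zero rows are updated,
-- sleeping ~100ms between runs to keep replication lag flat.
WITH batch AS (
    SELECT id
    FROM   memories
    WHERE  new_column IS NULL
    LIMIT  5000
    FOR UPDATE SKIP LOCKED   -- don't fight concurrent writers
)
UPDATE memories m
SET    new_column = compute(m.old_column)
FROM   batch
WHERE  m.id = batch.id;
```

SKIP LOCKED lets the batch step around rows the application is currently writing, so the dual-write path and the backfill never block each other.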
VI – Rollback Reality
Here is the truth about migration rollbacks: you will not use them.
You will not roll back a migration that added a column to a production table. Rolling back a migration means removing that column. The application code that was just deployed is writing to that column. If you roll back the migration, the column disappears, and the application crashes.
The realistic rollback is the application. Not the migration.
Design migrations so the new schema is backward-compatible with the old application version. Add the column. Deploy the app. The old app ignores the new column. The new app uses it. If the deploy fails, roll back the app. The migration stays. The schema is now slightly ahead of the application, and that is fine.
The rule: every migration must be safe to apply before the corresponding application change is deployed.
VII – What Breaks First
Migration drift across environments. Production is on migration 47. Staging is on migration 43. A bug reproducible only under migrations 44-47 cannot be debugged on staging. Staging and production must be on the same schema. Always. Automate the enforcement.
Missing index on a status check. A query filtering by status = 'active' without an index does a full table scan. At 100 rows, this is invisible. At 1 million rows, it's a P0. Add indexes when you add status columns, not when the incident happens.
Ambiguous status fields. A status field with values like pending, processing, done, failed, error, cancelled, archived, and soft_deleted — where failed and error are subtly different in ways only one engineer knows. This is how bugs live for years. Enumerate the status values. Document the transitions. Add a database constraint.
Migration Runbook
- Review migration for blocking DDL operations
- Estimate table size and estimate migration runtime
- Apply to staging and measure actual runtime
- Deploy to production during low-traffic window if blocking
- Monitor replication lag and table lock metrics during apply
- Verify row counts and index usage after apply
- Update schema documentation
Rollback Decision Tree
Migration failed mid-apply?
→ Is the migration idempotent?
YES → Re-run the migration
NO → Assess partial state, roll forward manually
Application is crashing after deploy?
→ Roll back the application deployment
→ Migration stays applied
→ Fix the application to be compatible with new schema
Data corruption detected?
→ Stop writes immediately
→ Assess from audit log what changed and when
→ Restore from backup to point-in-time before migration
→ This is the nuclear option. Use it knowingly.
Schema discipline is not glamorous. It does not show up in a demo. It shows up at 2am when a migration has been running for 45 minutes and replication lag is climbing and you need to decide whether to kill it or let it finish.
Make those decisions before you need to make them.