database · postgres · ai · retrieval

pgvector Is Not a Shortcut. It's a Database Engineering Problem.

What it actually takes to run Postgres with pgvector reliably in production

The pitch for pgvector is appealing: you already have Postgres, just install the extension, add a vector column, and you have a vector database. One less service to run. One less thing to operate.

That pitch is true in development. In production, it falls apart if you treat pgvector like a feature flag rather than a database engineering decision.

Teams that succeed with pgvector in production treat it exactly like they treat any other demanding Postgres workload: they understand the index types, they manage vacuum carefully, they plan for connection pool contention, and they have a migration path when the extension can no longer keep up.

Teams that fail with pgvector install the extension, add a column, skip the index, and wonder why their similarity queries are doing full table scans at 500ms each.

This post is about the former.

I – When pgvector Is the Right Choice

pgvector is not the right choice for every system. Know the decision criteria.

Use pgvector when:

  • You have an existing Postgres deployment and want to avoid running another service
  • Your embedding table has fewer than ~5 million rows
  • You don't need sub-10ms query latency at high concurrency
  • Your query patterns combine vector similarity with relational filters (e.g., WHERE project_id = $id AND status = 'active')
  • You want to handle backups, migrations, and consistency in one system

Consider a dedicated vector database (Pinecone, Weaviate, Qdrant) when:

  • You need sub-5ms similarity queries under heavy concurrent load
  • Your embedding table is expected to grow beyond 10-20 million rows
  • Your primary query pattern is pure vector similarity with no relational context
  • You want automatic scaling and don't want to tune Postgres for vector workloads

For most API platforms serving AI memory, pgvector is the right starting choice. The vast majority of such systems never reach the scale where a dedicated vector database's advantages materialize. Ship with pgvector, and establish the monitoring that tells you when you need to migrate.

II – Extension Management and Index Strategy

Installing the extension is one step. The operational setup is many more.

CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE memories ADD COLUMN embedding vector(1536);

You now have a vector column. You do not have a vector index. Without an index, every similarity query is a sequential scan — it computes the distance from your query vector to every row in the table. At 10,000 rows, this is fast. At 1 million rows, it's 500ms+ per query.

Add the index:

-- HNSW index: better query performance, slower build
CREATE INDEX idx_memories_embedding_hnsw ON memories
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- IVFFlat: faster to build, slightly lower recall
CREATE INDEX idx_memories_embedding_ivfflat ON memories
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

HNSW (Hierarchical Navigable Small World) is the right choice for most production systems: it delivers faster queries and higher recall. The tradeoffs are slower index builds and higher memory usage, which makes it best suited to read-heavy workloads where write throughput is not the primary concern. In production, build it with CREATE INDEX CONCURRENTLY so the build does not block writes to the table.

IVFFlat builds faster and uses less memory. It divides the vector space into a fixed number of clusters (the lists parameter) and, at query time, searches only the most relevant clusters (the probes parameter). At comparable settings its recall is lower than HNSW's. Use IVFFlat when you need faster index builds or lower memory consumption.

The distance metric matters. Use vector_cosine_ops for normalized embeddings (OpenAI embeddings are normalized) and vector_l2_ops where raw Euclidean distance is appropriate. The operator in your query must match the opclass the index was built with; otherwise Postgres cannot use the index, and comparing distances computed under different metrics produces nonsense rankings.
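The opclass-to-operator pairing is easy to get wrong once queries are scattered across an application. A sketch that keeps the two in lockstep (the helper and table names are illustrative, not from pgvector itself; the operators are pgvector's):

```python
# Map each pgvector index opclass to the query operator that uses it.
# If the ORDER BY operator doesn't match the index opclass, Postgres
# falls back to a sequential scan.
OPCLASS_TO_OPERATOR = {
    "vector_cosine_ops": "<=>",  # cosine distance
    "vector_l2_ops": "<->",      # Euclidean (L2) distance
    "vector_ip_ops": "<#>",      # negative inner product
}

def similarity_query(table: str, metric_opclass: str, limit: int = 20) -> str:
    """Build a similarity query whose operator matches the index opclass."""
    op = OPCLASS_TO_OPERATOR[metric_opclass]
    return (
        f"SELECT id, content, embedding {op} %(query_embedding)s AS score "
        f"FROM {table} ORDER BY score LIMIT {limit}"
    )
```

Centralizing the mapping means an index migration (say, cosine to inner product) is a one-line change rather than a hunt through every query site.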

III – The Filter Interaction Problem

This is the most common pgvector performance problem.

A query like this:

SELECT id, content, embedding <=> $query_embedding AS score
FROM memories
WHERE project_id = $projectId
  AND status = 'active'
ORDER BY score
LIMIT 20;

This query has a problem. The vector index is built on the embedding column globally. But the WHERE clause filters to a subset of rows by project_id and status. The query planner may choose to:

  1. Use the vector index (fast similarity scan), then filter by project_id. Most of the scanned candidates are discarded by the filter, so the work is wasted and the query can return fewer than LIMIT rows: low effective recall
  2. Filter by project_id first (fast, uses the relational index), then compute similarity over the filtered rows (slow, no vector index usage)

Neither is ideal. The query planner often makes the wrong choice, particularly when the project_id filter is highly selective (one project out of thousands).

The solutions, in order of preference:

Partial index per tenant. A partial HNSW index scoped to one tenant (CREATE INDEX ... WHERE project_id = ...) keeps that tenant's scans fast, but one index per tenant is not practical at large tenant counts.

pgvector's ef_search tuning. Increase the HNSW ef_search parameter to scan more candidates before applying filters. This improves recall at the cost of latency.

SET hnsw.ef_search = 200;  -- default 40, higher = more candidates scanned

Pre-filter with CTE, then vector search. Restrict the candidate set before running similarity:

WITH project_memories AS MATERIALIZED (  -- since Postgres 12, plain CTEs are inlined; MATERIALIZED forces the pre-filter
  SELECT id, content, embedding
  FROM memories
  WHERE project_id = $projectId
    AND status = 'active'
    AND deleted_at IS NULL
)
SELECT id, content, embedding <=> $query_embedding AS score
FROM project_memories
ORDER BY score
LIMIT 20;

This is a sequential scan on the pre-filtered set. For small projects (< 50K records per project), this is fast enough. For large projects, you need the index.

Partition by project. Table partitioning by project_id allows pgvector to build per-partition indexes. Vector scans stay within the partition. This is the correct solution for high-scale deployments but adds operational complexity.
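A sketch of what this can look like with hash partitioning on project_id (the partition count and names are illustrative, not prescriptive):

```sql
-- Hash-partition the table on project_id; a single-project query
-- prunes to one partition and scans that partition's own HNSW index.
CREATE TABLE memories (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    project_id  bigint NOT NULL,
    content     text,
    embedding   vector(1536),
    PRIMARY KEY (id, project_id)  -- the partition key must be part of the PK
) PARTITION BY HASH (project_id);

CREATE TABLE memories_p0 PARTITION OF memories
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...repeat for remainders 1 through 3...

-- An index on the parent is created on each partition automatically.
CREATE INDEX ON memories USING hnsw (embedding vector_cosine_ops);
```

List partitioning (one partition per tenant) is the alternative when tenants are few and large; hash keeps the partition count bounded as tenants grow.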

IV – Vacuum and Bloat

pgvector creates additional challenges for Postgres's autovacuum.

Every update to a row (including embedding updates when you re-index content) creates a dead tuple. Autovacuum reclaims dead tuples. For tables with frequent embedding updates, autovacuum needs to run more aggressively than its default settings.

If autovacuum can't keep up, dead tuples accumulate. The table bloats. Scan times increase because Postgres reads through dead tuples even if they're not returned. Vector index quality degrades because HNSW nodes pointing to deleted tuples cause unnecessary comparisons.

Monitor bloat:

SELECT
  schemaname,
  relname AS tablename,
  n_dead_tup,
  n_live_tup,
  round(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE relname = 'memories'
ORDER BY dead_pct DESC;

If dead_pct is consistently above 10-15%, tuning autovacuum is warranted:

ALTER TABLE memories SET (
  autovacuum_vacuum_scale_factor = 0.01,  -- vacuum when 1% of rows are dead (default 20%)
  autovacuum_analyze_scale_factor = 0.005
);

V – Connection Pooling and Workload Isolation

Vector queries are CPU-intensive. A pgvector similarity search under load spikes CPU on the Postgres instance. If your application connection pool allows enough concurrent vector queries, you can saturate the database CPU and degrade all other queries.

Workload isolation: vector queries should run through a connection pool separate from the one transactional queries use.

Practically, this means:

  • A pgvector pool with 5-10 connections, strict timeouts (500ms query timeout)
  • A transactional pool with 20-50 connections, longer timeouts

Vector queries that exceed 500ms are either hitting a missing index or are under unusual load. Both are worth alerting on. Don't let them run indefinitely.
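Concretely, the split can be pinned down in configuration so it cannot drift. A minimal sketch (the numbers follow the pools above; the types and names are illustrative, not a real pooling library's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PoolConfig:
    max_connections: int
    statement_timeout_ms: int  # enforced via Postgres statement_timeout

# Two pools, per the split above: vector queries are CPU-heavy and
# strictly bounded; transactional queries get headroom and patience.
VECTOR_POOL = PoolConfig(max_connections=8, statement_timeout_ms=500)
TRANSACTIONAL_POOL = PoolConfig(max_connections=40, statement_timeout_ms=5000)

def pool_for(query_kind: str) -> PoolConfig:
    """Route a query to its pool: 'vector' for similarity search, else OLTP."""
    return VECTOR_POOL if query_kind == "vector" else TRANSACTIONAL_POOL
```

The point of the routing function is that no call site picks a pool ad hoc; a vector query physically cannot borrow a transactional connection.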

Use PgBouncer in transaction mode for your connection pooling. Session mode holds a server connection for the duration of a client session, which is wasteful for an API with short-lived requests. Note that transaction mode does not preserve session-level settings, so parameters like hnsw.ef_search must be set with SET LOCAL inside the transaction that runs the query.
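Because transaction pooling recycles server connections between transactions, a per-query ef_search is best expressed with SET LOCAL. A sketch, assuming a psycopg-style connection object (the helper name is illustrative):

```python
# Run a similarity query with a per-transaction ef_search.
# SET LOCAL scopes the setting to this transaction, so it cannot
# leak to the next client that borrows the pooled connection.
def run_similarity(conn, query_embedding, ef_search=200, limit=20):
    with conn.transaction():
        # SET takes no bind parameters; int() keeps the interpolation safe.
        conn.execute("SET LOCAL hnsw.ef_search = %s" % int(ef_search))
        return conn.execute(
            "SELECT id, content, embedding <=> %s AS score "
            "FROM memories ORDER BY score LIMIT %s",
            (query_embedding, limit),
        ).fetchall()
```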

VI – What Breaks First

Full-table vector scans under filter-heavy queries. The query planner decides the relational filter is cheap and skips the vector index. Result: similarity queries that take 800ms on a 2M row table. Fix: use EXPLAIN ANALYZE on every vector query you write, before shipping. If it's not using the index, it's a problem.
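The EXPLAIN check can be automated so it runs in CI rather than after an incident. A sketch (the index name follows the naming convention used earlier in this post; the strings match Postgres's EXPLAIN text output):

```python
# CI guard: fail the build if a similarity query's plan is a
# sequential scan instead of a vector index scan.
VECTOR_INDEX_NAME = "idx_memories_embedding_hnsw"

def uses_vector_index(explain_output: str) -> bool:
    """True if the EXPLAIN plan scans the vector index, not the heap."""
    return (
        f"Index Scan using {VECTOR_INDEX_NAME}" in explain_output
        and "Seq Scan on memories" not in explain_output
    )
```

Feed it the text of EXPLAIN (ANALYZE) for each vector query in a test suite and assert the result.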

Embedding table drift from source records. Content gets updated. The embedding is not re-generated. The vector index now represents the old version of the content. Queries about the new content return nothing or return stale results. Fix: track embedded_at and updated_at on every row. The background job that re-embeds content should process rows where updated_at > embedded_at.
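The re-embedding sweep described above can be a small batch job. A sketch, assuming a psycopg-style connection and an embed() function (both illustrative); column names follow the post:

```python
# Select rows whose content changed after their embedding was generated.
STALE_ROWS_SQL = """
    SELECT id, content
    FROM memories
    WHERE embedded_at IS NULL OR updated_at > embedded_at
    ORDER BY updated_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED  -- lets multiple workers sweep concurrently
"""

UPDATE_SQL = """
    UPDATE memories
    SET embedding = %(embedding)s, embedded_at = now()
    WHERE id = %(id)s
"""

def reembed_batch(conn, embed, batch_size=100):
    """One sweep: re-embed rows whose content changed since embedding."""
    with conn.transaction():  # the row locks above live for this transaction
        rows = conn.execute(STALE_ROWS_SQL, {"batch_size": batch_size}).fetchall()
        for row_id, content in rows:
            conn.execute(UPDATE_SQL, {"embedding": embed(content), "id": row_id})
        return len(rows)
```

Run it on a schedule; a return value of 0 means the table has converged.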

Pool starvation under mixed workloads. A burst of vector queries consumes all connections in the pool. Transactional queries queue and time out. API requests start failing. Fix: separate pools with separate limits. Vector queries are bounded to their pool. Transactional queries are never starved by vector workloads.

pgvector Performance Checklist

  • HNSW or IVFFlat index created on the embedding column
  • Distance metric at index build time matches distance metric at query time
  • EXPLAIN ANALYZE run on all vector queries — confirms index usage
  • ef_search tuned based on recall requirements
  • Separate connection pool for vector queries
  • embedded_at column tracks when the embedding was last generated
  • Background job re-embeds rows where updated_at > embedded_at
  • Autovacuum scale factor tuned for the embeddings table
  • Bloat monitoring alert at 15% dead tuple ratio
  • Vector query latency SLO defined and alerted

Scaling Escape Hatch

When pgvector can no longer serve your query latency requirements:

  1. Stand up a dedicated vector database alongside Postgres
  2. Write new embeddings to both systems (dual-write)
  3. Backfill historical embeddings from Postgres to the dedicated system
  4. Migrate reads to the dedicated system
  5. Drop the embedding column from Postgres

This migration is cleanest when your vector query layer is behind an interface. If your application code calls vectorDB.query() rather than db.RawQuery(pgvectorSQL), the migration is a configuration change. If it calls the pgvector SQL directly everywhere, the migration touches every query site.

Abstract early. Migrate later, only when the data demands it.
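The seam this migration depends on can be small. A sketch of the interface and the dual-write step (names are illustrative, not from any particular library):

```python
from typing import Protocol, Sequence

class VectorStore(Protocol):
    """The interface behind which pgvector and a dedicated store swap."""
    def upsert(self, doc_id: str, embedding: Sequence[float]) -> None: ...
    def query(self, embedding: Sequence[float], limit: int = 20) -> list[str]: ...

class DualWriteStore:
    """Migration step 2: write to both systems, read from the primary."""
    def __init__(self, primary: VectorStore, secondary: VectorStore):
        self.primary = primary
        self.secondary = secondary

    def upsert(self, doc_id: str, embedding: Sequence[float]) -> None:
        self.primary.upsert(doc_id, embedding)
        self.secondary.upsert(doc_id, embedding)

    def query(self, embedding: Sequence[float], limit: int = 20) -> list[str]:
        return self.primary.query(embedding, limit)
```

Flipping reads to the dedicated system (step 4) is then just swapping which store is primary; dropping the Postgres column (step 5) removes the secondary.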
