Vector Search Is Not a Retrieval Strategy. It's One Piece of One.
Why reliable retrieval needs layered fallback chains, not vector-only optimism
Every team building a RAG system discovers the same thing eventually.
Vector search returns empty results on data you know is there. The user asks about something they stored yesterday. The vector index returns nothing. The model says it doesn't know. The user loses trust in the system.
Or worse: the vector search returns results, but they're stale. The user edited the content two weeks ago. The embedding was never updated. The retrieved context contradicts what the user currently believes.
Vector search is a powerful primitive. It is not, by itself, a retrieval strategy. A retrieval strategy accounts for when vector search fails, what to fall back to, how to combine signals, and what to do when your embedding provider is down.
This post covers what a production retrieval pipeline actually looks like.
I – Retrieval Objectives
Before building anything, name what you're optimizing for. There are four objectives, and they conflict.
Precision: the retrieved results are relevant to the query. High precision means fewer irrelevant results.
Recall: all relevant items are retrieved. High recall means fewer missed relevant results.
Latency: the retrieval completes quickly. Under 200ms for a user-facing system. Under 50ms for an inline agent tool.
Cost: the retrieval doesn't consume more resources than the value it provides. Embedding API calls, vector index queries, and re-ranking models all have costs.
Precision and recall trade against each other. Latency and recall trade against each other. A system that optimizes only for one of these will fail at the others.
Name your priorities before you build. For most agent memory systems: precision first (wrong results are worse than no results), recall second, latency third, cost last. Then build the system that serves those priorities.
II – The Semantic Search Path
The primary retrieval path uses embeddings and vector similarity.
The flow:
- Embed the query using the same model that embedded the stored content
- Query the vector index with the embedding and a similarity threshold
- Apply hard filters: project scope, status (active, not deleted), content type
- Return top-K candidates ranked by similarity score
The critical constraint: the embedding model must match. If stored content was embedded with text-embedding-ada-002 and the query is embedded with text-embedding-3-small, the similarity scores are meaningless. The dimension counts don't even match. This is an obvious mistake that is surprisingly easy to make when upgrading embedding models.
The filter ordering matters. Apply hard filters before the vector search if your vector database supports pre-filtering (most do). Filtering after the vector search means you've wasted similarity computation on records you would have excluded anyway. On large datasets, post-filtering also means your top-K candidates are depleted before the filter, and you might return fewer results than requested.
query_embedding = embed(query_text)
results = vector_index.query(
    embedding=query_embedding,
    top_k=50,
    filters={
        "project_id": context.project_id,
        "status": "active",
        "deleted_at": None,
    },
    min_score=0.7,
)
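The model-matching constraint above is worth enforcing in code rather than by convention. A minimal sketch of a query-time guard, assuming hypothetical names (`EXPECTED_MODEL`, `EXPECTED_DIM`, `checked_embed`) and an illustrative model/dimension pair:

```python
# Guard against mixing embedding models between indexing and querying.
# The model name and dimension are examples; use whatever your index
# was actually built with.
EXPECTED_MODEL = "text-embedding-3-small"
EXPECTED_DIM = 1536

def checked_embed(embed_fn, query_text, model):
    """Embed a query, refusing to proceed if the model or dimension
    doesn't match what the stored vectors were built with."""
    if model != EXPECTED_MODEL:
        raise ValueError(f"query model {model!r} != index model {EXPECTED_MODEL!r}")
    vector = embed_fn(query_text)
    if len(vector) != EXPECTED_DIM:
        raise ValueError(f"dimension {len(vector)} != expected {EXPECTED_DIM}")
    return vector
```

Failing loudly here is the point: a silent model mismatch produces similarity scores that look plausible and mean nothing.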
III – Lexical Fallback
Vector search fails in specific and predictable ways.
It fails for exact-match queries. "Show me the record I created with the title 'Project Alpha Q3 plan'" is not well-served by semantic similarity. The user knows the exact title. A keyword search finds it instantly. Vector search might not find it at all.
It fails for new content. A record created 10 minutes ago might not have been embedded yet if embedding is async. The vector index doesn't have it. The record exists. Vector search returns empty.
It fails for content with precise identifiers. UUIDs, code snippets, specific version numbers. These don't encode meaningfully into semantic vectors.
Lexical fallback handles all of these:
SELECT * FROM memories
WHERE project_id = $projectId
AND status = 'active'
AND deleted_at IS NULL
AND to_tsvector('english', content) @@ plainto_tsquery('english', $query)
ORDER BY ts_rank(to_tsvector('english', content), plainto_tsquery('english', $query)) DESC
LIMIT 20
Postgres full-text search is free and fast, provided you add a GIN index on the tsvector expression so the query above doesn't fall back to a sequential scan. You already have Postgres. Use it as your lexical fallback before reaching for Elasticsearch or a dedicated search service.
The fallback is triggered when vector search returns fewer results than your minimum threshold (typically 3-5 results for most applications).
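The trigger logic is a short chain. A sketch, assuming `vector_search` and `lexical_search` are stand-ins for the two paths described above, with `MIN_RESULTS` set to the low end of the 3-5 threshold:

```python
MIN_RESULTS = 3  # below this, fall through to the next retrieval path

def retrieve(query, vector_search, lexical_search):
    """Two-tier chain: semantic first, lexical when the vector path
    comes back too thin."""
    results = vector_search(query)
    if len(results) >= MIN_RESULTS:
        return results, "vector"
    lexical = lexical_search(query)
    # Merge rather than discard: the few vector hits may still be relevant.
    seen = {r["id"] for r in results}
    merged = results + [r for r in lexical if r["id"] not in seen]
    return merged, "lexical_fallback"
```

Returning the mode alongside the results makes the fallback rate trivially observable downstream.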
IV – Broad Fallback for Sparse Datasets
Small projects and new users have sparse data. A project with 10 memories has a small vector space. Similarity thresholds that work well at 10,000 records will return nothing at 10 records.
Broad fallback: when both vector and lexical search return nothing, fall back to recency-ordered retrieval.
SELECT * FROM memories
WHERE project_id = $projectId
AND status = 'active'
AND deleted_at IS NULL
ORDER BY
importance DESC,
created_at DESC
LIMIT 10
This returns something useful even when the query can't be matched. For agent context — where some context is almost always better than no context — this is the right behavior. Log when broad fallback is triggered. A high rate of broad fallback is a signal that your primary retrieval paths are misconfigured or that the dataset is too sparse to serve the use case.
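Tracking that rate doesn't need infrastructure to start. A minimal sketch, using a `Counter` as a stand-in for a real metrics client:

```python
from collections import Counter

retrieval_modes = Counter()  # in production: a metrics/telemetry client

def record_mode(mode):
    """Call once per retrieval with 'vector', 'lexical_fallback', or 'broad_fallback'."""
    retrieval_modes[mode] += 1

def broad_fallback_rate():
    """Fraction of retrievals that hit the broad fallback path."""
    total = sum(retrieval_modes.values())
    return retrieval_modes["broad_fallback"] / total if total else 0.0
```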
V – Re-Ranking and Score Harmonization
Vector search, lexical search, and recency ordering produce scores in different units that cannot be directly compared.
A vector similarity score of 0.85 and a BM25 lexical score of 12.3 do not mean the same thing. Combining them requires normalization and a blending policy.
The practical approach for most systems:
- Collect candidates from each retrieval path independently (vector: top 20, lexical: top 20)
- Deduplicate by ID
- Score each candidate using a weighted combination:
final_score = (0.6 * normalized_vector_score) + (0.3 * normalized_lexical_score) + (0.1 * recency_score)
- Sort by final_score, return top-K
The weights are not magic numbers. Start with these defaults and tune based on your quality evaluation data. Track user feedback signals (which recalled items get used, which get ignored) and adjust.
A more sophisticated approach is a cross-encoder re-ranking model that scores (query, document) pairs directly. This is more expensive but substantially improves precision. Add it after you have the basic fallback chain working, not before.
VI – Circuit Breakers for Embedding Provider Failures
Your embedding provider will be unavailable. OpenAI has had outages. Cohere has had outages. Every provider has had outages.
If your retrieval pipeline requires a live embedding API call for every query, every provider outage is a full retrieval outage.
Two mitigations:
Query-time embedding cache. Cache embedding results for identical query strings. A recall("latest project notes") that gets called 100 times in a session only hits the embedding API once.
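An in-process version of that cache is one decorator. A sketch, with `make_cached_embedder` as a hypothetical wrapper around your real provider call:

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=4096):
    """Wrap a provider embed call with an in-process cache keyed on the
    exact query string. Returns tuples so cached values are immutable."""
    @lru_cache(maxsize=maxsize)
    def cached(query_text):
        return tuple(embed_fn(query_text))
    return cached
```

For multi-process deployments the same idea moves to a shared cache (e.g. Redis keyed on a hash of the query string), but the session-level win shown here is the cheap one.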
Degrade gracefully to lexical-only. If the embedding API returns an error or times out (use a 500ms timeout, not the SDK default), fall back to lexical search without embedding. Log the degradation event. Return results with a retrieval_mode: "degraded_lexical" flag so callers know.
if embedding_provider.healthy():
    results = vector_search(embed(query))
else:
    emit("retrieval_degraded", reason="embedding_provider_unavailable")
    results = lexical_search(query)
Lexical search during embedding provider outages means you return something useful rather than an error. For most queries, lexical results are good enough. The user doesn't see an error. The system degrades gracefully.
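The timeout half of the mitigation can be sketched with a thread pool; `embed_with_timeout` is a hypothetical helper, and the caller treats `None` as the signal to degrade:

```python
import concurrent.futures

# Shared pool so a slow provider call doesn't block shutdown of a
# per-request executor.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def embed_with_timeout(embed_fn, query_text, timeout_s=0.5):
    """Return the embedding, or None if the provider errors or exceeds
    the 500ms cap; the caller then falls back to lexical-only."""
    future = _pool.submit(embed_fn, query_text)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        return None
```

Note the explicit `timeout_s` default: SDK defaults are often 60 seconds or more, which turns a provider brownout into a user-visible hang.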
VII – What Breaks First
Empty results with verified data present. The most confusing failure. The user stored something. They ask for it. Nothing is returned. Root cause: almost always a filter issue. The status filter is excluding active records. The project scope is wrong. The similarity threshold is too high. Start debugging with broad fallback turned on and no similarity threshold. If that returns results, tighten filters one at a time until the bug appears.
Vector quality drift after content edits. A user edits a stored memory. The text changes. The embedding in the vector index is now stale — it reflects the old content. Retrieval works correctly for the old query but not for queries about the new content. Fix: re-embed on every significant content update. Define "significant" (>20 character change, different nouns). Queue re-embedding as an async job.
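The ">20 character change" heuristic can be implemented with stdlib diffing. A sketch, with `needs_reembedding` as a hypothetical gate in front of the async re-embed queue:

```python
import difflib

def needs_reembedding(old_text, new_text, min_delta=20):
    """Queue a re-embed when the edit touches more than min_delta
    characters, measured as the sum of non-matching diff spans."""
    matcher = difflib.SequenceMatcher(None, old_text, new_text)
    changed = sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )
    return changed > min_delta
```

The "different nouns" half of the heuristic needs a tokenizer or POS tagger and is deliberately omitted here; character delta alone catches most meaningful edits.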
Latency spikes from fallback fan-out. When the primary path misses and both fallback paths are triggered sequentially, latency doubles or triples. Run fallback paths concurrently where possible. Start lexical search at the same time as vector search. Use the first path that returns sufficient results. This requires careful result merging but is worth the complexity at any meaningful scale.
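The concurrent fan-out can be sketched with asyncio; `race_retrieval` is a hypothetical helper, and both search arguments are assumed to be async callables returning lists of `{id: ...}` records:

```python
import asyncio

MIN_RESULTS = 3

async def race_retrieval(vector_search, lexical_search, query):
    """Start both paths at once; prefer vector results when sufficient,
    otherwise merge in the lexical hits that are already in flight."""
    vec_task = asyncio.create_task(vector_search(query))
    lex_task = asyncio.create_task(lexical_search(query))
    vec = await vec_task
    if len(vec) >= MIN_RESULTS:
        lex_task.cancel()  # vector path sufficed; stop the lexical query
        return vec
    lex = await lex_task
    seen = {r["id"] for r in vec}
    return vec + [r for r in lex if r["id"] not in seen]
```

The latency cost of a vector miss drops from the sum of both paths to the max of them, at the price of occasionally running a lexical query you end up cancelling.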
Retrieval Decision Graph
query received
|
v
embed query ──(timeout/error)──> lexical_search_only ──> return + log degraded
|
v
vector_search(embedding, filters)
|
+─── results >= min_threshold ──> re_rank ──> return
|
v
lexical_search(query_text, filters)
|
+─── results >= min_threshold ──> re_rank ──> return
|
v
broad_fallback(recency + importance, no query)
|
v
return (possibly empty) + log
Failure Budget Policy
- Vector search miss rate > 5%: investigate embedding freshness
- Lexical fallback trigger rate > 20%: investigate vector index health
- Broad fallback trigger rate > 5%: investigate dataset density and filter logic
- Embedding provider timeout rate > 1%: review timeout settings and provider SLA
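The budget policy above is small enough to encode directly. A sketch, with `budget_violations` as a hypothetical check you would wire into alerting:

```python
FAILURE_BUDGETS = {  # thresholds from the policy above
    "vector_miss_rate": 0.05,
    "lexical_fallback_rate": 0.20,
    "broad_fallback_rate": 0.05,
    "embed_timeout_rate": 0.01,
}

def budget_violations(observed):
    """Return the names of metrics over budget. `observed` maps
    metric name -> measured rate; missing metrics count as 0."""
    return [name for name, limit in FAILURE_BUDGETS.items()
            if observed.get(name, 0.0) > limit]
```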
Build the fallback chain. Test it. Monitor the fallback rates. The fallback rate is one of the most informative signals about your retrieval system's health — and it's the one nobody tracks until something goes visibly wrong.