
Your Retrieval Is Working. Your Context Isn't.

Search output is not prompt-ready context. Assembly is a first-class subsystem.

The retrieval pipeline is healthy. The vector search is returning relevant results. The fallback chain is tuned. The query is hitting the right records.

And the model still doesn't know what the user is talking about.

This is a context assembly problem. Not a retrieval problem.

Most teams treat context assembly as plumbing — a step between retrieval and the model call, where you concatenate some strings and move on. That's the mistake. Context assembly is a first-class subsystem with its own failure modes, its own observability requirements, and its own design decisions that materially affect model output quality.

This post covers what a production context assembly system looks like, how to design it, and what breaks when you don't.

I – What Context Assembly Is Actually Doing

Retrieval returns a list of candidates ranked by relevance. Context assembly takes that list and produces a single, formatted, token-budgeted string that will be injected into a prompt.

The problems that assembly must solve:

  1. Budget enforcement. The model has a context window. The retrieved candidates may exceed it. Something must be dropped, and the dropping must be principled.

  2. Ordering for coherence. The order in which items appear in the prompt affects model behavior. The model attends more strongly to content at the beginning and end. The middle gets less attention. Important items need to be placed accordingly.

  3. Format contracts. The model needs to understand what it's reading. Is this a list of memories? A set of documents? A timeline? The format must be consistent and parseable.

  4. Determinism. Given the same inputs, the assembly must produce the same output. Non-deterministic assembly produces non-deterministic model behavior. This is difficult to debug and impossible to reproduce.

  5. Safety. The assembled context must not include content that should not be in this prompt — content from other users, expired items, quarantined items, items marked as sensitive.

These are not trivial problems. Solving them correctly once, in one place, is substantially better than discovering them broken in production.

II – Priority Tiers

Not all retrieved items are equal. Define priority tiers and enforce them in the assembly logic.

Pinned. Items explicitly marked as always-include. System-level context, user preferences, standing instructions. These are included first, regardless of relevance score. They are never dropped, even when the budget is tight. If pinned items alone exceed the budget, that is a configuration error.

Verified. High-confidence, high-importance items. The user has explicitly marked them or the system has inferred high importance. These are included after pinned items and are dropped only as a last resort.

Candidate. Standard retrieved items ranked by relevance. These fill the remaining budget after pinned and verified items are placed. They are dropped first, from lowest relevance score upward, when the budget is exceeded.

The tier assignment happens before assembly, not during. Retrieval returns candidates. The assembly step receives a list where each item already has a tier label and a relevance score. Assembly only needs to make placement and truncation decisions, not importance decisions.
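The separation can be made concrete by carrying the tier on each item, so assembly receives importance decisions ready-made. A minimal sketch, assuming a hypothetical `ScoredItem` record (the field names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ScoredItem:
    # Tier is assigned upstream, before assembly ever sees the item.
    id: str
    content: str
    tier: str  # "pinned" | "verified" | "candidate"
    relevance_score: Optional[float] = None  # None for pinned items

def split_by_tier(items: list[ScoredItem]) -> dict[str, list[ScoredItem]]:
    # Assembly only partitions by the existing label; it never
    # re-evaluates importance.
    buckets: dict[str, list[ScoredItem]] = {
        "pinned": [], "verified": [], "candidate": []
    }
    for item in items:
        buckets[item.tier].append(item)
    return buckets
```

With this shape, the assembly step reduces to placement and truncation over three pre-labeled buckets.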

III – Token Estimation and Strict Budgeting

The context window is a hard limit. Exceeding it is an API error. Working right up to the edge is risky — leave headroom for the system prompt, the user message, and the model's output.

A practical budget allocation for a 128K context window:

Total context window:      128,000 tokens
System prompt:               2,000 tokens
User message:                1,000 tokens
Model output headroom:       4,000 tokens
                           ---------
Available for retrieved
  context:                 121,000 tokens
Safety margin (10%):        12,100 tokens
                           ---------
Effective context budget:  108,900 tokens

Never fill to the hard limit. Use 85-90% of the available space. The 10-15% margin absorbs estimation errors and prompt variations.

Token estimation must be fast and local — no API calls. Use a tokenizer library that matches your model's tokenizer. For OpenAI models, tiktoken. For Anthropic models, approximate with the rule len(text) / 3.5 (characters to tokens). The approximation is acceptable for budget planning; exact counts are not worth the latency.

import math

def estimate_tokens(text: str) -> int:
    # Approximation: ~3.5 characters per token for English text; round up
    return math.ceil(len(text) / 3.5)

def fits_in_budget(items: list, budget: int) -> bool:
    # +20 tokens per item covers formatting overhead (tags, newlines)
    total = sum(estimate_tokens(item.content) + 20 for item in items)
    return total <= budget

IV – Truncation Strategy

When the budget is exceeded, you must drop items. The truncation strategy determines which items get dropped.

Wrong approach: drop items from the end of the ranked list. This is a tempting first pass, but it ignores tier assignments: a verified item with a middling relevance score can be dropped while a low-value candidate survives.

Right approach:

  1. Start with all pinned items. If they exceed budget: hard error (configuration problem).
  2. Add all verified items in order of relevance score. If adding a verified item exceeds budget: stop adding verified items, log a verified_item_dropped event.
  3. Fill remaining budget with candidate items in order of relevance score.
  4. If no candidate items fit: log no_candidates_fit, return pinned + verified only.

The verified_item_dropped event is important. If you're regularly dropping verified items, your budget is too small or your verified item set is too large. This is a tuning signal that should surface in your monitoring.

Never silently drop important items. Every drop should emit a telemetry event.
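The four steps above can be sketched as a single pass over pre-tiered items. This is one possible shape, not a prescribed implementation: the item attributes, the `log_event` callback, and the extra `candidate_dropped` event name are all assumptions made for illustration.

```python
def truncate_to_budget(items, budget, estimate_tokens, log_event):
    # items expose .id, .tier, .relevance_score, .content;
    # log_event(name, item_id) is any telemetry sink.
    pinned = [i for i in items if i.tier == "pinned"]
    verified = sorted((i for i in items if i.tier == "verified"),
                      key=lambda i: (-i.relevance_score, i.id))
    candidates = sorted((i for i in items if i.tier == "candidate"),
                        key=lambda i: (-i.relevance_score, i.id))

    used = sum(estimate_tokens(i.content) for i in pinned)
    if used > budget:
        # Step 1: pinned items are never dropped; overflow is a config error.
        raise ValueError("pinned items alone exceed budget")

    included = list(pinned)
    for item in verified:
        cost = estimate_tokens(item.content)
        if used + cost > budget:
            # Step 2: stop adding verified items and surface the drop.
            log_event("verified_item_dropped", item.id)
            break
        included.append(item)
        used += cost

    added_any = False
    for item in candidates:
        cost = estimate_tokens(item.content)
        if used + cost > budget:
            # Drops are never silent, even for ordinary candidates.
            log_event("candidate_dropped", item.id)
            continue
        included.append(item)
        used += cost
        added_any = True
    if candidates and not added_any:
        # Step 4: nothing beyond pinned + verified fit.
        log_event("no_candidates_fit", None)
    return included
```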

V – Ordering Policy

The order of items in the assembled context is not just aesthetic. The model's attention distribution varies by position.

For most retrieval-augmented use cases, a reliable ordering policy is:

  1. Pinned items first. Standing instructions, user preferences, system context.
  2. Recency-ordered verified items. Most recent first, so the model sees current state before historical state.
  3. Relevance-ordered candidate items. Highest relevance first, to ensure the best matches are seen before the budget runs out.

This is not the only valid policy. For a timeline reconstruction use case, chronological ordering of all items might be more appropriate. For a document-style context, grouping by source might work better.

The policy must be explicit and documented. "We order items by relevance score" is not a complete policy. Document the tier ordering, the within-tier ordering, and the rationale.
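The three-rule default policy above can be written as a pure function. A sketch, assuming items carry a `created_at` datetime and a `relevance_score`; the id tie-break keeps the ordering deterministic:

```python
from datetime import datetime

def order_for_prompt(pinned, verified, candidates):
    # 1. Pinned first, in their given (stable) order.
    # 2. Verified by recency, most recent first.
    # 3. Candidates by relevance, highest first.
    # Ties resolve by id so repeated calls agree.
    verified_sorted = sorted(
        verified, key=lambda i: (-i.created_at.timestamp(), i.id))
    candidates_sorted = sorted(
        candidates, key=lambda i: (-i.relevance_score, i.id))
    return list(pinned) + verified_sorted + candidates_sorted
```

Swapping the policy (chronological, grouped by source) means swapping this one function, which keeps the decision documented in code.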

VI – Formatting Contracts

The model needs to understand the structure of what it's reading. A flat concatenation of text chunks is not a format contract.

A minimal format contract:

<memory id="mem_abc123" importance="high" created="2024-01-15">
User prefers detailed technical explanations. Has background in distributed systems.
</memory>

<memory id="mem_def456" importance="medium" created="2024-02-03">
Working on a rate limiting implementation for the API. Using Redis for counters.
</memory>

Each item has:

  • An identifier (for debugging and attribution)
  • A tier/importance signal
  • A timestamp (for temporal reasoning)
  • The content itself

The format is consistent across all items. The model can learn the structure and reason about it. IDs enable you to trace which items influenced a model response if you log the assembled context.

Do not vary the format between retrieval paths. Items from vector search and items from lexical fallback should use the same format in the assembled context.
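A formatter that enforces this contract can be a single shared function. A sketch, where the attribute names `id`, `importance`, `created`, and `content` mirror the example above but are otherwise an assumption:

```python
def format_item(item) -> str:
    # Render one item into the shared format contract.
    return (
        f'<memory id="{item.id}" importance="{item.importance}" '
        f'created="{item.created}">\n{item.content}\n</memory>'
    )

def format_context(items) -> str:
    # One formatter for every retrieval path: vector hits and lexical
    # fallback hits pass through the same function, so the model sees
    # a single consistent structure.
    return "\n\n".join(format_item(i) for i in items)
```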

VII – Determinism Requirements

The same query with the same stored data should produce the same assembled context. Every time.

Non-determinism enters through:

  • Random ordering of tied scores (sort by (score, id) to break ties deterministically)
  • Token estimation variance (use a fixed estimation function, not one that varies by input)
  • Items added or removed between calls (snapshot the retrieval results before assembly)
  • Time-dependent ordering (if sorting by recency, use the same timestamp reference throughout the assembly call)
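The fix for the first source is small enough to show directly. A sketch over plain dicts with `score` and `id` keys (illustrative names):

```python
def stable_rank(items):
    # Sort by descending score, then by id: equal scores always resolve
    # the same way, so repeated calls produce identical orderings
    # regardless of input order.
    return sorted(items, key=lambda i: (-i["score"], i["id"]))
```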

Determinism matters for:

  • Debugging. You cannot reproduce a model behavior bug if the context varies each time.
  • Testing. Snapshot tests of context assembly require deterministic output.
  • User trust. A user who asks the same question twice and gets different answers loses trust in the system, even if both answers are technically correct.

Test determinism explicitly:

def test_context_pack_is_deterministic():
    items = get_test_items()
    pack1 = assemble_context(items, budget=4000)
    pack2 = assemble_context(items, budget=4000)
    assert pack1 == pack2

This test should be in your CI pipeline, running on every change to the assembly logic.

VIII – What Breaks First

Over-budget context causing API failures. The most obvious failure. Token estimation was off. The assembled context exceeds the model's context window. The API returns a context length error. The error is confusing because retrieval succeeded. The fix is tighter budget margins and better estimation. The detection is monitoring the token count of assembled contexts and alerting when they exceed 90% of the budget.

Important items dropped due to naive ordering. A user stores a high-importance preference: "Never recommend X framework." It's marked as candidate because it was added early and the system doesn't re-evaluate importance. It falls off the end of the list when the budget is tight. The model recommends X. The user is frustrated. The fix: verified and pinned tiers exist for a reason. Promote items that should never be dropped.

Non-deterministic packs causing inconsistent outputs. Two consecutive identical queries produce different model outputs. The retrieved items are the same, but the assembly order differs because of a sort tie that resolves differently each call. The fix: deterministic sort keys everywhere.

Context Pack Schema

{
  "pack_id": "cpk_abc123",
  "query": "what framework should I use?",
  "assembled_at": "2024-03-01T14:22:31Z",
  "budget_tokens": 4000,
  "used_tokens": 3847,
  "items": [
    {
      "id": "mem_abc",
      "tier": "pinned",
      "relevance_score": null,
      "tokens": 42,
      "included": true
    },
    {
      "id": "mem_def",
      "tier": "candidate",
      "relevance_score": 0.91,
      "tokens": 156,
      "included": true
    }
  ],
  "dropped_count": 3,
  "drop_reasons": ["budget_exceeded"],
  "retrieval_mode": "vector_primary"
}

Log this structure for every assembly call. It is the most useful debugging artifact you can produce for context-related model behavior issues.
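A builder for this structure might look like the following sketch. The field names follow the schema above; the item attributes and the `(item_id, reason)` shape for drops are assumptions:

```python
from datetime import datetime, timezone

def build_context_pack(pack_id, query, items, budget_tokens,
                       retrieval_mode, dropped):
    # items: included items exposing .id, .tier, .relevance_score, .tokens.
    # dropped: list of (item_id, reason) pairs for everything excluded.
    return {
        "pack_id": pack_id,
        "query": query,
        "assembled_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "budget_tokens": budget_tokens,
        "used_tokens": sum(i.tokens for i in items),
        "items": [
            {"id": i.id, "tier": i.tier,
             "relevance_score": i.relevance_score,
             "tokens": i.tokens, "included": True}
            for i in items
        ],
        "dropped_count": len(dropped),
        # Deduplicated, sorted for deterministic output.
        "drop_reasons": sorted({reason for _, reason in dropped}),
        "retrieval_mode": retrieval_mode,
    }
```

Emitting this at the end of every assembly call costs one dictionary; reconstructing it after the fact costs an afternoon.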

The assembled context is not a side effect of retrieval. It is a product in its own right. Design it accordingly.
