Why does retrieval latency spike hardest at cold start instead of under load?

Cold start hits every stage of the pipeline at once: no warm embedding cache, no in-memory index segments, and no prefetched candidate pool. Under sustained load you at least benefit from cached embeddings and hot index pages. Cold start forces full recomputation from scratch, which is why first-query latency can be 3-5x higher than steady-state P50.

What's the difference between approximate nearest neighbor search and exact kNN for agent memory?

Approximate nearest neighbor (ANN) search trades a small accuracy loss for large speed gains by using graph-based or tree-based indexes that avoid comparing the query against every stored vector. Exact kNN computes the distance to every vector in your index, so its cost grows linearly with index size and becomes too slow for tight latency budgets as the collection grows large. Production systems use ANN for the initial candidate retrieval, then optionally rerank with exact scoring on the top results.

How do I know if index fragmentation is killing my search performance?

Track your vector search latency trend over time alongside write volume. If P95 latency degrades 40-60% over weeks without traffic growth, you've got fragmentation. Most vector DBs expose a segment count metric—watch for it climbing without corresponding merges. The fix is usually forcing index consolidation or switching to a database that handles segment merging automatically.
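
As a rough illustration, that check can be written as a small function. This is a sketch only: `looks_fragmented`, its parameters, and the 10% traffic tolerance are assumptions made for the example, and the 40% latency threshold is simply the low end of the range above. The segment counts would come from whatever metric your database exposes.

```python
def looks_fragmented(p95_then_ms, p95_now_ms, qps_then, qps_now,
                     segments_then, segments_now,
                     latency_threshold=0.40, traffic_tolerance=0.10):
    """Heuristic: P95 got much worse, traffic did not grow, segments piled up."""
    latency_growth = (p95_now_ms - p95_then_ms) / p95_then_ms
    traffic_growth = (qps_now - qps_then) / qps_then
    return (
        latency_growth >= latency_threshold      # e.g. 80ms -> 120ms is +50%
        and traffic_growth <= traffic_tolerance  # load is roughly flat
        and segments_now > segments_then         # unmerged segments accumulating
    )
```

Comparing two windows a few weeks apart keeps the check cheap. A real alert would likely also want a minimum sample size per window, which this sketch leaves out.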

Should I cache embeddings or cache search results for lower latency?

Cache embeddings if your queries repeat exactly and you can skip the encoding step entirely (5-10ms savings). Cache search results if your query patterns vary but land on the same memory slice repeatedly (40-80ms savings including the vector search itself). For most agents, result caching wins because natural language queries vary in phrasing even when targeting the same information.

Can I use a smaller embedding model to hit sub-100ms retrieval targets?

Yes, but dimension count matters more than model size. A 384-dimension embedding from a fast encoder like all-MiniLM-L6-v2 encodes in under 20ms and searches faster than 1536-dimension models, but you sacrifice 10-15% recall quality. For voice AI where speed is non-negotiable, the tradeoff usually works. For complex reasoning agents, stick with larger embeddings and optimize elsewhere.

What causes thundering herd failures in multi-agent memory systems?

Thundering herd happens when multiple agent sessions start simultaneously and all query memory at once, overwhelming your vector DB with concurrent requests it can't parallelize. The P99 latency collapses even though P50 looks fine because the database serializes requests under contention. Rate limiting at the orchestration layer or staggering session initialization by a few hundred milliseconds prevents the collapse.

Vector search vs keyword search: which one should run first in hybrid retrieval?

Run keyword search first when you can extract high-signal terms (entity names, technical keywords, exact phrases) because it short-circuits the expensive vector operation entirely. Run vector search first when the query is conversational or lacks strong keyword signals. Supermemory runs them in parallel and merges results with context-aware reranking to avoid forcing a sequential dependency.
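
A minimal sketch of the parallel pattern, assuming two placeholder search functions with hard-coded results and made-up latencies. The merge step uses reciprocal rank fusion, a common general-purpose technique chosen for the example; it is not a description of Supermemory's reranker.

```python
import asyncio


async def keyword_search(query):
    await asyncio.sleep(0.01)  # placeholder latency (assumption)
    return ["doc-7", "doc-2", "doc-9"]


async def vector_search(query):
    await asyncio.sleep(0.03)  # placeholder latency (assumption)
    return ["doc-2", "doc-5", "doc-7"]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score, then by id so ties come out in a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


async def hybrid_search(query):
    # Both searches run concurrently, so the wait is the slower of the two
    # rather than their sum.
    keyword_hits, vector_hits = await asyncio.gather(
        keyword_search(query), vector_search(query)
    )
    return reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Calling `asyncio.run(hybrid_search("refund policy"))` returns one merged ranking in which documents found by both searches rise to the top.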

How does reranking add 50-150ms when it only scores 20-50 candidates?

Reranking isn't just scoring—it's running a second model inference pass (usually a cross-encoder) that computes relevance between the query and each candidate document. That's 20-50 forward passes through a transformer, not a simple vector dot product. The latency scales with both candidate count and reranker model size, which is why two-stage retrieval limits the candidate pool aggressively.

When does async prefetching actually hide retrieval latency in conversational agents?

Async prefetching works when you can predict the next memory need during user think time (the gap between agent response and next user message). If your agent can speculate on likely follow-up queries and fetch those results in the background, the retrieval cost is completely hidden by the time the user actually asks. It fails when queries are unpredictable or think time is too short to complete the prefetch.

What's the real cost of going from 200ms to 100ms retrieval in production?

Cutting retrieval latency in half usually means doubling infrastructure spend or accepting a 10-15% recall drop. You're either moving to in-memory storage, scaling up to more powerful vector DB instances, or switching to approximate search with looser precision. The question isn't whether you can hit 100ms—it's whether the use case justifies the cost or quality tradeoff required to get there.

Latency Budgets for Memory Retrieval: Targets, Tradeoffs, and Failure Modes

Shardul Mane

13 May 2026 • 8 min read

Your agent's memory latency budget says 200ms for retrieval, but you're hitting 350ms in production because the buffer you built for variance just got eaten by a reranking spike. LLM inference gets the biggest chunk of your response time, sure, but retrieval needs its own explicit allocation broken down by stage: embedding generation, vector search, reranking, result assembly. Each one has a different latency profile, and a 4ms vector search doesn't help if your embedding call upstream took 400ms. We're going to show you how to size that budget correctly, which tradeoffs bite hardest, and the three failure modes that collapse P99 latency even when P50 looks healthy.

TLDR:

Voice AI needs sub-100ms retrieval to hit <800ms total response time; chat agents get 200ms before the budget breaks.
Index fragmentation silently degrades search by 40-60%, and most teams size variance buffers around P50 instead of P95.
Two-stage retrieval cuts latency to 40-80ms by running fast approximate search first, then reranking only the top candidates.
Track retrieval, generation, and network overhead separately or you'll blame the wrong bottleneck when extraction takes 20-40 seconds.
Supermemory hits sub-400ms at scale with hybrid vector + keyword search and a custom graph engine that skips full index scans.

Understanding Memory Retrieval Latency

Retrieval latency is the gap between when a query fires and when relevant data comes back. For agents operating in real-time conversation loops, that gap directly shapes whether the interaction feels responsive or broken. Users perceive delays above 100ms, and anything past a few hundred milliseconds registers as a stall.

The Five-Stage Retrieval Pipeline

Every retrieval call passes through five discrete stages before your agent gets anything useful back. Latency accumulates across all of them.

Query processing: raw input gets parsed, normalized, and prepared for embedding
Embedding generation: the query is converted into a vector representation, requiring a model inference call
Similarity search: the vector index scans for nearest neighbors, typically accounting for the largest share of total retrieval time
Reranking and filtering: candidates get scored by semantic relevance and pruned based on metadata or threshold rules
Result assembly: the retrieved results get packaged and returned to the calling application

Each stage has its own latency profile. A fast vector search means nothing if your embedding call takes 400ms upstream of it.
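
One way to see which stage is eating the budget is to time each one separately. The sketch below is illustrative only: `StageTimer` and `retrieve` are hypothetical names, and each stage body is a placeholder for the real work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock latency per pipeline stage, in milliseconds."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            # Accumulate so a stage that runs more than once is summed.
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def slowest(self):
        return max(self.timings_ms, key=self.timings_ms.get)


def retrieve(query, timer):
    # Every body below is a stand-in for the real stage (assumption).
    with timer.stage("query_processing"):
        normalized = query.strip().lower()
    with timer.stage("embedding"):
        time.sleep(0.02)  # stand-in for a model inference call
        vector = [float(len(normalized))]
    with timer.stage("similarity_search"):
        candidates = ["memory-b", "memory-a", "memory-c"]
    with timer.stage("reranking"):
        ranked = sorted(candidates)
    with timer.stage("result_assembly"):
        return {"query": normalized, "vector": vector, "results": ranked}
```

After a call, `timer.timings_ms` holds one number per stage and `timer.slowest()` names the stage to look at first. In this toy run that is the embedding stage, since it is the only one that sleeps.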

Setting Performance Targets for Production Systems

Performance targets have settled around hard numbers, and the bar has risen sharply. Three seconds of end-to-end agent response time was workable in 2024. By 2026, that number is a dealbreaker. Users expect responses under a second, and voice AI demands even tighter margins.

Voice AI agents need sub-100ms retrieval to hit under 800ms total response time. Conversational chat agents get 200ms before the budget breaks, enterprise copilots can stretch to 400ms within a 3-second window, and batch or async workflows have no strict ceiling past the 1-second mark.

At the vector search layer, production systems need sub-50ms similarity search even across millions of embeddings. Blow past it, and you've consumed most of your retrieval budget before reranking even runs.

Voice AI carries the sharpest constraint. A 300ms embedding call alone overruns a sub-100ms retrieval target three times over, which is why voice agents often cache embeddings aggressively or skip reranking entirely.

Allocating Your Latency Budget

A latency budget is your total acceptable response time carved into slices for each component. LLM inference usually gets the biggest chunk, but retrieval needs its own explicit allocation beyond whatever's left over after everything else runs.

Strong teams break it down further than treating "retrieval" as a single line item. For a 500ms window between query and response:

50ms — edge and network
80ms — orchestration
120ms — primary retrieval reads
100ms — downstream calls and reranking
150ms — variance and retry buffer

That last line is where most systems fail quietly. P50 numbers look clean in dashboards until a cold cache or index spike hits and the retry buffer vanishes entirely.

Why the Buffer Gets Eaten First

Most teams size their variance buffer against best-case infrastructure behavior. That works until it doesn't. Reranking alone can swing 40-80ms depending on result set size, and if your orchestration layer is doing any async fan-out, that variance compounds. Build the buffer around your P95, not your P50.
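
To make the buffer arithmetic concrete, here is a small sketch that uses the 500ms allocation above and checks how much buffer survives once each stage's P95 is taken into account. The stage keys and the `buffer_remaining_ms` helper are hypothetical names, and the observed P95 values are invented for illustration.

```python
# Allocation from the 500ms example above, in milliseconds.
BUDGET_MS = {
    "edge_and_network": 50,
    "orchestration": 80,
    "primary_retrieval": 120,
    "rerank_and_downstream": 100,
    "variance_buffer": 150,
}


def buffer_remaining_ms(budget_ms, observed_p95_ms):
    """Return the variance buffer left after per-stage P95 overruns."""
    overrun = 0.0
    for stage, allocated in budget_ms.items():
        if stage == "variance_buffer":
            continue
        observed = observed_p95_ms.get(stage, 0.0)
        # Only time spent beyond the allocation draws down the buffer.
        overrun += max(0.0, observed - allocated)
    return budget_ms["variance_buffer"] - overrun


# Invented P95 measurements (assumption, not benchmark data).
OBSERVED_P95_MS = {
    "edge_and_network": 55,
    "orchestration": 90,
    "primary_retrieval": 160,
    "rerank_and_downstream": 180,
}
```

With these numbers the stages overrun by 5, 10, 40, and 80ms, so only 15ms of the 150ms buffer is left at P95. A negative result means the budget is already broken at P95, no matter how clean P50 looks.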

Vector Database Performance Benchmarks

According to the Salttech 2026 vector database benchmark, tests using 1 million vectors at 1536 dimensions give the clearest read on real-world performance across vector databases.

Qdrant hits 4ms at p50, the lowest among purpose-built vector databases. Redis comes in at 5ms for in-memory workloads. Postgres with pgvector and pgvectorscale shows variable latency but stays under 100ms max at 99% recall.

At a 99% recall threshold, both Qdrant and Postgres (with pgvector and pgvectorscale) keep maximum query latency under 100ms. Redis nearly matches Qdrant on raw p50 speed, but only while the dataset fits in memory. Push past that ceiling and you're rethinking the storage model from scratch. Real-world vector database performance benchmarks show consistent patterns across production deployments at scale.

Common Latency Tradeoffs in RAG Systems

Every RAG system forces a tradeoff between retrieval depth and speed. Fetch more chunks and you improve recall; fetch fewer and you cut latency. The problem is that neither extreme is free.

Reranking adds accuracy but stacks another 50-150ms on top of your vector search. Larger embedding models retrieve better but take longer to encode queries. Wider context windows mean more tokens for the LLM to process, which compounds your total response time well past any reasonable agent latency budget.

The tradeoff that bites hardest is recall versus speed. You can tune top_k down to hit sub-100ms retrieval, but you risk missing the memory that actually matters.
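
Before tuning top_k down, it helps to measure what you lose. A minimal sketch of recall at k, assuming you have a labeled set of relevant memory ids per query; the function name and the ids in the usage note are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)
```

For a ranking of `["m4", "m9", "m1", "m7", "m3", "m2"]` with relevant memories `m1`, `m2`, and `m3`, cutting top_k from 6 to 3 drops recall from 1.0 to one third. Running this over a sample of real queries shows where the latency savings start costing answers.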

Measuring and Monitoring Retrieval Performance

Instrumentation is where most teams underinvest. Vague signals like "retrieval feels slow" aren't actionable. You need actual numbers broken down by component.

Latency in RAG systems comes from three places: retrieval, generation, and network overhead. Track each separately, not as a single end-to-end blur. The four metrics that matter most in production:

End-to-end response time across the full request cycle
Retrieval latency isolated from generation so you can see each independently
Generation latency tracked on its own to avoid false bottleneck attribution
Error rates per component, instead of only aggregate failures

Semantic search retrieval typically returns results around 200ms, while extraction and consolidation operations run 20-40 seconds in typical document processing pipelines. When those two numbers share a pipeline, a slow response isn't always a search problem. The bottleneck is often upstream in extraction, and you'll never know unless your instrumentation separates them.
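
A short sketch of why per-component percentiles matter. It assumes each request is logged as a dict of component latencies; the field names, the trace values, and the helper functions are invented for illustration. The percentile uses the simple nearest-rank method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def bottleneck(traces, pct):
    """Name the component with the highest latency at the given percentile."""
    components = sorted({name for trace in traces for name in trace})
    at_pct = {
        name: percentile([t[name] for t in traces if name in t], pct)
        for name in components
    }
    return max(at_pct, key=at_pct.get)


# Invented traces: retrieval is usually fast but occasionally very slow.
TRACES = (
    [{"retrieval_ms": 200, "generation_ms": 450, "network_ms": 40}] * 18
    + [{"retrieval_ms": 900, "generation_ms": 460, "network_ms": 45}] * 2
)
```

On this data generation is the largest component at P50, while retrieval takes over at P95. A dashboard that only shows medians, or only end-to-end time, would point at the wrong component.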

Failure Modes That Break Latency SLAs

Three failure modes show up repeatedly when agents miss their latency SLAs.

Cold cache misses: hit hardest on first retrieval, where no warm embedding cache exists and every query bottoms out at raw vector search plus reranking
Index fragmentation: builds silently over time as write-heavy workloads create unmerged segments, significantly degrading search performance without any obvious signal
Thundering herd: occurs when multiple agents query memory simultaneously at session start, collapsing p99 latency under shared load even when p50 looks healthy

Catching these early requires tracking p99 separately from averages and alerting on cache hit rate drops before they cascade.
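
For the thundering herd case specifically, two common mitigations are capping concurrent memory queries and adding random jitter to session startup. The sketch below shows both with asyncio. The function names, the concurrency cap of 4, and the 300ms jitter ceiling are assumptions for illustration, and the database call is simulated with a sleep.

```python
import asyncio
import random


async def query_memory(session_id, limiter, stats):
    async with limiter:  # at most max_concurrent queries run at once
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for the vector DB round trip
        stats["in_flight"] -= 1
    return f"context-for-{session_id}"


async def start_session(session_id, limiter, stats, max_jitter_s):
    # Stagger startup so sessions do not all hit the database together.
    await asyncio.sleep(random.uniform(0.0, max_jitter_s))
    return await query_memory(session_id, limiter, stats)


async def start_all(session_ids, max_concurrent=4, max_jitter_s=0.3):
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(start_session(s, limiter, stats, max_jitter_s) for s in session_ids)
    )
    return results, stats["peak"]
```

`asyncio.run(start_all(range(20)))` starts twenty sessions, but the database never sees more than four queries in flight. The jitter spreads arrivals out, and the semaphore is the hard cap if they still bunch up.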

Optimization Techniques That Preserve Quality

Two-stage retrieval is the most reliable way to hold quality while cutting latency. A fast approximate search narrows the candidate pool in under 10ms, then a precise reranker scores the top 20–50 results. Total cost: 40–80ms instead of 200ms+ for full precision search. According to the RAG Latency Playbook, two-stage reranking achieves 95% accuracy while completing in one-third the time of full deep reranking.
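
The shape of two-stage retrieval can be sketched in a few lines. This is a toy under stated assumptions, not a production design: the first stage scores only a prefix of each vector as a crude stand-in for a real ANN index, and the second stage uses full cosine similarity as a stand-in for a cross-encoder reranker. All names and vectors are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(query, memories, coarse_dims=2, pool_size=3, top_k=1):
    """memories maps id -> vector. Returns the top_k ids after reranking."""
    # Stage 1: cheap, approximate scoring over every memory.
    pool = sorted(
        memories,
        key=lambda mid: cosine(query[:coarse_dims], memories[mid][:coarse_dims]),
        reverse=True,
    )[:pool_size]
    # Stage 2: precise scoring, but only over the small candidate pool.
    reranked = sorted(pool, key=lambda mid: cosine(query, memories[mid]), reverse=True)
    return reranked[:top_k]


QUERY = [1.0, 0.0, 1.0, 0.0]
MEMORIES = {
    "a": [1.0, 0.0, 0.0, 1.0],
    "b": [0.9, 0.1, 1.0, 0.0],
    "c": [0.0, 1.0, 1.0, 0.0],
    "d": [0.8, 0.2, 0.0, 0.0],
}
```

In this example the coarse stage ranks `a` first, and the precise stage corrects that and returns `b`. The pool size is the main lever: a larger pool recovers more coarse-stage mistakes but makes the expensive stage slower.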

Caching hot memories helps too. Frequently accessed context often repeats across sessions, and serving those from an in-memory cache brings retrieval under 5ms for the hits that matter most.
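
A minimal in-process version of that cache might look like the sketch below: least-recently-used eviction plus a time-to-live, with hit and miss counters so the hit rate can be monitored. `HotMemoryCache`, its defaults, and the `fetch` callback are hypothetical, and a shared cache across processes would need a different design.

```python
import time
from collections import OrderedDict


class HotMemoryCache:
    """LRU cache with a TTL for retrieved memory results."""

    def __init__(self, max_entries=1024, ttl_s=300.0, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (stored_at, value)
        self._max_entries = max_entries
        self._ttl_s = ttl_s
        self._clock = clock
        self.hits = 0
        self.misses = 0

    def get_or_fetch(self, key, fetch):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return entry[1]
        # Missing or expired: pay the full retrieval cost once.
        self.misses += 1
        value = fetch(key)
        self._entries[key] = (now, value)
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The `hit_rate` method is the number to alert on. If it stays low, access patterns are not concentrated enough for caching to pay off.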

For cold-path queries, async prefetching lets agents speculate on likely memory needs before the user's next turn, hiding latency entirely behind think time.
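
The prefetch pattern can be sketched with asyncio background tasks. Everything here is illustrative: `Prefetcher` and `fetch_memories` are hypothetical names, the 50ms retrieval is simulated with a sleep, and predicting the likely queries is assumed to happen elsewhere.

```python
import asyncio


async def fetch_memories(query):
    await asyncio.sleep(0.05)  # stand-in for a real retrieval round trip
    return f"memories for: {query}"


class Prefetcher:
    def __init__(self):
        self._tasks = {}

    def speculate(self, likely_queries):
        # Start retrieval in the background; do not await here.
        for query in likely_queries:
            if query not in self._tasks:
                self._tasks[query] = asyncio.create_task(fetch_memories(query))

    async def get(self, query):
        task = self._tasks.pop(query, None)
        if task is None:
            return await fetch_memories(query)  # misprediction: full cost
        return await task  # usually already finished by now

    def cancel_unused(self):
        for task in self._tasks.values():
            task.cancel()
        self._tasks.clear()


async def demo():
    prefetcher = Prefetcher()
    prefetcher.speculate(["refund policy", "order status"])
    await asyncio.sleep(0.1)  # simulated user think time
    loop = asyncio.get_running_loop()
    start = loop.time()
    result = await prefetcher.get("refund policy")
    waited_s = loop.time() - start
    prefetcher.cancel_unused()
    return result, waited_s
```

In `demo`, the think time is longer than the retrieval, so the wait the user sees is close to zero. A query that was not predicted falls through to a normal fetch and pays the full cost.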

Memory System Architecture and Latency Impact

Architecture decisions shape latency before a single query runs. Systems built on compact structured records with precomputed embeddings skip the encoding step at query time entirely, turning similarity search into a lookup against already-indexed vectors instead of a fresh compute job.

The subset principle matters just as much. Agents don't need every stored memory per turn, only the task-relevant slice. Scoping retrieval to a small candidate pool keeps search fast and avoids bloating the context passed downstream to the LLM.

In-memory storage trades cost per GB for speed. For latency-critical paths where retrieval compounds across multi-step reasoning, that tradeoff usually pays off. Disk-backed storage with session-level caching on active users covers lower-priority memory without the cost overhead.

Supermemory's Approach to Sub-300ms Retrieval

Sub-300ms recall comes from architecture, not luck. We use hybrid vector + keyword search with context-aware reranking, letting keyword signals short-circuit the retrieval path when they're sufficient. That combination avoids forcing every query through full approximate nearest-neighbor search when a faster route exists.

The benchmarks reflect it. On LongMemEval-S (open benchmark): 85.4% overall accuracy, 82.0% on temporal reasoning (vs 62.4% for competitors), 89.7% on knowledge updates (vs 77.5%). On LoCoMo: P@1 of 59.7% against Mem0's 34.4%, with Recall@10 at 83.5% vs 69.3%. Accuracy at speed, not one traded for the other.

Benchmark & Metric	Supermemory	Competitors	Performance Gap
LongMemEval-S: Overall Accuracy	85.4%	Baseline comparison	Industry-leading recall across all categories
LongMemEval-S: Temporal Reasoning	82.0%	62.4%	+19.6 percentage points on time-sensitive queries
LongMemEval-S: Knowledge Updates	89.7%	77.5%	+12.2 percentage points on evolving information
LoCoMo: Precision at 1 (P@1)	59.7%	34.4%	+25.3 percentage points for top result accuracy
LoCoMo: Recall at 10 (Recall@10)	83.5%	69.3%	+14.2 percentage points across top 10 results
Production Latency at Scale	Sub-400ms at 100B+ tokens monthly	Variable, often 500ms+	Maintains speed under real-world load

Our custom vector graph engine with ontology-aware edges tracks relationships between memories beyond raw similarity scores, so context assembly pulls the right slice without scanning the full index. Processing 100B+ tokens monthly while holding sub-400ms at scale is the result of those architectural choices compounding together.

Final Thoughts on Memory System Speed

Hitting your retrieval latency budget requires breaking the pipeline into pieces and tracking each one independently. P50 metrics hide the cache misses and index fragmentation that blow up your P95 when traffic spikes. Scope your retrieval, cache what repeats, and build variance buffers around real-world behavior. The difference between responsive agents and broken ones is architecture, not luck.

FAQ

What's the fastest vector database for agent memory retrieval in 2026?

Redis hits 5ms at p50 for in-memory workloads, but Qdrant delivers 4ms for purpose-built vector operations. Postgres with pgvector stays under 100ms at 99% recall, which works for most production agents that aren't running voice AI.

Can I hit sub-100ms retrieval without sacrificing recall quality?

Yes, but you need two-stage retrieval. Run a fast approximate search to narrow candidates in under 10ms, then rerank the top 20-50 results. Total cost is 40-80ms instead of 200ms+ for full precision, and you preserve the quality that matters.

Memory retrieval latency vs generation latency: which matters more for agent response time?

Both consume your total response budget, but retrieval latency bites first and compounds harder. A 300ms retrieval delay uses more than a third of an 800ms voice AI budget before generation even starts. Track them separately in your instrumentation, because most "slow generation" problems are actually upstream retrieval bottlenecks.

How do I allocate my agent's latency budget across the full pipeline?

For a 500ms retrieval window: allocate 50ms to network/edge, 80ms to orchestration, 120ms to primary vector search, 100ms to reranking and downstream calls, and reserve 150ms for variance buffer. Size that buffer against your P95 latency, not P50—cold cache misses and index spikes will eat it immediately.

When does caching hot memories actually reduce agent memory latency?

When frequently accessed context repeats across sessions. Serving from in-memory cache brings retrieval under 5ms for those hits, but only if your access patterns show concentration. For cold-path queries with low repeat rates, async prefetching during user think time hides latency better than static caching.