How We Built Semantic Memory for Agent Conversations at Scale
By Marcus Chen • April 14, 2026 • 11 min read
Building a semantic memory layer for a handful of agents in a controlled environment is a solvable problem with existing tools. Building one that handles tens of millions of indexed conversations, sub-50ms recall latency, and multi-tenant isolation across hundreds of customer deployments is a different engineering challenge entirely. This is a detailed account of how we approached it, where we were wrong in our early assumptions, and what the system looks like today.
The Early Version: What We Got Wrong
Our first implementation treated memory indexing as a synchronous operation. A conversation turn arrives, we embed it, write it to the vector index, and acknowledge. This feels clean in a prototype — the memory store is always up to date, retrieval reflects the latest state. In production it's a latency disaster. Embedding generation takes 30-80ms per turn, and blocking the agent response path on an embedding call means your agent's response time is directly coupled to your embedding model's performance. Every hiccup in the embedding service causes user-visible delays.
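To make the coupling concrete, here's roughly the shape that first version had, sketched in Python. The client names here are illustrative stand-ins, not our actual code:

```python
# Illustrative sketch of the original synchronous path (names are hypothetical).
# The agent's acknowledgment waits on the embedding call and the index write,
# so response latency inherits every hiccup from the embedding service.

def handle_turn_sync(turn: str, user_id: str, embedder, vector_index) -> None:
    vector = embedder.embed(turn)                # 30-80ms, on the response path
    vector_index.upsert(user_id, turn, vector)   # another network round trip
    # Only after both calls return can the agent acknowledge the turn.
```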
We also started with a flat memory model — all conversation turns treated as equally retrievable facts. This works fine at low volume. At scale, it creates a recall quality problem: the index fills with low-signal noise. Every "ok thanks" and "got it" turn was being indexed and retrieved. Query results were polluted with conversational filler, and the signal-to-noise ratio of retrieved context degraded steadily as the index grew.
Third mistake: no eviction strategy. Memories accumulated without bound. An active user running 10 conversations a day would accumulate thousands of indexed turns in a few weeks. When you retrieve the k nearest neighbors, you're searching an ever-growing haystack — and search latency grows with index size unless you partition aggressively.
Async Indexing with Bounded Lag
The first architectural fix was decoupling write and index operations. Conversation turns are now written to a durable message queue immediately upon arrival — acknowledgment to the agent is instant, based on the write confirmation, not the index update. A separate indexing worker pool consumes from the queue, generates embeddings, and writes to the vector store asynchronously.
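A minimal sketch of the decoupled path, assuming a generic durable queue client and batch embedding; every name and interface here is an assumption, not our production API:

```python
# Hypothetical sketch of the decoupled write path. `queue`, `embedder`, and
# `vector_index` stand in for whatever durable queue and vector store you use.

def handle_turn_async(turn: str, user_id: str, queue) -> None:
    # Acknowledge as soon as the durable write succeeds; no embedding work here.
    queue.publish(topic="memory-writes", message={"user_id": user_id, "turn": turn})

def indexing_worker(queue, embedder, vector_index, batch_size: int = 32) -> None:
    # Runs in a separate worker pool, sized for throughput rather than latency.
    while True:
        batch = queue.consume(topic="memory-writes", max_messages=batch_size)
        if not batch:
            continue
        vectors = embedder.embed_batch([m["turn"] for m in batch])
        for message, vector in zip(batch, vectors):
            vector_index.upsert(message["user_id"], message["turn"], vector)
        queue.ack(batch)
```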
The tradeoff is an indexing lag — memories written in the last few seconds may not yet be retrievable. For most use cases, this is fine. Agents rarely need to retrieve something they just said in the same turn. Where recency matters for retrieval, we maintain a short in-memory buffer of the last N turns that bypasses the index entirely. The retrieval path checks this buffer first, then queries the persistent index for older memories.
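Here's roughly how the buffer-first retrieval path fits together, assuming a simple per-user deque for the recency buffer; the buffer size and client interfaces are illustrative:

```python
from collections import defaultdict, deque

# Hypothetical recency buffer: the last N turns per user, kept in memory and
# checked before the persistent index, so turns written during the indexing
# lag are still visible to retrieval.
RECENT_TURNS = defaultdict(lambda: deque(maxlen=20))

def remember_turn(user_id: str, turn: str) -> None:
    RECENT_TURNS[user_id].append(turn)

def recall(user_id: str, query: str, embedder, vector_index, k: int = 5) -> list[str]:
    recent = list(RECENT_TURNS[user_id])                          # bypasses the index entirely
    query_vector = embedder.embed(query)
    older = vector_index.search(user_id, query_vector, top_k=k)   # persistent, indexed memories
    return recent + older
```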
This architecture keeps the agent response path clean of any indexing latency while ensuring memories are available within seconds for subsequent turns. The indexing workers scale independently of the serving path, which means a spike in write volume doesn't degrade recall latency for reads.
Memory Extraction Instead of Turn Indexing
The second major change was replacing raw turn indexing with a memory extraction step. Instead of embedding every turn verbatim, the indexing pipeline passes batches of turns through a lightweight extraction model that identifies what's worth remembering. Factual assertions about the user, stated preferences, task goals, decisions made — these become discrete memory entries. Conversational filler, system acknowledgments, and purely procedural content are either discarded or stored only in the temporal log without semantic indexing.
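A simplified sketch of that extraction pass; the prompt, extraction model interface, and temporal log client are all stand-ins for the real pipeline:

```python
# Illustrative extraction pass. Only extracted entries reach the semantic index,
# while every raw turn still lands in the temporal log. The prompt and the
# `extractor` interface are assumptions, not our production setup.

EXTRACTION_PROMPT = (
    "From the conversation turns below, list only durable facts worth remembering: "
    "facts about the user, stated preferences, task goals, and decisions made. "
    "Ignore filler and acknowledgments. Return one entry per line."
)

def index_batch(user_id: str, turns: list[str], extractor, embedder,
                vector_index, temporal_log) -> None:
    temporal_log.append(user_id, turns)                       # raw turns are always kept here
    response = extractor.complete(EXTRACTION_PROMPT + "\n\n" + "\n".join(turns))
    memories = [line.strip() for line in response.splitlines() if line.strip()]
    for memory, vector in zip(memories, embedder.embed_batch(memories)):
        vector_index.upsert(user_id, memory, vector)           # only extracted entries get indexed
```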
This reduces index size by roughly half for typical conversation patterns, while significantly improving retrieval precision. The retrieved context is no longer polluted with noise. When the agent asks "what do I know about this user's workflow preferences?" it gets back structured preferences, not a mix of preferences and "okay, let me check on that."
Memory extraction adds latency to the indexing path, but since indexing is now async, this doesn't affect agent performance. The extraction model runs on the worker nodes, which are sized for throughput rather than latency.
Auto-Compaction for Long-Running Users
The third major system is auto-compaction. For users with many months of interactions, the raw memory store grows large enough that search quality degrades and storage costs become meaningful. Compaction runs as a background process on a per-user basis, triggered when the user's memory store crosses configurable thresholds.
Compaction works by grouping related memories into higher-level abstractions. A cluster of memories about a user's programming preferences gets consolidated into a single structured preference record. A sequence of memories about a completed project gets summarized into an episode summary. The raw underlying memories are preserved in a cold archive — retrievable but not searched by default — while the compacted representations serve everyday queries.
The challenge with compaction is knowing when the consolidation is lossy. We use confidence scoring on compacted records — high confidence when the underlying memories are consistent and reinforcing, low confidence when they're contradictory or thin. Low-confidence compacted records trigger a fallback to the raw memory search, which catches edge cases where the summary missed important nuance.
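In outline, the compaction trigger and the confidence-gated fallback look something like this; the thresholds, clustering step, and summarizer interface are illustrative, not production values:

```python
# Illustrative compaction and fallback logic. The threshold, confidence floor,
# and `memory_store` / `summarizer` interfaces are assumptions about shape.

COMPACTION_THRESHOLD = 10_000   # memories per user before compaction runs
CONFIDENCE_FLOOR = 0.7          # below this, queries fall back to raw memories

def maybe_compact(user_id: str, memory_store, summarizer) -> None:
    if memory_store.count(user_id) < COMPACTION_THRESHOLD:
        return
    for cluster in memory_store.cluster_related(user_id):   # group related memories
        record = summarizer.consolidate(cluster)             # structured record with confidence score
        memory_store.write_compacted(user_id, record)
        memory_store.archive(user_id, cluster)               # raw memories go cold, not deleted

def recall_compacted(user_id: str, query_vector, memory_store, k: int = 5):
    hits = memory_store.search_compacted(user_id, query_vector, top_k=k)
    if any(hit.confidence < CONFIDENCE_FLOOR for hit in hits):
        # Low-confidence summaries may have dropped nuance; search the raw archive too.
        hits += memory_store.search_archive(user_id, query_vector, top_k=k)
    return hits
```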
Scaling the Index
At the scale of millions of indexed conversations, a single flat vector index is no longer viable. Search latency grows with index size, and the index itself needs to be updated continuously as new memories arrive. We partition primarily by tenant and user, which keeps individual index shards small and ensures row-level isolation is structurally enforced rather than filtered at query time. Per-user shards are small enough that even brute-force nearest-neighbor search is fast — and for large users, we maintain approximate nearest-neighbor structures that bound latency at the cost of a small recall degradation.
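A sketch of the shard-routing idea, using a hypothetical in-memory shard map; the point is that the (tenant, user) key selects the index before any search happens, so isolation never depends on a query-time filter:

```python
import numpy as np

# Hypothetical per-(tenant, user) shard lookup. The shard object, its fields,
# and the 50k cutoff are illustrative; only the structure of the routing matters.

def recall_from_shard(tenant_id: str, user_id: str, query_vector: np.ndarray,
                      shards: dict, k: int = 5) -> list[str]:
    shard = shards[(tenant_id, user_id)]          # bounded, per-user index
    if shard.size < 50_000:
        # Small shard: exact brute-force nearest-neighbor search is fast enough.
        scores = shard.vectors @ query_vector
        top = np.argsort(scores)[-k:][::-1]
        return [shard.memories[i] for i in top]
    # Large shard: approximate nearest-neighbor structure bounds latency
    # at the cost of a small recall degradation.
    return shard.ann_index.search(query_vector, top_k=k)
```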
The result is that recall latency stays under 50ms across the index regardless of overall system size, because each query is executed against a bounded shard rather than the full corpus. Tenant isolation is a structural property of the partitioning scheme, not a query-time filter that could be misconfigured.
What We'd Do Differently
The biggest lesson is that the memory extraction step should have been in the initial design, not a retrofit. Fixing index quality after a year of raw turn accumulation required significant backfill work — re-processing millions of stored turns through the extraction pipeline and rebuilding the index. That work cost weeks that could have been avoided.
We'd also have invested earlier in the compaction infrastructure. Memory stores that grow without compaction eventually become liabilities — the index quality degrades and the storage cost is hard to explain to customers. Compaction should be a first-class feature, not a reactive fix.
The architecture today is solid. Sub-50ms recall, high extraction precision, auto-compaction for long-running users, and per-tenant isolation that's structural rather than filtered. Getting here required two significant rebuilds. If you're starting from scratch, the architecture in this post is where to begin — not where we began.
CoreCast handles this entire infrastructure — async indexing, memory extraction, auto-compaction, and row-level isolation — so you don't have to rebuild it.