Two paradigms for grounding LLMs in real knowledge. When to retrieve, when to cache, and how to build both in production.

May 2026 · 3,400 words


Every LLM you ship has the same flaw baked in. It knows only what it was trained on. Give it a question about your internal docs, last week's news, or a customer's specific account and it will confabulate with confidence.

RAG and CAG are the two serious answers to this problem. They are not competing philosophies. They are different tools with different cost profiles, latency characteristics, and failure modes. Knowing which one to reach for, and when to combine them, is one of the most consequential architectural decisions in production AI.

This guide covers both mechanisms properly: how they work, where each wins, the production stacks behind them, and the hybrid pattern the best systems are converging on.

[Figure: Overview: RAG vs CAG]

1. The Problem Both Solve

The era of "train the model on everything" is over for most applications. Fine-tuning on proprietary data is expensive, slow to update, and does not solve the retrieval problem. It just moves where the knowledge lives. Retrieval and context-loading are now the primary mechanisms for domain-specific AI.

RAG and CAG both answer the same question: how do you give an LLM access to knowledge beyond its training data without retraining it?

They answer it differently.

RAG (Retrieval-Augmented Generation), introduced by Lewis et al. at Facebook AI Research (now Meta AI) in 2020, fetches relevant documents at query time, injects them as context, and generates a grounded answer. The corpus lives in a vector database. The model never sees documents it does not need.

CAG (Cache-Augmented Generation), proposed by Chan et al. and published at ACM Web Conference 2025, preloads the entire corpus into the model's context window before any query arrives. The KV cache stores the resulting attention state. Every subsequent query runs against that pre-computed context at near-zero retrieval overhead.

Same problem. Fundamentally different trade-offs.


2. How RAG Works

RAG operates in two distinct phases. The ingestion phase runs once, or on a schedule. The query phase runs on every request.

The RAG Pipeline

Ingestion pipeline:

Load raw documents (PDFs, HTML, Markdown, databases) and chunk them into segments of 256-512 tokens with 10-20% overlap. Embed each chunk using an embedding model such as text-embedding-3-large, then upsert the resulting vectors plus metadata into a vector database (Pinecone, Weaviate, or Milvus).
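
The chunking step can be sketched in a few lines. This is a minimal illustration, assuming the document has already been tokenized; `chunk_tokens` and the placeholder tokens are hypothetical, and a real pipeline would use the tokenizer that matches its embedding model.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens stand in for real tokenizer output.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)  # 64/512 = 12.5% overlap
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some tokens twice.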

Query pipeline:

Embed the incoming user query using the same embedding model. Run a top-k similarity search against the vector store, typically k=5-10. Optionally re-rank results with a cross-encoder for precision. Inject retrieved chunks into the LLM prompt as context, then generate a grounded, cited answer.
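
To make the similarity search concrete, here is a brute-force sketch of top-k retrieval by cosine similarity. The function names and the three-dimensional vectors are illustrative assumptions; a vector database performs the same ranking with approximate nearest-neighbour indexes over vectors with hundreds or thousands of dimensions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], chunk_vecs: List[Sequence[float]], k: int = 5) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: chunks 0 and 1 point in nearly the same direction as the query.
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hits = top_k([1.0, 0.0, 0.0], chunk_vecs, k=2)
print(hits)
```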

Reported production benchmarks: p99 latency of around 200-500 ms for a naive pipeline, and around 112 ms with hybrid search and reranking (LlamaIndex + Pinecone, 2026). Hybrid search reaches a reported 92% answer relevance.

Code: RAG with LangChain + Pinecone

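from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings

# Assumes OPENAI_API_KEY and PINECONE_API_KEY are set in the environment and that
# the 'my-kb' index already exists with a dimension matching the embedding model.

# 1. Ingest and chunk (the default PDF loader needs the `unstructured` package)
loader = DirectoryLoader('./docs', glob='**/*.pdf')
# from_tiktoken_encoder measures chunk_size in tokens; the plain constructor counts characters
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=50
)
docs = splitter.split_documents(loader.load())

# 2. Embed and store
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')
vectorstore = PineconeVectorStore.from_documents(
    docs, embeddings, index_name='my-kb'
)

# 3. Query at runtime
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
results = retriever.invoke('What is our return policy?')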
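
The final step of the query pipeline, injecting retrieved chunks so the model can cite them, might look like the sketch below. `build_grounded_prompt` and the sample chunks are hypothetical; with LangChain, the text and source would typically come from each returned document's `page_content` and `metadata`.

```python
from typing import List, Tuple


def build_grounded_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Assemble a prompt from (source_id, text) pairs given in ranked order."""
    context = "\n\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite sources as [n]. If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Illustrative retrieved chunks, not real data.
retrieved = [
    ("returns.pdf", "Items can be returned within 30 days of delivery."),
    ("faq.pdf", "Refunds are issued to the original payment method."),
]
prompt = build_grounded_prompt("What is our return policy?", retrieved)
print(prompt)
```

Numbering the sources in the prompt is one simple way to get attributable answers: the `[n]` markers in the output map back to known documents.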

3. How CAG Works

CAG trades the retrieval step for a one-time context preload. Chan et al. describe it directly: rather than relying on a retrieval pipeline, the approach preloads the LLM with all relevant documents in advance and precomputes the key-value (KV) cache, which encapsulates the inference state of the LLM.

The core insight: modern LLMs with long context windows (128K to 2M tokens) can hold entire domain knowledge bases in active memory. The KV cache stores the attention keys and values computed for that loaded context. Inference is faster because the model processes only the incoming query tokens and does not recompute the full context.
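
Holding a corpus in the KV cache has a memory cost that is easy to estimate. The sketch below is a back-of-the-envelope calculation, assuming the published Llama 3.1 8B configuration (32 layers, 8 key-value heads, head dimension 128) and 16-bit cache values; `kv_cache_bytes` is a hypothetical helper, and real deployments add framework overhead on top.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128, fp16/bf16 values.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
full_window = kv_cache_bytes(128 * 1024, num_layers=32, num_kv_heads=8, head_dim=128)

print(per_token // 1024, "KiB per token")
print(full_window // 1024**3, "GiB for a 128K-token context")
```

Under these assumptions, a fully preloaded 128K-token context needs about as much memory for the cache as the 16-bit model weights themselves, which is why cache size belongs in capacity planning.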

The CAG Lifecycle

Setup phase (run once at deploy time):

Curate all relevant documents for your domain. Format them into a single prompt and run a forward pass through the LLM. Save the resulting KV cache states to disk or GPU memory.

Inference phase (every query):

Load the pre-computed KV cache. Process only the new query tokens, with no re-computation on the corpus. Generate the answer, attending to the preloaded context at roughly 0 ms retrieval overhead. Reset the cache to its base state after each session.
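
The lifecycle above, and the reset step in particular, can be pictured with a toy cache. This is a conceptual sketch only: `ToyKVCache` is a hypothetical stand-in that stores placeholder strings where a real cache holds per-layer key and value tensors.

```python
class ToyKVCache:
    """Stand-in for a KV cache: one entry per processed token."""

    def __init__(self) -> None:
        self.entries = []
        self.base_len = 0

    def preload(self, corpus_tokens) -> None:
        # Setup phase: process the corpus once and remember where it ends.
        self.entries = list(corpus_tokens)
        self.base_len = len(self.entries)

    def append(self, tokens) -> None:
        # Inference phase: only new query/answer tokens are added.
        self.entries.extend(tokens)

    def reset(self) -> None:
        # Drop everything after the preloaded corpus, keeping the corpus state.
        del self.entries[self.base_len:]


cache = ToyKVCache()
cache.preload(["doc"] * 1000)   # one-time corpus pass
cache.append(["q"] * 12)        # query tokens
cache.append(["a"] * 40)        # generated answer tokens
size_during_session = len(cache.entries)
cache.reset()
print(size_during_session, len(cache.entries))
```

Resetting by truncation is cheap because the corpus portion never has to be recomputed; only the session-specific tail is discarded.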

Code: CAG with HuggingFace Transformers

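import copy
from typing import Tuple

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.cache_utils import DynamicCache

model_id = 'meta-llama/Llama-3.1-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def preprocess_knowledge(prompt: str) -> Tuple[torch.Tensor, DynamicCache]:
    # One-time KV cache generation from the full corpus
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs, past_key_values=DynamicCache(), use_cache=True)
    # Keep the token ids too: generate() needs them alongside the cache
    return inputs['input_ids'], outputs.past_key_values

# Load corpus once
with open('knowledge_base.txt') as f:
    knowledge = f.read()
knowledge_ids, kv_cache = preprocess_knowledge(knowledge)

# Fast query against the cached context: no retrieval step
def answer(query: str, knowledge_ids: torch.Tensor, kv_cache: DynamicCache) -> str:
    query_ids = tokenizer(
        '\n\n' + query, return_tensors='pt', add_special_tokens=False
    )['input_ids']
    # generate() expects the full sequence (corpus + query); positions already
    # covered by the cache are skipped, so only the query tokens are computed
    input_ids = torch.cat([knowledge_ids, query_ids], dim=-1)
    # generate() extends the cache in place, so work on a copy to keep the
    # preloaded state clean for the next query
    cache = copy.deepcopy(kv_cache)
    output = model.generate(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),
        past_key_values=cache,
        max_new_tokens=512
    )
    # Decode only the newly generated tokens, not the corpus and query
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)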

Code: CAG via Anthropic Prompt Caching

For teams using Claude via API, Anthropic's native prompt caching delivers the core CAG benefit without self-hosting:

import anthropic
client = anthropic.Anthropic()
knowledge_base = open('product_docs.txt').read()

response = client.messages.create(
    model='claude-sonnet-4-20250514',
    max_tokens=1024,
    system=[
        {'type': 'text', 'text': 'You are a product expert.'},
        {
            'type': 'text',
            'text': knowledge_base,
            'cache_control': {'type': 'ephemeral'}  # Cache this block
        }
    ],
    messages=[{'role': 'user', 'content': 'How do I configure SSO?'}]
)
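# First call: computes and stores the cache (cache writes are billed at a premium)
# Later calls within the cache lifetime (5 minutes by default, refreshed on each hit):
# cached tokens cost about 90% less, with ~70% faster time-to-first-token reported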
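
Whether caching pays off depends on how often the cached prefix is reused. The sketch below works through the arithmetic under stated assumptions: cache writes billed at 1.25x the base input price, cache reads at 0.10x, every call after the first hitting the cache, and an illustrative base price of $3 per million input tokens. `input_cost` is a hypothetical helper, and the multipliers should be checked against current provider pricing.

```python
def input_cost(cached_tokens: int, fresh_tokens: int, calls: int,
               base_price_per_mtok: float, use_cache: bool = True,
               write_multiplier: float = 1.25, read_multiplier: float = 0.10) -> float:
    """Total input-token cost in dollars over `calls` requests."""
    per_token = base_price_per_mtok / 1_000_000
    if not use_cache:
        return calls * (cached_tokens + fresh_tokens) * per_token
    # First call writes the cache; later calls read it (assumes no expiry in between).
    first = cached_tokens * write_multiplier + fresh_tokens
    later = (calls - 1) * (cached_tokens * read_multiplier + fresh_tokens)
    return (first + later) * per_token


# 100K-token knowledge base, 50-token questions, 100 calls.
without_cache = input_cost(100_000, 50, 100, base_price_per_mtok=3.0, use_cache=False)
with_cache = input_cost(100_000, 50, 100, base_price_per_mtok=3.0)
print(f"without cache: ${without_cache:.2f}, with cache: ${with_cache:.2f}")
```

With these numbers the write premium is recovered on the second call. If traffic is sparse enough that the cache expires between requests, every call pays the write price and caching costs more than it saves.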

4. Head-to-Head Comparison

The right choice is never obvious from first principles. It depends on corpus size, data freshness requirements, infrastructure constraints, latency targets, and compliance needs.

[Figure: RAG vs CAG Comparison]

The honest summary: CAG wins on latency and accuracy for stable, bounded corpora. RAG wins on scale, freshness, and compliance. Neither dominates unconditionally.

5. Decision Framework

Use this to make the call quickly for your specific context.

[Figure: RAG or CAG Decision Framework]

Clearest CAG wins:

Product documentation Q&A. Legal contract review with a single document loaded in full. Customer onboarding with a fixed knowledge base. Multi-session chatbots needing consistent context. In-app help assistants where latency directly affects UX.

Clearest RAG wins:

News summarisation. E-commerce product search across a large catalogue. Enterprise knowledge search spanning millions of documents. Healthcare evidence lookup where source attribution is legally required. Any application requiring real-time data or full audit trails.
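
The criteria in this section can be condensed into a rough first-pass rule. The sketch below is illustrative only: the `Workload` fields, the 20% headroom, and the one-update-per-day threshold are assumptions chosen for the example, not recommendations from the sources.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    corpus_tokens: int              # size of the knowledge base in tokens
    context_window: int             # model context window in tokens
    updates_per_day: float          # how often the corpus changes
    needs_source_attribution: bool  # audit trail or citations required


def choose_approach(w: Workload, headroom: float = 0.2,
                    max_updates_per_day: float = 1.0) -> str:
    """First-pass pick between 'RAG' and 'CAG' (thresholds are hypothetical)."""
    fits = w.corpus_tokens <= w.context_window * (1 - headroom)
    stable = w.updates_per_day <= max_updates_per_day
    if not fits or not stable or w.needs_source_attribution:
        return "RAG"  # scale, freshness, or compliance dominates
    return "CAG"      # bounded, stable corpus with no attribution requirement


product_docs = Workload(120_000, 200_000, updates_per_day=0.1, needs_source_attribution=False)
news_feed = Workload(5_000_000, 200_000, updates_per_day=500, needs_source_attribution=False)
clinical = Workload(100_000, 200_000, updates_per_day=0.1, needs_source_attribution=True)
print(choose_approach(product_docs), choose_approach(news_feed), choose_approach(clinical))
```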

6. Best Models for Each Approach

Model choice matters differently for each approach. For RAG, embedding model quality drives retrieval accuracy. For CAG, context window size and long-context reasoning performance define what is possible.

Long-context models for CAG

| Model | Context window | Notes |
|---|---|---|
| Gemini 2.0 Pro | 2M tokens | Largest available window. Can hold entire codebases or book-length docs. |
| Claude 3.5 / 4.x | 200K tokens | Excellent needle-in-haystack recall. Strong long-context reasoning. |
| GPT-4 Turbo / o1 | 128K tokens (GPT-4 Turbo), 200K tokens (o1) | Performance degrades past ~64K. Good for moderate corpus sizes. |
| Llama 3.1 405B | 128K tokens | Open-weight, self-hostable option. Quality declines after ~32K tokens. |

Embedding models for RAG

| Model | Provider | Notes |
|---|---|---|
| text-embedding-3-large | OpenAI | High accuracy for English RAG. Widely used in production. |
| text-embedding-3-small | OpenAI | Roughly 85% cheaper than large at launch list prices ($0.02 vs $0.13 per 1M tokens), with strong performance. |
| voyage-3 | Voyage AI | Strong results on MTEB benchmarks. Recommended by Anthropic for use with Claude. |
| BGE-M3 | BAAI (via Hugging Face) | Free, self-hostable. Supports 100+ languages. |

7. Production Use Cases

Customer support chatbot (RAG)

Ground answers in product docs, knowledge base articles, and policies. Reduce hallucinations and ensure consistent answers across thousands of tickets. Update the vector DB as policies change with no redeployment needed.

Stack: LangChain + Pinecone + GPT-4o or Claude Sonnet + Zendesk article sync.

In-app help assistant (CAG)

Preload your entire product documentation into context. Personalised help with no retrieval step as users navigate. Especially effective when combined with user account state for hyper-personalised responses.

Stack: Anthropic Prompt Caching + Claude Sonnet 4 + a cached system prompt sized to fit within the 200K-token window.

Legal document analysis (CAG, RAG, or hybrid)

CAG for single-contract deep analysis: load the entire contract into context and reason holistically. RAG for case law research across large corpora. Hybrid for large matters requiring both breadth and depth.

Real-time news summarisation (RAG)

Breaking news, market data, live sports events. Knowledge changes hourly. RAG is the only viable option here. Continuously update your vector DB as new content arrives via webhooks or scheduled pipelines.

Enterprise knowledge search (RAG)

Unified access across internal wikis, Confluence, Slack, Notion, Google Drive. Millions of documents, continuous updates. Use hybrid search (semantic + BM25) and metadata filters for access-controlled retrieval.

Code review assistant (CAG)

Preload your entire codebase plus style guides and architecture docs. Engineers ask questions with full project context. Gemini 2.0 Pro's 2M token window can hold large repositories in their entirety.

Healthcare clinical decision support (RAG)

RAG over medical literature, drug databases, and clinical guidelines. Source attribution is critical for regulatory compliance. RAG's explicit retrieval step provides the audit trail required by healthcare standards.

Personalised onboarding assistant (CAG)

Load user account config plus product docs into context at session start. A tour-guide AI that knows exactly where the user is and what they have already completed, with zero retrieval latency disrupting the UX.

8. Building RAG: Production Stacks

Orchestration frameworks

| Framework | Best for |
|---|---|
| LangChain | Hundreds of integrations. Rapid prototyping, agents, custom control flows. |
| LlamaIndex | 150+ data connectors. Purpose-built for document-heavy RAG. A reported 40% faster retrieval (baseline not stated). |
| Haystack | Production-first. Built-in evaluation, monitoring, A/B testing. Enterprise security. |
| RAGFlow | No-code visual workflow builder. Good for non-developer teams. |

Vector databases

| Database | Best for |
|---|---|
| Pinecone | Zero-ops serverless. 12K QPS, $0.12/1M vectors. Best SaaS multi-tenant isolation. |
| Weaviate | Hybrid search with 2x recall improvement (v1.26). Strong metadata filtering. |
| Milvus / Zilliz | Most cost-efficient at scale ($500/mo vs $1,200). Best for >10M vector workloads. |
| Chroma | Local development. Fast to start, minimal config. |
| Qdrant | Production self-hosting with strong filtering capabilities. |

[Figure: RAG Stack Tiers: MVP vs Production]

Production code: hybrid search with reranking

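from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.query_engine import RetrieverQueryEngine

# Assumes `vector_store` is an already-configured store that supports hybrid
# queries (for example Pinecone or Weaviate); the default in-memory store does not.
index = VectorStoreIndex.from_vector_store(vector_store)

# Hybrid search: alpha=0.7 weights semantic similarity at 70% and keyword (BM25) at 30%
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
    vector_store_query_mode='hybrid',
    alpha=0.7
)

# Rerank to top-3 for LLM context window
reranker = SentenceTransformerRerank(
    model='cross-encoder/ms-marco-MiniLM-L-2-v2',
    top_n=3
)

# Wire retrieval and reranking into a single query engine
query_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[reranker]
)
response = query_engine.query('What is our return policy?')

# Reported result: p99 latency ~112 ms, 92% answer relevance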
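
To show what an alpha-weighted hybrid score means, here is a standalone sketch of one common fusion scheme: min-max normalise each score list, then blend them. This is an illustration of the idea, not a description of how LlamaIndex or any particular vector store computes hybrid scores internally; `min_max`, `fuse`, and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(semantic: Dict[str, float], keyword: Dict[str, float],
         alpha: float = 0.7) -> List[Tuple[str, float]]:
    """Blend normalised scores: alpha for semantic, (1 - alpha) for keyword."""
    sem, kw = min_max(semantic), min_max(keyword)
    fused = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(sem) | set(kw)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


semantic = {"a": 0.82, "b": 0.80, "c": 0.40}  # cosine similarities
keyword = {"b": 12.0, "c": 9.0, "d": 3.0}     # BM25-style scores
ranked = fuse(semantic, keyword, alpha=0.7)
print([doc_id for doc_id, _ in ranked])
```

In this example, document "b" overtakes "a" because it scores well on both signals, which is the behaviour hybrid search is meant to reward.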

9. Building CAG: Implementation Options

| Tool | Type | Notes |
|---|---|---|
| HuggingFace Transformers | Self-hosted | DynamicCache API for KV cache extraction and reuse. Full lifecycle control. |
| vLLM | Self-hosted | Production inference with prefix caching built-in. PagedAttention optimises KV memory. |
| Anthropic Prompt Caching | API | Cache up to 200K token system prompts. 90% cost reduction on cached tokens. |
| Gemini Context Caching | API | Cache up to 2M tokens. Explicit cache management API with TTL controls. |

CAG best practices

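Put stable content (system prompt, then corpus) at the start of the prompt so the cached prefix is identical across requests. Reset the cache to its preloaded state between sessions. Monitor context utilisation per query. Size the corpus to leave context window headroom for the query and the answer. On self-hosted models, architectures that use Cross-Layer Attention (CLA) can reduce KV cache memory, but CLA is a property of the model and not a runtime switch.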

CAG pitfalls to avoid

Exceeding the context window either fails the request or, in some pipelines, silently truncates content. Models can miss mid-document information (the "lost in the middle" problem). Rebuilding the cache on every deploy is expensive. No built-in source attribution creates compliance risk in regulated industries.
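
The truncation pitfall is cheap to guard against with an explicit budget check before preloading. A minimal sketch, assuming token counts come from the model's own tokenizer; `check_context_budget` and the reserve sizes are hypothetical values for illustration.

```python
def check_context_budget(corpus_tokens: int, context_window: int,
                         reserve_for_query: int = 2_000,
                         reserve_for_answer: int = 1_024) -> int:
    """Return spare tokens after preloading, or raise if the corpus will not fit."""
    budget = context_window - reserve_for_query - reserve_for_answer
    if corpus_tokens > budget:
        raise ValueError(
            f"corpus needs {corpus_tokens} tokens but only {budget} are available "
            f"after reserving room for the query and the answer"
        )
    return budget - corpus_tokens


spare = check_context_budget(corpus_tokens=150_000, context_window=200_000)
print(spare, "tokens of headroom")
```

Failing loudly at build time is preferable to discovering at query time that the tail of the knowledge base never reached the model.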

10. The Hybrid Architecture

The most capable production systems combine both. Use RAG to identify which documents are relevant, then load their full content via CAG for deep reasoning with no further retrieval.

[Figure: RAG and CAG Hybrid]

The hybrid flow:

  1. User query arrives.
  2. RAG retrieval: search the vector DB, identify the top 1-3 most relevant documents from potentially millions.
  3. CAG load: load the full text of those documents into context, not just the matching chunks.
  4. Deep reasoning: the LLM reasons over the full document with complete context and no retrieval gaps.
  5. Cited answer: return the response with source attribution from the RAG step.

RAG handles the scale problem. CAG handles the reasoning quality problem. Together they address the limitations of each approach in isolation.
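
Steps 2 and 3 of the hybrid flow can be sketched as a function that turns chunk-level hits into full-document context. The names and sample data are hypothetical, and the retrieval scores are assumed to come from whatever vector search is already in place.

```python
from typing import Dict, List, Tuple


def hybrid_context(chunk_hits: List[Tuple[str, float]], documents: Dict[str, str],
                   max_docs: int = 3) -> Tuple[List[str], str]:
    """Pick the best-scoring parent documents and return their full text as context."""
    best: Dict[str, float] = {}
    for doc_id, score in chunk_hits:
        # Keep the best chunk score seen for each parent document.
        best[doc_id] = max(score, best.get(doc_id, score))
    top_ids = sorted(best, key=best.get, reverse=True)[:max_docs]
    context = "\n\n".join(f"Document: {doc_id}\n{documents[doc_id]}" for doc_id in top_ids)
    return top_ids, context


# Illustrative corpus and retrieval hits: (parent_doc_id, chunk_score).
documents = {
    "contract_a": "Full text of contract A ...",
    "policy": "Full text of the returns policy ...",
    "faq": "Full text of the FAQ ...",
}
hits = [("contract_a", 0.91), ("faq", 0.55), ("contract_a", 0.60), ("policy", 0.72)]
top_ids, context = hybrid_context(hits, documents, max_docs=2)
print(top_ids)
```

The returned `top_ids` double as the source attribution for step 5, since they identify exactly which documents the model reasoned over.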

There is an emerging pattern worth watching: KV cache retrieval, where pre-computed KV caches are stored per document in a database, then retrieved based on the query. This gives RAG-scale coverage with CAG-speed inference. Architecturally complex, but increasingly viable as tooling matures.
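
One way to picture KV cache retrieval is a bounded store of per-document caches that builds on a miss and evicts the least recently used entry. This is a conceptual sketch under loud assumptions: `DocCacheStore` is hypothetical, and the `build_cache` callable returns a placeholder string where a real system would run a forward pass and persist large tensors.

```python
from collections import OrderedDict
from typing import Any, Callable


class DocCacheStore:
    """Least-recently-used store of per-document KV caches (placeholders here)."""

    def __init__(self, build_cache: Callable[[str], Any], capacity: int = 2) -> None:
        self._build = build_cache
        self._capacity = capacity
        self._store: "OrderedDict[str, Any]" = OrderedDict()
        self.builds = 0  # counts expensive cache computations

    def get(self, doc_id: str) -> Any:
        if doc_id in self._store:
            self._store.move_to_end(doc_id)  # mark as recently used
            return self._store[doc_id]
        cache = self._build(doc_id)          # expensive step in a real system
        self.builds += 1
        self._store[doc_id] = cache
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return cache


store = DocCacheStore(build_cache=lambda doc_id: f"kv-state-for-{doc_id}", capacity=2)
for doc_id in ["a", "b", "a", "c", "b"]:
    store.get(doc_id)
print(store.builds)
```

The capacity limit matters in practice because each stored cache can run to gigabytes, so the hit rate of the store largely determines whether the pattern beats recomputing from text.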


Quick Reference: Key Terms

| Term | What it is |
|---|---|
| KV Cache | Key-value pairs storing attention computation states. The backbone of CAG's speed advantage. |
| Vector DB | Stores document embeddings for semantic similarity search at scale. |
| Embeddings | Dense vector representations of text. Enable semantic rather than keyword search. |
| Chunking | Splitting documents into segments sized for retrieval, typically 256-512 tokens. |
| Hybrid search | Combining vector similarity search with keyword (BM25) for better recall. |
| Reranking | Cross-encoder model that reorders retrieved documents for precision. |
| Prompt Caching | Caching the processed state of a long system prompt across multiple API calls. |

Sources

Chan, B. J. et al. (2024). Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks. ACM Web Conference 2025.

LlamaIndex + Pinecone production benchmarks (2026): p99 latency 112 ms, 92% answer relevance with hybrid search.

RAG Frameworks comparison (LangChain, LlamaIndex, Haystack, RAGFlow). LLM Practical Experience Hub, September 2025.

Anthropic Prompt Caching documentation (2025): 90% token cost reduction, ~70% TTFT improvement.

Chan et al. findings: CAG outperforms RAG on HotPotQA and SQuAD benchmarks when the full corpus fits in the context window.



RAG vs CAG: The Definitive Guide