Information to separate before mixing long context and RAG

For / Key Points

For: Practitioners who want to use RAG or long context at work but are unsure about search cost, latency, and document splitting.

Key points:

RAG and long context should not be chosen as rivals; the first decision is where each kind of information belongs.
Use RAG for information that changes by query, long context for a small number of documents that need full reading, and caching for stable information reused across calls.
Separating these categories before implementation reduces latency, missed evidence, and unnecessary chunking.

Published: 2026-06-26

A team wants an LLM to read meeting notes, product manuals, policy PDFs, FAQs, and support history. When the design is still rough, the question often becomes too binary: should everything go into RAG, or should everything fit into a long context window?

That binary decision comes too early. The first thing to separate is not the technology, but the information. Information retrieved per question, information read in full for the current task, and information reused in the same form across many calls should not share one bucket.

Start with three buckets

RAG is a design in which the application retrieves external knowledge before the model answers and passes the retrieved material to the model as context. Microsoft Learn describes RAG as a pattern that adds retrieved context to the user's prompt so the language model can generate an answer from that context.¹

Long context is closer to passing a large input directly to the model instead of searching first. Google's Gemini API documentation describes long-context use cases enabled by context windows of one million tokens or more, and frames the context window as a form of short-term memory available to the model.²

Caching sits alongside those two patterns rather than replacing either. Anthropic's prompt caching documentation describes it as a way to resume from repeated prompt prefixes, reducing processing time and cost for repetitive tasks or stable prompt elements.³

Use the following table before implementation.

Information type	Bucket	Examples	Common misuse
Large corpus where the needed passages change by query	RAG	Internal wiki, FAQ, specifications, support history	Passing everything every time and paying for latency
Small set where the full flow matters for one decision	Long context	One contract, one meeting transcript, one design review deck	Chunking too aggressively and breaking continuity
Stable material reused in the same shape many times	Cache	System policy, quality rubric, product constraints	Paying for the same long prefix every call

RAG, long context, and caching are not mutually exclusive. In a real business application, they are usually separate layers inside the same context pipeline.
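One way to make the table above operational is a small routing function. The sketch below is illustrative only: the `Source` fields, the `route` function, and the order of the checks are assumptions made for this example, not something the cited documentation prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    RAG = "rag"
    LONG_CONTEXT = "long_context"
    CACHE = "cache"


@dataclass
class Source:
    name: str
    reused_unchanged: bool   # same text sent on most calls (rubric, policy)
    needs_full_read: bool    # order and exceptions matter (contract, transcript)
    changes_by_query: bool   # relevant passages differ per question (wiki, FAQ)


def route(source: Source) -> Bucket:
    """Assign a source to one bucket, checking the most stable case first."""
    if source.reused_unchanged:
        return Bucket.CACHE
    if source.needs_full_read and not source.changes_by_query:
        return Bucket.LONG_CONTEXT
    # Default: large or changing material is retrieved per question.
    return Bucket.RAG


# Hypothetical sources, mirroring the examples in the table.
sources = [
    Source("quality_rubric", True, False, False),
    Source("vendor_contract", False, True, False),
    Source("internal_wiki", False, False, True),
]
routing = {s.name: route(s).value for s in sources}
print(routing)
```

Checking the cache case first reflects the idea that material needed on every call should not compete for retrieval slots. A real project would likely need more attributes, such as update frequency and size.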

Keep in RAG

RAG fits information where the relevant section changes with each question. HR FAQs, product specifications, incident histories, and knowledge bases are common examples: the corpus is large, but each answer usually needs a limited subset.

RAG's strength is selective retrieval: only the passages retrieved for the current question are added to the prompt, so the model answers from a focused slice of the corpus rather than all of it.¹

The trade-off is that the corpus must be prepared for retrieval. Microsoft's RAG design guide notes that the architecture is straightforward, but design, experimentation, and evaluation require a rigorous approach.⁴ In practice, weak chunking, metadata, embedding models, reranking, or permission filters can keep the right evidence out of the top results.

Keep information in RAG when all of the following are true:

Reading the whole corpus every time is too heavy.
The needed source changes by question.
The source is updated continuously.
The answer needs citations or evidence locations.

When these conditions hold, retrieval should remain a layer in the design.
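The retrieval layer described in this section can be sketched with a toy retriever. This is a minimal illustration, not a production design: term overlap stands in for an embedding search, and the `Chunk` fields, the `retrieve` function, and the sample corpus are all hypothetical. It shows two points from the text: permission filters run before ranking, and each hit carries a location that can be cited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    section: str
    text: str
    allowed_groups: frozenset


def tokenize(text: str) -> set:
    """Naive tokenizer: lowercase words with trailing punctuation removed."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query: str, chunks: list, user_group: str, k: int = 2) -> list:
    """Return up to k permitted chunks ranked by shared terms with the query."""
    query_terms = tokenize(query)
    scored = []
    for chunk in chunks:
        if user_group not in chunk.allowed_groups:
            continue  # permission filter runs before ranking
        score = len(query_terms & tokenize(chunk.text))
        if score > 0:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


corpus = [
    Chunk("hr-faq", "leave", "Paid leave requests need manager approval.",
          frozenset({"staff", "hr"})),
    Chunk("hr-faq", "salary", "Salary bands are reviewed by HR each April.",
          frozenset({"hr"})),
    Chunk("manual", "reset", "Hold the reset button for ten seconds.",
          frozenset({"staff", "hr"})),
]

hits = retrieve("Who approves paid leave?", corpus, user_group="staff")
citations = [(c.doc_id, c.section) for c in hits]
print(citations)
```

Keeping `doc_id` and `section` on every chunk is what makes evidence locations available at answer time.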

Put into long context

Long context fits a small number of documents where the full flow changes the answer. One contract with exceptions, one meeting transcript with a decision trail, or one design review with assumptions and conclusions can be safer to read as a whole than as isolated chunks.

The value of long context is that it avoids splitting too early. Google lists long context use cases around understanding large unstructured inputs such as documents, videos, and code.² Those cases point to situations where the full structure of the material matters more than a few top-ranked passages.

More context is not always better. Anthropic's context window documentation warns that, as token count grows, accuracy and recall can degrade, and that larger context is not automatically an improvement.⁵

Use long context when these conditions hold:

The target material is small in count.
Order, continuity, and exceptions matter.
Full-document reading is more natural than passage retrieval.
The answer is one-off or low-frequency.

In this zone, trying full-context reading before building a retrieval index can be reasonable.
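Before trying full-context reading, a rough size check can show whether the documents fit at all. The sketch below assumes a heuristic of about four characters per token and uses made-up window sizes; both are placeholders, and a real tokenizer and the actual model limit should replace them. Fitting inside the window is a necessary condition only, since the caution above about degraded recall at high token counts still applies.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough size estimate; a real tokenizer should replace this heuristic."""
    return int(len(text) / chars_per_token) + 1


def fits_full_context(documents: list, window_tokens: int, reserve_tokens: int) -> bool:
    """True if all documents fit while leaving room for instructions and the answer."""
    needed = sum(estimate_tokens(doc) for doc in documents)
    return needed <= window_tokens - reserve_tokens


# Hypothetical inputs: one contract and one meeting transcript.
contract = "x" * 40_000
transcript = "y" * 20_000

fits_large = fits_full_context([contract, transcript], window_tokens=32_000, reserve_tokens=4_000)
fits_small = fits_full_context([contract, transcript], window_tokens=16_000, reserve_tokens=4_000)
print(fits_large, fits_small)
```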

Move stable material to cache

Caching fits information that is long, stable, and reused repeatedly. Examples include an agent constitution, an article quality rubric, a glossary, shared product constraints, or a fixed output contract.

If these are searched through RAG every time, retrieval variance enters the prompt. If they are resent as normal long context every time, the application pays repeatedly for the same input. Caching gives this repeated prefix its own bucket.

Anthropic's prompt caching documentation positions caching as useful for repetitive tasks and prompts with consistent elements.³ Google's context caching documentation similarly describes caching for workflows that pass the same input tokens to a model repeatedly, with performance and cost optimization as the goal.⁶

Cache material when these conditions hold:

The same content is reused often.
The content does not change frequently.
Retrieval variance is undesirable.
The input is long enough that repeated sending is costly.

Putting this material into RAG pollutes the search corpus. Putting it into ordinary long context makes every request heavier. Caching is not knowledge retrieval; it is a home for the stable base of the prompt.
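Because prefix-based caching depends on the beginning of the prompt repeating exactly, prompt assembly order matters. The following sketch places stable material first and per-call material last, and fingerprints the stable part so that a content change is visible. It does not call any provider's caching API; `build_prompt`, `prefix_fingerprint`, and the section labels are hypothetical names for illustration.

```python
import hashlib


def prefix_fingerprint(stable_parts: list) -> str:
    """Hash the stable prefix so a content change shows up as a new key."""
    joined = "\n\n".join(stable_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


def build_prompt(stable_parts: list, retrieved: list, question: str) -> str:
    """Stable material first, per-call material last, so the prefix repeats exactly."""
    prefix = "\n\n".join(stable_parts)
    evidence = "\n".join(retrieved)
    return f"{prefix}\n\n# Evidence\n{evidence}\n\n# Question\n{question}"


stable = [
    "SYSTEM POLICY: answer only from the evidence provided.",
    "RUBRIC: cite a section for every claim.",
]
prompt_a = build_prompt(stable, ["[hr-faq/leave] Paid leave needs approval."], "Who approves paid leave?")
prompt_b = build_prompt(stable, ["[manual/reset] Hold the reset button."], "How do I reset the device?")

shared_prefix = "\n\n".join(stable)
same_key = prefix_fingerprint(stable)
edited_key = prefix_fingerprint(stable[:1] + ["RUBRIC: cite two sections for every claim."])
print(same_key != edited_key)
```

Logging the fingerprint with each call is one simple way to notice when an edit to a rubric or policy has silently invalidated the cached prefix.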

Review checklist before mixing

When combining RAG and long context, review the design in this order.

1. Classify the material. Label each source as retrieval, full-context, or cache.
2. Shrink the retrieval corpus. Remove fixed rubrics and policies from RAG if they are always needed.
3. Limit full-context inputs. Only pass full documents when reading them as documents matters.
4. Create retrieval evaluation queries. Microsoft describes RAG evaluation through multiple metrics over retrieved data, context, and final answers.⁷
5. Define the fallback path. If retrieval misses evidence, improve RAG. If full context is slow, summarize or cache. If the cache becomes stale, fix the update rule.

If the material has not been classified this way yet, it is too early to decide whether RAG or long context is the answer. Information routing comes first.
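The evaluation-query step in the checklist can start small. The sketch below computes a single metric, the fraction of queries whose expected document appears in the top k results. It is only one of many possible retrieval metrics and is not presented as Microsoft's method; `hit_rate_at_k`, `stub_search`, and the canned results are hypothetical stand-ins for a real search function.

```python
def hit_rate_at_k(eval_set: list, search, k: int = 3) -> float:
    """Fraction of queries whose expected document appears in the top k results."""
    hits = 0
    for query, expected_doc in eval_set:
        if expected_doc in search(query)[:k]:
            hits += 1
    return hits / len(eval_set)


# Canned rankings standing in for a real retriever (assumption for the example).
canned = {
    "How do I reset the device?": ["manual", "faq"],
    "Who approves paid leave?": ["faq", "hr-policy"],
    "What is the refund window?": ["manual", "faq"],
}


def stub_search(query: str) -> list:
    return canned.get(query, [])


eval_set = [
    ("How do I reset the device?", "manual"),
    ("Who approves paid leave?", "hr-policy"),
    ("What is the refund window?", "billing-policy"),
]

score_at_2 = hit_rate_at_k(eval_set, stub_search, k=2)
score_at_1 = hit_rate_at_k(eval_set, stub_search, k=1)
print(round(score_at_2, 2), round(score_at_1, 2))
```

A query that misses at every k, like the refund question here, points to a gap in the corpus or the chunking rather than in the ranking.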

Common misunderstandings

Does long context make RAG unnecessary?

No. Long context increases how much material the model can read in the current call. RAG chooses what material should be read.

If the corpus is always small and can be read in full, RAG can be overkill. Once the corpus grows, changes often, carries access permissions, or needs citations, a retrieval layer remains useful.

Does RAG remove the need for full-document reading?

No. Chunking is useful, but it cuts context. Microsoft's chunking guidance notes that chunks that are too small may not contain enough context to answer a query, and that relevant context can span multiple chunks.⁸

Contracts, meeting records, and design decisions often depend on order and exceptions. One workable design is to use RAG to identify the candidate document, then use long context to read that document in full.
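The retrieve-then-read design mentioned above can be sketched in a few lines. The example assumes retrieval has already returned chunk hits as `(doc_id, section)` pairs; `pick_document`, `full_document_context`, and the sample data are hypothetical. Choosing the document with the most hits is one simple selection rule among several possible ones.

```python
from collections import Counter


def pick_document(chunk_hits: list) -> str:
    """Choose the document that owns the most retrieved chunks."""
    counts = Counter(doc_id for doc_id, _ in chunk_hits)
    return counts.most_common(1)[0][0]


def full_document_context(chunk_hits: list, documents: dict) -> str:
    """Use retrieval only to select the document, then return its full text."""
    doc_id = pick_document(chunk_hits)
    return documents[doc_id]


# Hypothetical retrieval output and document store.
chunk_hits = [
    ("contract-A", "clause 4"),
    ("contract-B", "clause 2"),
    ("contract-A", "clause 9"),
]
documents = {
    "contract-A": "Full text of contract A, including clause 4, its exceptions, and clause 9.",
    "contract-B": "Full text of contract B.",
}

chosen = pick_document(chunk_hits)
context = full_document_context(chunk_hits, documents)
print(chosen)
```

The model then reads `context` as one continuous document, so exceptions that sit between the retrieved clauses are not lost.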

Is caching a replacement for RAG?

No. Caching is reuse, not retrieval. It should hold stable instructions, rubrics, and output contracts, not information that should be searched.

Summary

Before mixing RAG and long context, separate information into three buckets.

Use RAG for information where the needed section changes by question.
Use long context for a small number of documents whose full flow matters.
Use caching for stable information reused in the same form.

This classification keeps the design away from both "RAG is dead" and "long context solves everything" narratives. The real design question is not which technology is stronger; it is which information enters the model, when, and in what shape.

¹ Microsoft Learn, "Retrieval Augmented Generation (RAG) in Azure AI Foundry"
² Google AI for Developers, "Long context"
³ Anthropic, "Prompt caching"
⁴ Microsoft Learn, "Design and Develop a RAG Solution"
⁵ Anthropic, "Context windows"
⁶ Google AI for Developers, "Context caching"
⁷ Microsoft Learn, "Large Language Model End-to-End Evaluation Phase"
⁸ Microsoft Learn, "Develop a RAG Solution - Chunking Phase"