
Graph RAG

The big shift since 2024 is that Graph RAG stopped being prohibitively expensive. Microsoft's original GraphRAG cost around $33K to index large datasets, which made it impractical for most teams, and most of 2025 was research aimed at fixing that. By 2026 it's mostly solved. LazyGraphRAG is now the default starting point: it defers LLM summarisation to query time and only does lightweight graph construction during indexing. Indexing costs match vector RAG — roughly 0.1% of full GraphRAG — and global queries match full GraphRAG quality at more than 700x lower cost. Microsoft Discovery and Azure Local shipped it in public preview in mid-2025, with open-source library integration landing in Q1–Q2 2026. Production deployments are reporting 70–97% cost reductions versus full GraphRAG with equal or better answer quality.

Hybrid routing has won the architecture debate. Nobody serious is running pure GraphRAG anymore. The pattern is a query classifier that routes simple factual lookups to vector RAG, multi-hop relationship queries to GraphRAG, and uses LazyGraphRAG to make the global-query path affordable. Microsoft's BenchmarkQED found LazyGraphRAG beat every competing method — including vector RAG with a 1M-token context window — using the same generative model, suggesting graph structure at index time matters more than raw context length at query time. That's the headline result driving 2026 architecture choices.

Where it actually helps: cross-document multi-hop reasoning, regulatory/compliance, research synthesis, competitive intelligence — anywhere the answer requires chaining facts across separate documents. One reported production case saw accuracy on complex multi-hop questions jump from 43% to 91% after rebuilding from vector RAG to GraphRAG with proper entity resolution and hierarchical community detection. For "what does doc X say about Y" it remains overkill.

Tooling state: Microsoft's GraphRAG library hit 1.0 in late 2024, and the latest release in March 2026 brought performance optimizations and new query capabilities. Neo4j and FalkorDB have managed offerings. On the agentic side, LangGraph + LlamaIndex retrieval is the common production stack: LlamaIndex does indexing/chunking, LangGraph runs the control flow as a stateful directed graph.

What's still hard: entity disambiguation, multi-index synchronisation when documents update, and re-indexing cost on streaming data — though LazyGraphRAG largely sidesteps the last one. Multimodal GraphRAG (images, tables) is still mostly research. The practical 2026 advice across the recent literature is consistent: don't reach for GraphRAG by default, route by query type, and if you do need it, start with LazyGraphRAG rather than the original.


Why Graph RAG Exists and When It Actually Helps

Imagine you're building a compliance tool for a financial institution. An analyst asks: "Which of our third-party vendors share regulatory exposure with the suppliers flagged in last quarter's audit?" You have thousands of documents — contracts, audit reports, regulatory filings, supplier assessments. Your vector RAG system dutifully retrieves the passages most semantically similar to the query. It surfaces relevant-sounding paragraphs about vendors and regulations. But it cannot answer the question. The answer isn't in any single passage — it's across several of them, connected by a chain of relationships that the retrieval system never built. This is the problem Graph RAG was designed to solve.

Understanding why Graph RAG exists means understanding, precisely, where standard retrieval breaks down — not in a vague "sometimes it's not good enough" way, but in a structural way that explains exactly which queries will fail and why. That's what this section is about.

The Structural Limit of Vector Retrieval

Vector RAG works by encoding documents into chunks, embedding those chunks as vectors, and at query time, retrieving the chunks whose embeddings are most similar to the query embedding. It is genuinely excellent at a large class of retrieval problems: finding a definition, surfacing a relevant paragraph, locating what a document says about a topic. The underlying assumption is that the right answer lives inside a passage — or at worst, across a few passages that are semantically related to the query.

That assumption breaks for multi-hop reasoning. Consider the chain of inference required to answer "which suppliers share a regulatory risk with vendor X":

Step 1: Identify which regulations apply to vendor X.
        → This fact lives in Document A (vendor assessment).

Step 2: Find which other suppliers are subject to those same regulations.
        → This fact lives in Documents B, C, D (separate supplier filings).

Step 3: Confirm which of those suppliers are in the current vendor list.
        → This fact lives in Document E (procurement registry).

Final answer: The intersection of steps 2 and 3, filtered through step 1.

No single chunk contains this answer. Worse, the chunks from Documents B, C, and D may not be semantically similar to the query at all — they mention regulation names and supplier codes, not the phrase "regulatory risk with vendor X." Vector similarity retrieval will rank them low or miss them entirely. The system isn't failing because it lacks intelligence; it's failing because semantic similarity is not the same as logical relevance in a multi-hop chain.

💡 Mental Model: Vector RAG is like searching a library by browsing shelves near the topic you care about. Graph RAG is like following a citation trail — the answer you need might be three papers removed from your starting point, connected by references, not by subject proximity.

What Graph RAG Actually Builds

Graph RAG addresses this by changing what happens at index time. Rather than chunking documents and embedding them, it extracts entities (people, organizations, regulations, products, locations, concepts) and relationships between them from the source documents. These become nodes and edges in a knowledge graph.

  [Vendor X] ──── subject_to ────► [Regulation 17-B]
                                         │
                               also_applies_to
                                         │
                                         ▼
                              [Supplier Corp A]
                              [Supplier Corp B]
                                         │
                                  listed_in
                                         │
                                         ▼
                           [Active Vendor Registry]

With this structure in place, the query "which suppliers share regulatory exposure with vendor X" becomes a graph traversal: start at the node for Vendor X, follow the subject_to edge to Regulation 17-B, then follow also_applies_to edges to other entities, then filter by membership in the vendor registry. The answer emerges from the path through the graph, not from ranking passages.

This is the core architectural difference: Graph RAG externalizes the relational structure of your documents into a queryable graph, so that multi-hop inference becomes graph traversal rather than guesswork.
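
To make the traversal concrete, here is a minimal sketch in Python using networkx. The node names and edge labels mirror the diagram above; they are illustrative rather than a real schema, and a production system would run the equivalent query against a graph store such as Neo4j instead of an in-memory graph.

import networkx as nx

g = nx.DiGraph()
g.add_edge("Vendor X", "Regulation 17-B", rel="subject_to")
g.add_edge("Regulation 17-B", "Supplier Corp A", rel="also_applies_to")
g.add_edge("Regulation 17-B", "Supplier Corp B", rel="also_applies_to")
g.add_edge("Supplier Corp A", "Active Vendor Registry", rel="listed_in")

def shared_regulatory_exposure(graph, vendor):
    """Entities subject to the same regulations as `vendor`, filtered to the active registry."""
    hits = set()
    # Hop 1: regulations the vendor is subject to
    for _, reg, d in graph.out_edges(vendor, data=True):
        if d["rel"] != "subject_to":
            continue
        # Hop 2: other entities those regulations also apply to
        for _, other, d2 in graph.out_edges(reg, data=True):
            if d2["rel"] == "also_applies_to":
                # Hop 3: keep only entities listed in the active vendor registry
                if any(d3["rel"] == "listed_in" for _, _, d3 in graph.out_edges(other, data=True)):
                    hits.add(other)
    return hits

print(shared_regulatory_exposure(g, "Vendor X"))   # {'Supplier Corp A'}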

Where Graph RAG Produces Real Gains

The performance advantage of Graph RAG is not evenly distributed — it concentrates sharply in specific workload types. Understanding this distribution is what separates pragmatic adoption from cargo-cult engineering.

Cross-Document Multi-Hop Reasoning

This is the core use case. Any question whose correct answer requires chaining at least two facts drawn from separate source documents is a candidate. Examples:

  • 🔧 "Which clinical trials involve compounds that were flagged in the 2022 safety review and are currently in Phase III?" — requires linking safety documents to trial registries.
  • 📚 "Which patent holders are also defendants in open litigation involving the same technology class?" — requires traversing patent filings and legal documents.
  • 🎯 "Which infrastructure components have a known CVE that affects a vendor listed in our critical supplier register?" — requires linking vulnerability databases to procurement records.

In documented production migrations from pure vector RAG to Graph RAG with proper entity resolution and hierarchical community detection, accuracy on complex multi-hop questions has improved dramatically — in some reported cases jumping from below 50% to above 90%. These are not marginal gains. They reflect the structural difference between a system that can traverse relationships and one that cannot.

Regulatory and Compliance Analysis

Compliance work is structurally multi-hop: regulations reference other regulations, which apply to certain entity types, which map to specific internal processes, which are owned by identifiable teams. The "correct answer" to a compliance query is almost always a path through that network. Vector RAG retrieves relevant-sounding passages; Graph RAG can actually traverse the regulatory dependency chain.

Research Synthesis

Literature synthesis — "what does the current body of evidence say about X, across papers that may not use the same terminology" — benefits from the graph's ability to link concepts through co-citation and shared entity references, rather than relying on vocabulary overlap. A paper using "myocardial infarction" and one using "heart attack" may not retrieve together by embedding similarity but will be linked through shared entity nodes if the graph is built correctly.

💡 Real-World Example: Consider a research team synthesizing regulatory guidance across pharmaceutical submissions. Documents from different years use different terminology for the same compounds. Vector retrieval misses connections because the vocabulary drifted. A knowledge graph built with entity resolution — recognizing that "Compound XR-7," "formulation 7-XR," and the branded name refer to the same entity — surfaces connections that embedding similarity cannot.

Where Graph RAG Does Not Help

This is the part that gets skipped in most Graph RAG enthusiasm, and skipping it leads to expensive mistakes.

For single-document lookups and straightforward factual retrieval, Graph RAG adds cost and complexity without meaningful accuracy improvement. If someone asks "what does the Q3 earnings report say about revenue growth," the answer is in one document. Vector RAG finds the relevant chunk. Graph RAG would build a knowledge graph of entities in that document, traverse it, and arrive at the same passage — with dramatically more indexing overhead.

Wrong thinking: "Graph RAG is more sophisticated, so it should work better across the board."

Correct thinking: "Graph RAG solves a specific structural problem — multi-hop traversal. For problems that don't have that structure, it's overhead with no return."

The same logic applies to narrow question-answering against a single source: chatbots answering FAQ-style questions, document summarization, definition lookups, semantic search over product catalogs. These are vector RAG's natural territory.

⚠️ Common Mistake: Reaching for Graph RAG because answers seem to require "understanding" the documents more deeply. Depth of understanding is not the bottleneck in most retrieval failures — the bottleneck is either recall (finding the right chunks) or relational traversal (chaining facts). Graph RAG only addresses the second. If your failure mode is recall, better chunking strategy and embedding models will help more.

The Decision Rule

All of this reduces to a practical heuristic that is straightforward to apply before you commit to any architecture:

🎯 Key Principle: Reach for Graph RAG when the correct answer requires chaining at least two facts from separate source documents. Default to vector RAG otherwise.

You can operationalize this by sketching the reasoning path required to answer your representative queries:

Query analysis template:

1. What is the final answer I need?
2. What is the minimum set of facts required to construct it?
3. Do those facts live in one document or multiple?
4. Are the facts *linked by relationships* (entity A relates to entity B)
   or *co-located by topic* (both mention concept X)?

If (3) = multiple AND (4) = linked by relationships → Graph RAG candidate
If (3) = one document OR (4) = co-located by topic  → Vector RAG is sufficient

This is a heuristic, not an exhaustive test — edge cases exist, and the full routing architecture (covered in Section 3) handles the production complexity of mixing both. But as a first-pass decision filter, it catches the majority of cases correctly.

📋 Quick Reference Card: Vector RAG vs. Graph RAG by Query Type

Query type → recommended approach, and why:

  • 🔍 "What does doc X say about Y?" → Vector RAG (single-document, topic lookup)
  • 📚 FAQ / definition lookup → Vector RAG (semantic similarity sufficient)
  • 🔗 Multi-hop: "Which A's are linked to B through C?" → Graph RAG (requires relationship traversal)
  • 🔒 Compliance chain analysis → Graph RAG (regulatory dependencies are graph-structured)
  • 📊 Research synthesis across many sources → Graph RAG (entity co-reference across documents)
  • 🧠 Summarize this document → Vector RAG (no cross-document traversal needed)
  • 🔧 "Who owns the system that depends on component Z?" → Graph RAG (dependency graph traversal)

🤔 Did you know? The failure mode Graph RAG addresses — an agent retrieving individually relevant chunks that don't combine into a coherent answer — has a name in the retrieval literature: context fragmentation. The retrieved passages each score well on similarity but don't contain the relational connective tissue that makes the answer derivable. Graph RAG doesn't retrieve more context; it restructures what's retrievable so the connective tissue exists at query time.

Setting Up the Right Mental Model Before Going Deeper

Before moving into the mechanics of how Graph RAG builds and queries its knowledge graph, it's worth anchoring the intuition firmly: Graph RAG is not a universally better version of RAG. It is a solution to a specific class of retrieval problems that vector similarity cannot handle structurally. Its costs are real — indexing is more complex, entity extraction introduces its own failure modes, and maintaining the graph as documents update creates synchronization challenges (all covered in later sections).

The practical posture that has emerged in production systems is not "use Graph RAG for everything" but "know which queries need it, route those queries to it, and let vector RAG handle the rest." That hybrid routing pattern, and the LazyGraphRAG approach that made it economically viable, is what the next two sections address.

What matters at this stage is the diagnostic clarity: if you can identify a query whose answer requires traversing a relationship chain across documents, you've identified a Graph RAG candidate. If you can't draw that chain, you're probably solving a recall problem, not a traversal problem — and Graph RAG won't fix it.

🧠 Mnemonic: CHAIN — if the answer requires Chaining facts, Hopping documents, Across entities that are Interrelated by Name, reach for Graph RAG. Otherwise, vector RAG is your starting point.

How Graph RAG Works: Indexing, Communities, and Query Execution

To understand what Graph RAG actually does differently, you need a precise picture of two distinct pipelines: what happens before a query arrives (index time), and what happens when a query actually executes. Most confusion about Graph RAG comes from treating it as a single monolithic system rather than these two separable stages — each with its own cost profile and tradeoffs. This section walks through both in detail, including how community detection bridges them.

The Index-Time Pipeline: From Documents to a Property Graph

In a standard vector RAG system, indexing means chunking documents and embedding those chunks into a vector store. Graph RAG adds a fundamentally different step: it uses an LLM to read through source text and extract entities (named things — people, organizations, concepts, locations, technologies) and relationships between them. The result is a property graph, where nodes represent entities and edges represent typed relationships between them.

Here's what that pipeline looks like concretely:

Source Documents
      │
      ▼
┌─────────────────────────────────┐
│  LLM Entity & Relation Extractor│
│  (reads each chunk)             │
│  Extracts: (entity, type)       │
│  Extracts: (entity A)──[rel]──► │
│            (entity B)           │
└──────────────┬──────────────────┘
               │
               ▼
┌─────────────────────────────────┐
│  Entity Resolution              │
│  "OpenAI" + "Open AI" → one node│
│  "GPT-4" + "GPT4" → one node    │
└──────────────┬──────────────────┘
               │
               ▼
┌─────────────────────────────────┐
│  Property Graph                 │
│  Nodes: entities + attributes   │
│  Edges: relationships           │
│  Provenance: chunk_id stored    │
│  on each edge                   │
└─────────────────────────────────┘

Each edge in this graph stores provenance — a reference back to the source chunk from which the relationship was extracted. This matters for retrieval: when a query traverses the graph and arrives at a relationship, it can pull the original source text to support its answer rather than relying purely on the graph structure.

Entity resolution is the deduplication step that merges nodes that refer to the same real-world entity. Without it, you'd end up with separate nodes for "OpenAI," "Open AI," "OpenAI Inc.," and "OpenAI LP" — all disconnected, even though documents discussing each are talking about the same organization. Good entity resolution is what allows multi-hop traversal to actually work across a messy, real-world corpus. (Section 4 covers where entity resolution fails in practice — it's one of the harder operational problems.)

💡 Mental Model: Think of the index-time LLM extraction like having a researcher read every document and fill out a structured form: "Entity A relates to Entity B in way W, and I found this in chunk #42." The property graph is the accumulated filing cabinet of all those forms.
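
A minimal sketch of that filing-cabinet idea follows, assuming a hypothetical llm_extract helper whose hard-coded triples stand in for real model output. Entity resolution would run between extraction and graph insertion, merging surface forms before edges are written.

import networkx as nx

def llm_extract(chunk_text):
    """Placeholder for an LLM call that returns (head, relation, tail) triples.
    A real prompt would ask for typed entities and relationships in a strict output format."""
    # Hard-coded output stands in for model output in this sketch.
    return [("OpenAI", "released", "GPT-4"), ("GPT-4", "trained_with", "RLHF")]

def build_property_graph(chunks):
    graph = nx.MultiDiGraph()
    for chunk_id, text in enumerate(chunks):
        for head, rel, tail in llm_extract(text):
            # Provenance: every edge remembers which chunk it came from,
            # so query-time traversal can pull the original source text.
            graph.add_edge(head, tail, rel=rel, chunk_id=chunk_id)
    return graph

graph = build_property_graph(["...chunk text..."])
for u, v, data in graph.edges(data=True):
    print(u, data["rel"], v, "(from chunk", data["chunk_id"], ")")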

Community Detection: Structuring the Graph for Retrieval

A raw property graph with tens of thousands of nodes is not directly queryable at useful granularity for corpus-wide questions. The next major step is community detection — grouping nodes into clusters based on their connectivity. The most commonly used algorithm is Leiden, a hierarchical community detection method that produces communities at multiple resolutions simultaneously.

The key word here is hierarchical. Leiden doesn't just split nodes into one flat set of clusters. It produces a nested structure:

Level 0 (finest): Small, tightly connected clusters
  e.g., [OpenAI, GPT-4, Sam Altman, RLHF]
        [Anthropic, Claude, Constitutional AI]

Level 1 (coarser): Broader thematic groupings
  e.g., [AI Labs, Frontier Models, Safety Research]

Level 2 (coarsest): Domain-level communities
  e.g., [Artificial Intelligence Industry]

This hierarchy turns out to be directly useful for query execution. Local, specific questions are answered by traversing fine-grained Level 0 clusters close to the queried entities. Broad, corpus-wide questions — "what are the major themes across all these regulatory documents?" — are answered by aggregating across the coarser Level 2 community summaries rather than reading every node.

🎯 Key Principle: The community hierarchy maps directly onto query scope. Fine-grained communities answer specific relationship questions; coarse communities answer synthesis and landscape questions. A retrieval system that only has one resolution will either over-retrieve for simple questions or under-synthesize for broad ones.
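
As a sketch of what multi-resolution clustering looks like in code, the snippet below runs networkx's built-in Louvain implementation at two resolution settings as a stand-in for hierarchical Leiden (the GraphRAG library uses its own Leiden tooling). The toy edges are illustrative.

import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("OpenAI", "GPT-4"), ("GPT-4", "RLHF"), ("OpenAI", "Sam Altman"),
    ("Anthropic", "Claude"), ("Claude", "Constitutional AI"),
    ("OpenAI", "Anthropic"),   # weak cross-cluster link
])

# Higher resolution -> smaller, tighter clusters (like Level 0 above);
# lower resolution -> broader thematic groupings (coarser levels).
fine = nx.community.louvain_communities(g, resolution=2.0, seed=42)
coarse = nx.community.louvain_communities(g, resolution=0.5, seed=42)

print("fine:  ", fine)
print("coarse:", coarse)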

Full GraphRAG: Pre-Generated Summaries and Their Cost

In the original full GraphRAG approach (the design Microsoft published and that drew significant early attention), the system pre-generates LLM summaries for every community at every level of the hierarchy — at index time. Each community summary is a condensed description of what that cluster of entities is about and how they relate.

This approach has a clear benefit: at query time, global questions can be answered very quickly by aggregating over pre-computed summaries rather than doing live LLM inference across the entire corpus structure.

The problem is cost. Generating LLM summaries for every community at every hierarchical level, across a large corpus, means a very large number of LLM calls during indexing. For a large document set, this indexing cost became prohibitively high for most teams — sometimes orders of magnitude beyond what a comparable vector RAG pipeline would cost. This is the core economics problem that drove most Graph RAG research through 2025.

⚠️ Common Mistake: Teams sometimes assume that Graph RAG's high cost is inherent to the approach rather than specific to the full GraphRAG design choice. The cost is not a property of graph-based retrieval in general — it's a property of pre-generating summaries for all communities eagerly. LazyGraphRAG proves this by eliminating most of it.

LazyGraphRAG: Deferring Summarization to Query Time

LazyGraphRAG is the architectural variant that solved the indexing cost problem. The core insight is straightforward: instead of pre-generating community summaries at index time, build the graph structure (entities, relationships, community membership) without LLM-generated summaries, and only invoke LLM summarization when a query actually needs it.

The indexing phase becomes much lighter:

Full GraphRAG Index Time:          LazyGraphRAG Index Time:

  Extract entities/relations         Extract entities/relations
         │                                  │
  Run community detection            Run community detection
         │                                  │
  [For EVERY community]:             Store community structure
    Generate LLM summary             (no LLM summary calls)
         │                                  │
  Store all summaries                Done. ✓
         │
  Done. (after many LLM calls)

At query time, LazyGraphRAG identifies the relevant communities for the query and then generates summaries on demand — only for the communities actually needed to answer that specific question. For a global query that genuinely needs broad synthesis, this might still trigger many summary generations. But for most queries, it's far fewer than pre-generating everything.

The practical result reported from production deployments is that indexing costs become comparable to those of vector RAG, while answer quality on global synthesis queries matches full GraphRAG. This is why LazyGraphRAG has become the default starting point for new Graph RAG deployments rather than a niche option.

🤔 Did you know? The architectural intuition behind LazyGraphRAG mirrors a classical database principle: materialized views (pre-computed query results) improve read speed but cost storage and maintenance. LazyGraphRAG trades the materialized-view approach (pre-computed summaries) for on-demand computation — the right tradeoff when query patterns are unpredictable and indexing cost is the primary constraint.
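
The difference between the two designs reduces to a few lines of structure. This is a sketch only, with a placeholder summarize function standing in for the LLM call:

def summarize(community):
    # Placeholder for an LLM call over the community's entities and relationships
    return f"LLM summary of {sorted(community)}"

# Full GraphRAG: pay for a summary of every community at index time.
def index_eager(communities):
    return {frozenset(c): summarize(c) for c in communities}

# LazyGraphRAG: store structure only; summarize on demand, caching so repeated
# queries over the same communities don't pay twice.
class LazySummaries:
    def __init__(self, communities):
        self.communities = [frozenset(c) for c in communities]
        self.cache = {}

    def get(self, community):
        if community not in self.cache:
            self.cache[community] = summarize(community)   # LLM call happens at query time
        return self.cache[community]

communities = [{"OpenAI", "GPT-4", "RLHF"}, {"Anthropic", "Claude", "Constitutional AI"}]
lazy = LazySummaries(communities)
print(lazy.get(lazy.communities[0]))   # only the community the query needs gets summarized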

With the graph structure built and communities established, query execution in Graph RAG takes one of two primary modes depending on the question type. Understanding these modes concretely is essential for building a working system.

Local Search: Entity Neighborhood Traversal

Local search handles questions anchored to specific entities or relationships: "What regulatory bodies oversee OpenAI?" or "How does the RLHF technique relate to Constitutional AI?"

The execution flow:

Query: "What is the relationship between RLHF and Constitutional AI?"
      │
      ▼
  Entity linking: identify 'RLHF', 'Constitutional AI' in graph
      │
      ▼
  Retrieve neighborhood: 1-2 hop neighbors of both nodes
      │
      ▼
  Follow provenance edges to source chunks
      │
      ▼
  Assemble context from graph structure + source text
      │
      ▼
  LLM generates answer grounded in assembled context

The graph structure here does real work: it surfaces the path between entities across documents that may never mention both in the same chunk. A pure vector search might retrieve chunks about RLHF and chunks about Constitutional AI but miss the documents that explicitly compare them — because those comparison documents might not rank highly in embedding similarity for either term alone.
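
A minimal sketch of that flow, reusing the kind of provenance-carrying property graph built earlier. Entity linking here is naive substring matching, and chunk_store is a hypothetical chunk-id-to-text lookup; a real system would use a proper entity linker and its document store.

import networkx as nx

def local_search_context(graph, chunk_store, query, hops=2):
    # 1. Entity linking: find graph nodes mentioned in the query (naive substring match)
    linked = [n for n in graph.nodes if n.lower() in query.lower()]
    # 2. Neighborhood retrieval: k-hop neighborhood around each linked entity
    neighborhood = nx.MultiDiGraph()
    for entity in linked:
        ego = nx.ego_graph(graph.to_undirected(), entity, radius=hops)
        neighborhood.add_edges_from(graph.subgraph(ego.nodes).edges(data=True))
    # 3. Provenance: pull the source chunks behind every edge in the neighborhood
    chunk_ids = {d["chunk_id"] for _, _, d in neighborhood.edges(data=True)}
    sources = [chunk_store[cid] for cid in sorted(chunk_ids)]
    # 4. Assemble the context the LLM grounds its answer in: graph structure + source text
    triples = [f"{u} -[{d['rel']}]-> {v}" for u, v, d in neighborhood.edges(data=True)]
    return "\n".join(triples + sources)

g = nx.MultiDiGraph()
g.add_edge("RLHF", "Constitutional AI", rel="compared_with", chunk_id=7)
g.add_edge("Anthropic", "Constitutional AI", rel="developed", chunk_id=3)
chunk_store = {3: "Anthropic introduced Constitutional AI as ...",
               7: "The survey compares RLHF with Constitutional AI on ..."}
print(local_search_context(g, chunk_store,
      "What is the relationship between RLHF and Constitutional AI?"))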

Global Search: Community Summary Aggregation

Global search handles synthesis questions across the entire corpus: "What are the major safety approaches described across these papers?" or "Summarize the competitive landscape across all these analyst reports."

The execution flow:

Query: "What are the dominant themes in AI safety research across this corpus?"
      │
      ▼
  Identify relevant community level (typically coarser resolution)
      │
      ▼
  [LazyGraphRAG]: generate summaries for relevant communities on demand
  [Full GraphRAG]: retrieve pre-computed summaries
      │
      ▼
  Map phase: each community summary generates a partial answer
      │
      ▼
  Reduce phase: LLM aggregates partial answers into final response
      │
      ▼
  Final synthesized answer

The map-reduce structure is deliberate. Because no single LLM context window can hold the entire corpus, the system distributes the synthesis work across community summaries and then aggregates. This is fundamentally different from simply stuffing more documents into a long context — the graph's community structure has already organized information thematically before the LLM ever sees it.
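
A sketch of the map-reduce pattern, with a hypothetical llm callable standing in for the generative model. In LazyGraphRAG the community summaries themselves would also be generated on demand rather than read from a precomputed store.

def map_partial_answer(llm, query, community_summary):
    return llm(f"Community summary:\n{community_summary}\n\n"
               f"Extract anything relevant to: {query}")

def reduce_answers(llm, query, partials):
    joined = "\n---\n".join(p for p in partials if p.strip())
    return llm(f"Combine these partial findings into one answer to '{query}':\n{joined}")

def global_search(llm, query, community_summaries):
    # Map phase: one focused pass per community summary (parallelizable)
    partials = [map_partial_answer(llm, query, s) for s in community_summaries]
    # Reduce phase: aggregate partial answers into a single synthesis
    return reduce_answers(llm, query, partials)

# Stub "LLM" so the sketch runs end to end; a real call would go to your model.
echo_llm = lambda prompt: prompt[:80]
print(global_search(echo_llm, "Dominant themes in AI safety?",
                    ["summary of community A", "summary of community B"]))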

💡 Real-World Example: Consider a team building a research assistant over thousands of clinical trial documents. A question like "What were the adverse events reported for drug X in the Phase 2 trials?" routes to local search — it's entity-anchored and relationship-specific. A question like "What patterns characterize trials that failed to reach primary endpoints?" routes to global search — it requires synthesizing signals across the entire corpus that no individual document contains. These two questions have meaningfully different retrieval needs, and treating them the same (as pure vector RAG must) leaves accuracy on the table.

The Query Classifier in Hybrid Systems

In production, a query classifier sits upstream of both retrieval modes and routes each incoming query. Simple factual lookups often go to vector RAG directly — no graph traversal needed. Multi-hop relationship questions route to local graph search. Corpus-wide synthesis questions route to global search via LazyGraphRAG. This hybrid routing is covered in depth in Section 3; the point here is that the two Graph RAG query modes aren't competing alternatives — they're complementary, designed for different scopes of question.

⚠️ Common Mistake: Defaulting all Graph RAG queries to global search because it feels more comprehensive. Global search is expensive (even with LazyGraphRAG) and returns broad synthesis, which is wrong for specific lookups. A question about one entity's relationships doesn't need corpus-wide community aggregation — routing it to global search will produce a less precise answer at higher cost.

Putting the Pieces Together

The full Graph RAG architecture — from document to answer — chains these stages in a way that each step sets up the next:

┌──────────────────────────────────────────────────────────────┐
│                      INDEX TIME                              │
│                                                              │
│  Documents → LLM Extraction → Property Graph                 │
│                                    │                         │
│                              Entity Resolution                │
│                                    │                         │
│                          Leiden Community Detection           │
│                          (hierarchical, multi-level)          │
│                                    │                         │
│              [Full GraphRAG]   [LazyGraphRAG]                 │
│              Pre-generate all   Store structure only          │
│              summaries          (no LLM calls here)           │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                      QUERY TIME                              │
│                                                              │
│  Query → Classifier → Local Search  ──► Entity neighborhoods │
│                     └─► Global Search ──► Community summaries │
│                                          (generated on demand │
│                                           in LazyGraphRAG)    │
│                                    │                         │
│                             Grounded Answer                   │
└──────────────────────────────────────────────────────────────┘

The architecture's power comes from the fact that community detection at index time does intellectual work — organizing a corpus thematically — that pure embedding-based retrieval doesn't do. This is what allows global queries to be answered by synthesizing community summaries rather than re-reading everything. And it's why, as benchmarks have shown, graph structure at index time can matter more for synthesis quality than simply expanding context window size at query time. Long context windows give the model more to read; community structure gives the model pre-organized themes to reason over. For synthesis-heavy questions, organization wins.

📋 Quick Reference Card: Graph RAG Pipeline Summary

Stage → what happens → output:

  • 🔧 LLM Extraction: the LLM reads chunks and pulls entities and relationships → raw entity/relation list
  • 🔧 Entity Resolution: deduplicates nodes referring to the same entity → clean property graph
  • 📚 Community Detection: the Leiden algorithm clusters nodes hierarchically → multi-level community structure
  • 🎯 Summary Generation: full GraphRAG at index time, LazyGraphRAG at query time → community summaries
  • 🔧 Local Search: traverses entity neighborhoods plus provenance → grounded specific answer
  • 🔧 Global Search: aggregates community summaries via map-reduce → synthesized broad answer

This architecture is the foundation for understanding both the production hybrid routing patterns in Section 3 and the operational failure modes in Section 4. The mechanics here — especially provenance tracking, entity resolution quality, and the distinction between local and global query modes — are exactly what breaks down in real systems under non-ideal conditions.

Production Architecture: Hybrid Routing and the LazyGraphRAG Default

The question facing most teams in 2026 is not "should I use Graph RAG?" but "how do I wire Graph RAG into a system that also does everything else well?" Pure Graph RAG deployments — where every query goes through full community-detection-backed graph traversal — have largely disappeared from production, replaced by a routing architecture that treats different retrieval strategies as specialized tools for different query shapes. Understanding that architecture, and the tooling that implements it, is the practical center of Graph RAG work today.

The Routing Architecture: Why Pure Graph RAG Lost

Full Graph RAG (the original Microsoft implementation) builds rich community summaries at index time, which makes it excellent at answering broad, synthesis-style questions but expensive to run and overkill for simple lookups. Vector RAG, conversely, handles direct factual retrieval cheaply and accurately but falls apart on multi-hop reasoning. Running either system alone means either overpaying for simple queries or failing on complex ones.

The production answer is hybrid routing: a query classifier that sits in front of multiple retrieval backends and dispatches each query to the path best suited to answer it. This is not a theoretical best-practice — it is the pattern that has emerged because teams repeatedly discovered that routing reduces cost without sacrificing quality on the query types that matter.

🎯 Key Principle: Route by query shape, not by default. The retrieval strategy should match the structural complexity of what's being asked, not the sophistication of the system building it.

The three main routing paths cover most production traffic:

Incoming Query
      │
      ▼
┌─────────────────────┐
│   Query Classifier  │
│  (LLM or rule-based)│
└─────────────────────┘
      │
      ├──────────────────────────────────────────┐
      │                                          │
      ▼                                          │
 Is it a simple factual                          │
 lookup or keyword match?                        │
      │                                          │
     YES                                        NO
      │                                          │
      ▼                                          ▼
┌──────────────┐              Is it a multi-hop relationship query?
│  Vector RAG  │                         │
│ (fast, cheap)│             ┌───────────┴────────────┐
└──────────────┘            YES                       NO
                             │                         │
                             ▼                         ▼
                   ┌──────────────────┐   ┌───────────────────────┐
                   │  GraphRAG Local  │   │ LazyGraphRAG Global   │
                   │  Search          │   │ Search (synthesis /   │
                   │ (entity-centric) │   │ corpus-wide queries)  │
                   └──────────────────┘   └───────────────────────┘

The three paths map to distinct query types:

  • 🔍 Vector RAG handles direct lookups: "What does the policy say about overtime?" The answer lives in one or two chunks, and semantic similarity retrieval finds it efficiently.
  • 🕸️ GraphRAG local search handles entity-centric multi-hop queries: "How are the FDA approval criteria connected to the drug interaction warnings mentioned across these trial reports?" The answer requires traversing relationships between named entities across documents.
  • 🌐 LazyGraphRAG global search handles synthesis queries: "What are the recurring themes in regulatory risk across this document corpus?" No single passage answers this — it requires aggregating signals across the full graph.

LazyGraphRAG: The Default Starting Point

LazyGraphRAG is a reformulation of the original Graph RAG pipeline that defers LLM-powered summarization from index time to query time. In the original approach, community summaries were generated eagerly during indexing — every community in the graph received an LLM-generated description, which accumulated cost proportionally to corpus size. LazyGraphRAG instead builds only the lightweight graph structure at index time (entity extraction and relationship edges) and generates summaries on demand, covering only the communities relevant to a specific query.

The practical consequence is dramatic: indexing costs fall to roughly the same order of magnitude as vector RAG, while global query quality — according to Microsoft's BenchmarkQED benchmark suite — matches or exceeds full GraphRAG. The same benchmark found LazyGraphRAG outperforming vector RAG with a very large context window using the same underlying generative model, which is the result that has driven 2026 architecture decisions most forcefully.

🤔 Did you know? The BenchmarkQED result suggests that graph structure captured at index time contributes more to answer quality on synthesis queries than simply supplying more raw text at inference time. A model given a well-structured graph of relationships outperforms the same model given a longer, unstructured document dump — which means the organization of knowledge matters as much as the volume of it.

This is a concrete and somewhat counterintuitive finding. It implies that teams spending effort on longer context windows to handle corpus-wide queries may be solving the wrong problem for that query type.

Production deployments migrating from full GraphRAG to LazyGraphRAG have reported cost reductions in the range of 70–97%, with answer quality on evaluated query sets matching or improving. The variance is wide because it depends heavily on corpus size and query distribution — teams with larger corpora and more synthesis-heavy query mixes see larger savings.

⚠️ Common Mistake: Assuming LazyGraphRAG is always cheaper at query time too. The cost transfer from index to query time means that if your system receives many global synthesis queries against a large corpus, per-query costs can accumulate. Model the expected query distribution before committing to architecture.

The Query Classifier: Rules, Embeddings, or LLM?

The classifier is what makes hybrid routing work, and it is worth thinking carefully about its implementation because a poor classifier wastes the entire routing investment.

Three implementation patterns exist at different points on the cost-accuracy tradeoff:

Rule-based classifiers use heuristic signals: query length, presence of relational keywords ("how does X relate to Y", "what connects", "across all documents"), or named entity density. They are fast and cheap but brittle — edge cases accumulate, and maintaining rules as query patterns evolve becomes a maintenance burden.

Embedding-based classifiers train a lightweight classifier on labeled query examples, using embeddings as features. This generalizes better than rules and adds minimal latency (~1–5ms with a small model). The main cost is labeling a representative training set, which requires examples of each query type from your actual user population.

LLM-based classifiers prompt a fast, small language model to categorize the query before routing. This handles nuanced queries well but adds latency (typically 200–500ms) and a non-trivial per-query cost. For high-volume applications, this is often prohibitive.

💡 Pro Tip: Start with a rule-based classifier to get to production, instrument every routing decision with the query text and which path was taken, and use that log data to train an embedding classifier after two to four weeks of real traffic. You'll have labeled data from the real query distribution rather than synthetic examples.

A common production pattern uses a two-stage classifier: a fast rule pass that handles obvious cases (queries with explicit multi-hop markers go straight to GraphRAG; single-entity factual queries go to vector RAG), with an LLM call only for the ambiguous middle tier. This keeps LLM classifier costs low while preserving quality on hard cases.
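
A sketch of that two-stage pattern follows. The keyword lists are illustrative rather than tuned, and llm_classify is a hypothetical wrapper around whatever small model handles the ambiguous tier.

import re

MULTIHOP_MARKERS = re.compile(r"\b(relate[sd]? to|connected to|depends? on|linked to)\b", re.I)
SYNTHESIS_MARKERS = re.compile(r"\b(themes?|landscape|across all|patterns? across)\b", re.I)

def rule_pass(query):
    if SYNTHESIS_MARKERS.search(query):
        return "global"
    if MULTIHOP_MARKERS.search(query):
        return "local"
    if len(query.split()) <= 12 and "?" in query:
        return "vector"          # short, direct factual question
    return None                  # ambiguous: defer to the LLM stage

def classify(query, llm_classify):
    route = rule_pass(query)
    if route is None:
        route = llm_classify(query)   # hypothetical small-model call returning a route label
    return route

print(classify("What is the refund policy?", llm_classify=None))                       # vector
print(classify("Which regulations relate to vendor X's obligations?", llm_classify=None))  # local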

The Production Tooling Stack

The standard stack combines three layers: a graph database, an indexing framework, and a control flow framework.

Graph database layer: Microsoft's GraphRAG open-source library manages graph storage and query execution and supports LazyGraphRAG mode natively. Teams that want managed hosted graph storage with vector hybrid capabilities typically reach for Neo4j (which offers both graph traversal and vector similarity in the same store) or FalkorDB (optimized for low-latency graph queries with vector support). The choice between self-managed and hosted depends on operational preference rather than capability — both support the retrieval patterns described here.

Indexing layer: LlamaIndex has become the standard choice for document ingestion, chunking, entity extraction, and graph index construction. Its graph index abstractions handle the entity resolution and relationship extraction pipeline, and it integrates with both Neo4j and FalkorDB as storage backends. Critically, LlamaIndex's chunking and entity extraction happen at index time, so the quality of entity resolution at this stage propagates forward into query quality — poor entity extraction here creates ambiguity problems that routing cannot fix later.

Control flow layer: LangGraph manages the routing logic as a stateful directed graph with conditional edges. Each node in the LangGraph application represents a step in the pipeline (classify query, execute vector retrieval, execute graph retrieval, synthesize answer), and conditional edges implement the routing logic. The stateful nature of LangGraph means mid-query context (partial results, retrieved entities, graph subgraphs) persists across steps and can inform later decisions in the same query — for example, an initial vector retrieval can identify candidate entities that a subsequent GraphRAG local search then expands.

LangGraph Control Flow

[classify_query]
      │
      ├── "vector" ──► [vector_retrieve] ──► [synthesize]
      │
      ├── "local" ──► [extract_entities] ──► [graph_local_retrieve] ──► [synthesize]
      │
      └── "global" ──► [lazy_community_select] ──► [graph_global_retrieve] ──► [synthesize]
                                                            │
                                                            ▼
                                              (state carries retrieved
                                               graph subgraph forward)

This architecture is a useful simplification — real production graphs often include additional nodes for reranking, fallback handling, and answer validation, but the three-path routing structure shown here covers the core logic. (Edge cases like query decomposition for compound questions add complexity not shown here.)
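
For concreteness, here is a minimal LangGraph sketch of the three-path routing graph, with the classifier, retrieval steps, and synthesis stubbed out. In a real system those nodes would call your classifier, your vector store, the GraphRAG library, and an LLM respectively.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict, total=False):
    query: str
    route: str
    context: str
    answer: str

def classify_query(state: RAGState) -> RAGState:
    q = state["query"].lower()
    if "across" in q or "themes" in q:
        return {"route": "global"}
    if "relate" in q or "connected" in q:
        return {"route": "local"}
    return {"route": "vector"}

def vector_retrieve(state):       return {"context": "[vector chunks]"}
def graph_local_retrieve(state):  return {"context": "[entity neighborhood + provenance]"}
def graph_global_retrieve(state): return {"context": "[on-demand community summaries]"}
def synthesize(state):            return {"answer": f"Answer grounded in {state['context']}"}

builder = StateGraph(RAGState)
for name, fn in [("classify_query", classify_query), ("vector_retrieve", vector_retrieve),
                 ("graph_local_retrieve", graph_local_retrieve),
                 ("graph_global_retrieve", graph_global_retrieve), ("synthesize", synthesize)]:
    builder.add_node(name, fn)

builder.set_entry_point("classify_query")
builder.add_conditional_edges("classify_query", lambda s: s["route"],
                              {"vector": "vector_retrieve",
                               "local": "graph_local_retrieve",
                               "global": "graph_global_retrieve"})
for node in ["vector_retrieve", "graph_local_retrieve", "graph_global_retrieve"]:
    builder.add_edge(node, "synthesize")
builder.add_edge("synthesize", END)

app = builder.compile()
print(app.invoke({"query": "Which regulations relate to vendor X?"})["answer"])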

Sizing and Deployment Considerations

Before wiring up the full stack, two questions determine whether the architecture is proportionate to the problem.

Query distribution matters more than corpus size. A 10-million-document corpus where 95% of queries are simple factual lookups should be predominantly vector RAG with a GraphRAG path that rarely activates. The routing layer is not free — maintaining two or three retrieval indices, synchronizing them, and paying for a classifier on every query has overhead. If your query analysis shows fewer than 10–15% of queries actually need graph traversal, the complexity cost may outweigh the quality gain on that minority.

Index synchronization is the operational risk. When documents update, the vector index and the graph index can diverge. The vector index update is incremental and fast; the graph index update may require re-extracting entities and re-running community detection on affected subgraphs. LazyGraphRAG reduces this pain significantly because community summaries are not pre-computed — only the structural graph needs updating — but entity resolution changes on document update can still ripple through the graph in non-obvious ways. (This failure mode is covered in depth in the next section.)

Wrong thinking: "I'll index everything into GraphRAG and let the router figure out what to use."

Correct thinking: "I'll index into vector RAG first, profile my query types, and add GraphRAG indices only for the document subsets and query patterns that demonstrably need them."

The practical 2026 default is: start with vector RAG for the full corpus, add a LazyGraphRAG index over the subset of documents that contain the dense entity relationships driving your complex queries, and route to it only for multi-hop and synthesis query types. This is not the most architecturally elegant solution, but it is the one that teams have found easiest to operate and evolve without accumulating runaway indexing costs.

💡 Real-World Example: A team building a regulatory compliance assistant might index all 50,000 policy documents into a vector store for quick clause lookups, but build a LazyGraphRAG index only over the 800 cross-referenced regulatory frameworks where multi-hop questions about interdependencies are common. The routing classifier sends "what does section 4.2 say about X" to vector RAG and "which regulations conflict with each other on topic Y" to LazyGraphRAG — and the vast majority of day-to-day traffic hits the cheaper path.

📋 Quick Reference Card: Routing Decision Summary

Query type → route → why:

  • 🔍 Direct lookup ("What is the refund policy?") → Vector RAG: a single passage answers it
  • 🕸️ Multi-hop relational ("How does regulation A constrain entity B's obligations under contract C?") → GraphRAG local search: requires entity traversal
  • 🌐 Corpus-wide synthesis ("What are the dominant risk themes across all filings?") → LazyGraphRAG global search: no single passage; needs aggregation
  • 🔄 Compound, multi-type ("Summarize how X is defined, and how that definition affects Y across documents") → decompose, then route each sub-query: mixed structural needs

The architecture described in this section handles most production Graph RAG requirements — but it surfaces several failure modes that require separate attention: entity disambiguation errors that corrupt the graph, synchronization gaps when documents update, and the temptation to over-apply GraphRAG to queries that don't need it. Those are the subjects of the next section.

What Goes Wrong: Entity Disambiguation, Sync, and Scope Creep

Graph RAG systems fail in ways that are qualitatively different from vector RAG failures. A poorly tuned embedding model still returns something — usually a plausible-looking chunk that at least addresses the right topic. A corrupted knowledge graph returns confident, fluent answers assembled from structurally broken relationships. The failure is silent, and that silence is the real hazard. This section maps the five most common failure modes in enough concrete detail that you can recognize and prevent them before they reach production.

Failure Mode 1: Entity Disambiguation — The Silent Graph Corruptor

Entity disambiguation is the process of deciding whether two textual mentions refer to the same real-world entity. During graph construction, an entity extraction step identifies named entities (people, organizations, products, concepts) and the relationships between them. The problem is that language is ambiguous, and extractors see text, not reality.

Consider a corpus that discusses both Apple Inc. and apple orchards. Without explicit disambiguation logic, a naive extractor will create a single node called Apple and route all associated edges — Apple acquired Beats, Apple grows best in cool climates, Apple released a new chip — into the same node. When a query asks about Apple's supply chain, graph traversal may return information about rootstock varieties. There is no error message. The traversal succeeds. The answer looks authoritative.

CORRUPTED GRAPH (no disambiguation)

         [Apple]
        /   |   \
  acquired  grows  released
      |      |       |
   [Beats] [cool  [new chip]
           climate]

QUERY: "Apple supplier relationships"
TRAVERSAL: hits [Apple] → follows all edges → returns agriculture data
RESULT: Confidently wrong, no error signal

CORRECT GRAPH (after disambiguation)

 [Apple Inc.]          [Apple (fruit)]
     |                      |
  acquired              grows best in
     |                      |
  [Beats]           [cool climates]
     |
  released
     |
  [new chip]

Ambiguity comes in several forms: surface ambiguity (same string, different entities, like Apple or Jaguar), reference ambiguity (different strings, same entity, like Microsoft, MSFT, and the Redmond company), and temporal ambiguity (the same entity at different points in time, like a CEO who changed roles). Entity extraction models handle surface ambiguity reasonably well when context is rich, but reference ambiguity — merging MSFT and Microsoft — requires a separate co-reference resolution step that many default pipelines omit.

⚠️ Common Mistake: Assuming the LLM-based entity extractor will naturally resolve co-references across documents. Within a single document, context usually helps. Across documents, where mentions appear pages apart, co-reference resolution fails without explicit entity linking against a canonical entity store or knowledge base.

💡 Pro Tip: Before running a full indexing pass, extract a sample of entities from a representative document slice and manually inspect the resulting node list. Look specifically for (a) duplicate nodes that should be merged and (b) merged nodes that should be split. This spot-check costs an hour and can prevent weeks of debugging a corrupted graph.

The practical mitigation is a canonical entity store — a lookup table that maps surface forms to canonical identifiers before edges are written. For domain-specific corpora (legal, biomedical, financial), established ontologies often exist and should be used. For general corpora, a first-pass extraction followed by a deduplication step using embedding similarity clustering is a reasonable heuristic, though it introduces its own error surface where genuinely distinct entities may be collapsed.
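
A sketch of the canonical-store idea: map surface forms to canonical names before edges are written, then flag remaining near-duplicates for human review. String similarity from difflib stands in here for the embedding-similarity clustering described above, and the lookup table entries are illustrative.

from difflib import SequenceMatcher

CANONICAL = {
    "openai": "OpenAI", "open ai": "OpenAI", "openai inc.": "OpenAI",
    "msft": "Microsoft", "microsoft corp": "Microsoft",
}

def canonicalize(surface_form):
    # Look up the surface form; fall back to the cleaned original if unknown
    return CANONICAL.get(surface_form.strip().lower(), surface_form.strip())

def near_duplicates(entities, threshold=0.85):
    """Pairs of canonicalized entities that look suspiciously similar; review by hand."""
    names = sorted({canonicalize(e) for e in entities})
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

entities = ["OpenAI", "Open AI", "OpenAI Inc.", "Apple Inc.", "Apple (fruit)", "MSFT"]
print([canonicalize(e) for e in entities])
print(near_duplicates(entities))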

Failure Mode 2: Multi-Index Synchronization on Updating Corpora

Graph RAG's index is not a flat list of embeddings — it is a structured artifact with nodes, edges, and community summaries that are interdependent. When a source document changes, the cascade of required updates is substantially more complex than updating a vector store.

Multi-index synchronization refers to keeping the knowledge graph consistent with the underlying document corpus as documents are added, edited, or removed. A concrete example illustrates the cascade:

DOCUMENT UPDATE CASCADE

Step 1: Document D is edited
         ↓
Step 2: Re-extract entities and relationships from D
         ↓
Step 3: Diff new entity set against existing graph nodes
         — New entities: add nodes
         — Removed entities: delete nodes + all incident edges
         — Changed entities: update properties
         ↓
Step 4: Merge changed edges into graph
         ↓
Step 5: Rerun community detection (changed edges may reassign clusters)
         ↓
Step 6: Recompute community summaries for affected clusters
         ↓
Step 7: Update vector index for changed summaries

Steps 5 and 6 are the expensive part. Community detection algorithms like Leiden or Louvain operate on the global graph — a change to one cluster can propagate reassignments to adjacent clusters, requiring fresh summarization across a wide neighborhood. On a large corpus with frequent document updates, full re-indexing on every change is operationally impractical.

LazyGraphRAG largely sidesteps this problem by design: because it defers LLM-based summarization to query time, the indexing step only constructs lightweight graph structure. An update requires re-extracting entities from the changed document and merging the structural changes — expensive LLM summarization is never precomputed and therefore never needs invalidation. This is the primary operational reason LazyGraphRAG has become the default starting point for new deployments handling non-static corpora.

⚠️ Common Mistake: Treating the graph index like a vector store and assuming individual document updates are cheap. A vector store update is O(1) per document — swap the embedding, done. A full GraphRAG index update is not bounded by the size of the changed document; it is bounded by the size of the affected graph neighborhood, which can be large.

For teams committed to full GraphRAG (not LazyGraphRAG), the practical mitigation is incremental community recomputation — tracking which community clusters contain nodes derived from a changed document, and re-summarizing only those clusters. This requires building infrastructure to maintain document-to-node provenance mappings, which most default pipelines do not provide out of the box.
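
The bookkeeping at the heart of incremental recomputation is small, even though the surrounding pipeline is not. A sketch, assuming you maintain a node-to-source-document provenance map alongside the community assignments:

def affected_communities(doc_id, node_provenance, communities):
    """node_provenance: node -> set of doc_ids it was extracted from.
    communities: community_id -> set of nodes.
    Returns the community ids whose summaries need regenerating after doc_id changes."""
    dirty_nodes = {n for n, docs in node_provenance.items() if doc_id in docs}
    return {cid for cid, nodes in communities.items() if nodes & dirty_nodes}

node_provenance = {"Vendor X": {"docA"}, "Regulation 17-B": {"docA", "docB"},
                   "Supplier Corp A": {"docB"}}
communities = {0: {"Vendor X", "Regulation 17-B"}, 1: {"Supplier Corp A"}}

print(affected_communities("docA", node_provenance, communities))  # {0}: only this cluster needs re-summarizing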

🎯 Key Principle: Match your indexing strategy to your corpus update frequency. Static or slow-changing corpora (annual regulatory filings, historical archives) tolerate full GraphRAG. Frequently updated corpora (news feeds, internal wikis, product documentation) should default to LazyGraphRAG until incremental update tooling matures.

Failure Mode 3: Over-Indexing Scope — Building a Graph Where You Don't Need One

Graph RAG only helps for a specific class of queries: those requiring multi-hop reasoning — chaining facts across multiple entities and relationships that do not appear together in any single document. For queries that are answered by a single document passage, it adds cost and complexity with no quality benefit.

The scope creep failure looks like this: a team builds a successful proof-of-concept on a dense relational corpus (say, pharmaceutical clinical trial networks). Encouraged, they expand the indexing pipeline to cover the entire knowledge base, which also includes product FAQs, onboarding guides, and customer support transcripts. The graph grows substantially. Query latency increases. Costs climb. But the queries hitting the FAQ and onboarding content were never multi-hop — they were simple factual lookups. Graph traversal on those documents is pure overhead.

QUERY TYPE vs. RETRIEVAL STRATEGY

"What is the refund policy?"          → Single-document lookup
                                          VECTOR RAG: fast, cheap ✅
                                          GRAPH RAG: overkill ❌

"Which compounds interact with both   → Multi-hop: compound → interactions
 drug X and target Y?"                   → shared targets → Y
                                          VECTOR RAG: misses links ❌
                                          GRAPH RAG: necessary ✅

"Summarize all risk factors mentioned  → Global synthesis across cluster
across our trial portfolio"               VECTOR RAG: misses cross-doc links ❌
                                          GRAPH RAG (global): necessary ✅

The fix is to apply graph indexing selectively, not uniformly. Before indexing a content category, answer: does meaningful relational structure exist across documents in this category that single-document retrieval would miss? If yes, graph indexing adds value. If no, treat it as a vector RAG partition and route accordingly.

💡 Real-World Example: A team indexing a regulatory compliance corpus found that the bulk of user queries fell into two buckets: clause lookups ("What does section 4.2 say about data retention?") and cross-regulation synthesis ("Which regulations create conflicting obligations for data processors in the EU and US?"). Only the second bucket benefited from GraphRAG. Routing the first bucket to vector RAG cut indexing costs meaningfully while maintaining answer quality for both query types.

🧠 Mnemonic: RIMS — before graph-indexing a content category, ask: does it have Relationships that span documents, Implied connections that are never explicit in one place, Multi-entity reasoning requirements, or Synthesis needs across the corpus? If none of the four apply, vector RAG is sufficient. This is a useful heuristic for scoping decisions, not an exhaustive test.

Failure Mode 4: Community Detection Resolution Mismatches

Community detection partitions the knowledge graph into clusters of densely connected nodes, and summaries are generated at each level of the hierarchy. The resolution parameter — how coarse or fine the clusters are — directly determines query quality, and getting it wrong is hard to detect by inspecting the graph visually.

A resolution that is too coarse produces clusters that span many loosely related topics. Global queries asking for synthesis get reasonable answers. But local queries asking about a specific entity or relationship get answers assembled from a community summary that blends many topics together — the specific information is diluted by unrelated context. The answer is plausible but imprecise.

A resolution that is too fine produces clusters that each cover a narrow sub-topic. Local queries are sharp. But global queries asking for cross-domain synthesis have no cluster that bridges the relevant sub-topics — the system either returns an answer from the most relevant single cluster (missing cross-cluster synthesis) or fails to synthesize at all.

RESOLUTION MISMATCH EXAMPLES

TOO COARSE:
Cluster A = [oncology drugs + cardiovascular drugs + diagnostics equipment]
Query: "Which oncology drugs target VEGF?"
Answer: Summary mentions oncology drugs briefly among many topics — imprecise

TOO FINE:
Cluster A1 = [bevacizumab clinical data]
Cluster A2 = [ramucirumab clinical data]
Cluster A3 = [VEGF pathway biology]
(No cluster spans all three)
Query: "Synthesize VEGF-targeting drugs across our trial data"
Answer: Returns one cluster summary, misses cross-cluster synthesis

WELL-TUNED:
Cluster A = [VEGF-targeting drugs] with sub-clusters per drug
Global query → top-level cluster summary ✅
Local query → sub-cluster traversal ✅

The only reliable way to tune community resolution is to evaluate against a representative query set — a sample of actual or realistic queries, split by type (local/global/hybrid), scored against ground truth answers. Visual inspection of the graph shows you community structure but tells you nothing about whether that structure serves your query distribution.

⚠️ Common Mistake: Setting community resolution once during initial setup based on graph aesthetics ("the clusters look about right") and never revisiting it. As the corpus grows and the query distribution becomes clearer, resolution often needs adjustment. Build evaluation against a query set into your indexing pipeline review process.
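
As a concrete illustration, the sketch below sweeps Louvain resolution values on a networkx knowledge graph and scores each partition against a small labelled query set. The proxy scoring (do a query's required entities land in one, reasonably focused community?) and the 10-node size cutoff are stand-ins; a real harness would run retrieval and generation and score answers against ground truth, stratified by query type.

```python
# Sketch: tune community resolution against a query set rather than by eye.
# Assumes a networkx graph whose nodes are canonical entities; the scoring
# below is a crude proxy for a real retrieval + answer-quality evaluation.
import networkx as nx
from networkx.algorithms.community import louvain_communities

def community_index(node, communities):
    """Return the index of the community containing `node`, or None."""
    for i, members in enumerate(communities):
        if node in members:
            return i
    return None

def score_partition(communities, query_set):
    """query_set maps 'local'/'global' to lists of entity tuples each query needs."""
    def served(entities, want_focused):
        ids = {community_index(e, communities) for e in entities}
        if None in ids or len(ids) != 1:
            return 0.0                       # entities scattered: no single cluster bridges them
        if want_focused and len(communities[ids.pop()]) > 10:
            return 0.5                       # present, but diluted by unrelated context
        return 1.0
    local = [served(q, want_focused=True) for q in query_set["local"]]
    global_ = [served(q, want_focused=False) for q in query_set["global"]]
    return sum(local) / len(local), sum(global_) / len(global_)

def tune_resolution(G, query_set, resolutions=(0.25, 0.5, 1.0, 2.0, 4.0)):
    results = []
    for r in resolutions:
        comms = louvain_communities(G, resolution=r, seed=42)
        local, glob = score_partition(comms, query_set)
        results.append({"resolution": r, "n_communities": len(comms),
                        "local": local, "global": glob})
    # Balance both query types instead of optimizing either in isolation.
    return max(results, key=lambda row: min(row["local"], row["global"]))
```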

Failure Mode 5: Multimodal Content Is Not Production-Ready

Knowledge bases rarely contain only clean prose. Tables, figures, diagrams, flowcharts, and embedded images are common, particularly in technical documentation, financial reports, and scientific literature. Current Graph RAG pipelines assume text as the primary extraction target, and multimodal entity extraction — reliably identifying entities and relationships from non-text content — remains largely research-stage.

The practical failure is quiet rather than catastrophic. A pipeline that cannot extract entities from a table simply skips the table or passes it through as raw text. The table's relational structure — which might encode the most precise version of a fact — is not represented as graph edges. Queries that should hit that structure return degraded answers or fall back to prose chunks that are less precise.

Wrong thinking: "The LLM can read tables, so the pipeline handles tables fine."

Correct thinking: "An LLM can answer questions about a table if the table is in context. Graph RAG entity extraction — which happens at index time, not query time — requires converting table structure into named entities and typed relationships. That conversion is a separate, non-trivial step that most production pipelines do not perform reliably."

Images and diagrams present a harder version of the same problem. A circuit diagram or an org chart encodes rich relational information that is invisible to a text-based extraction pipeline. Multimodal LLMs can describe such images, but converting descriptions into graph-compatible entity-relationship triples with sufficient precision for traversal is not solved at production scale.

The practical guidance: explicitly audit your corpus for non-text content before committing to Graph RAG, and document what percentage of your relational information lives in tables or figures. If it is substantial, either preprocess that content into text representations before extraction (table-to-text, figure captioning) with the understanding that conversion introduces its own errors, or scope the graph index to text-extractable content and accept the coverage gap.
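
To illustrate the table-to-text option, the sketch below flattens table rows into explicit statements that an index-time extraction pass can actually read. The column names and sentence templates are hypothetical; each real table schema needs its own templates, and as noted above the conversion can introduce errors of its own.

```python
# Sketch: table-to-text preprocessing so a table's relational structure
# becomes visible to text-based entity extraction. Column names and
# templates are illustrative only.
import csv
import io

def table_rows_to_statements(csv_text: str, subject_col: str, templates: dict) -> list[str]:
    """Turn each row into natural-language statements keyed on a subject column."""
    statements = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row[subject_col]
        for col, template in templates.items():
            if row.get(col):
                statements.append(template.format(subject=subject, value=row[col]))
    return statements

example_table = """drug,target,phase
bevacizumab,VEGF-A,Approved
ramucirumab,VEGFR-2,Approved
"""

print(table_rows_to_statements(
    example_table,
    subject_col="drug",
    templates={
        "target": "{subject} targets {value}.",
        "phase": "{subject} has clinical status {value}.",
    },
))
# ['bevacizumab targets VEGF-A.', 'bevacizumab has clinical status Approved.', ...]
```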

📋 Quick Reference Card: Failure Modes at a Glance

⚠️ Failure Mode | 🔍 Detection Signal | 🔧 Primary Mitigation
🏷️ Entity disambiguation | Confident wrong answers on ambiguous entity names | Canonical entity store + co-reference resolution step
🔄 Multi-index sync | Stale answers after document updates | LazyGraphRAG (defers summarization) or document-to-node provenance tracking
📦 Over-indexing scope | High cost, no quality gain on simple queries | Selective graph indexing; route non-relational content to vector RAG
🔬 Resolution mismatch | Local queries too vague or global queries miss synthesis | Evaluate resolution against a representative query set
🖼️ Multimodal content | Silent gaps where tables or figures hold key facts | Audit corpus; preprocess non-text or accept coverage gap

These failure modes share a common thread: they are all quiet. Unlike a broken API call or a timeout, a knowledge graph with corrupted entity edges, stale community summaries, or missing table content will still return answers — answers that look authoritative and are wrong in ways that are hard to surface without deliberate evaluation. The discipline required is to treat graph index quality as a first-class concern, not an assumed property of running the pipeline.

Key Takeaways and Decision Checklist

By this point in the lesson you have moved from the motivating failure modes of vector RAG, through the mechanics of graph construction and the hybrid routing architectures that define current production practice, to the concrete ways Graph RAG systems break in the field. This final section distills all of that into actionable decision criteria you can use at the whiteboard, in code review, or when evaluating whether a Graph RAG proposal actually makes sense for your situation.

The core shift this lesson should have produced: Graph RAG is not a better version of vector RAG. It is a specialized tool that solves a specific class of retrieval problems — multi-hop reasoning across documents — and adds cost, complexity, and new failure modes everywhere else. Knowing when not to use it is as important as knowing how to use it.


What You Now Understand That You Didn't Before

Before this lesson, a reasonable engineer might have concluded: "Graph RAG is more powerful, so I should default to it for serious applications." That framing is wrong in a specific, costly way.

After this lesson, the correct mental model is:

🎯 Key Principle: Graph RAG improves retrieval accuracy on multi-hop, cross-document reasoning tasks. For single-document lookups, factual recall, and semantic similarity queries, it adds overhead with no quality benefit — and often degrades latency for the majority of queries while improving accuracy on a minority.

The practical implication: a system that blindly routes everything through Graph RAG is slower and more expensive than one that routes by query type, and the quality difference on simple queries is often negative. The teams reporting the largest accuracy gains are the ones who also implemented query classification — they got Graph RAG's upside on the queries where it helps, without paying its cost on the queries where it doesn't.


The Decision Checklist

Work through these gates in order. Stop as soon as you hit a "no."

DECISION GATE 1: Do you have a multi-hop problem?
─────────────────────────────────────────────────────────────────
 Ask: Can the target queries be answered from a single document
 or passage, or do they require chaining facts across two or
 more separate documents?

  ├─ Mostly single-document → STOP. Use vector RAG.
  └─ Genuinely multi-hop   → Continue to Gate 2.

DECISION GATE 2: Is retrieval accuracy currently insufficient?
─────────────────────────────────────────────────────────────────
 Ask: Have you measured retrieval accuracy on multi-hop queries
 with your current vector RAG system? Is it below an acceptable
 threshold for your use case?

  ├─ Not measured yet     → STOP. Measure first.
  ├─ Accuracy is fine     → STOP. Don't add complexity you don't need.
  └─ Accuracy is poor     → Continue to Gate 3.

DECISION GATE 3: Is the data structured enough for entity resolution?
─────────────────────────────────────────────────────────────────
 Ask: Can entities in your documents be reliably disambiguated?
 Do "Acme Corp" in document A and "Acme Corporation" in
 document B refer to the same entity in a detectable way?

  ├─ Entity resolution is unreliable → Address this first,
  │                                    or expect silent quality loss.
  └─ Entity resolution is feasible   → Continue to Gate 4.

DECISION GATE 4: Can you afford ongoing maintenance?
─────────────────────────────────────────────────────────────────
 Ask: Do your documents update frequently? Do you have a
 strategy for re-indexing or incremental graph updates?

  ├─ High-frequency updates with no sync plan → Revisit architecture.
  └─ Updates are manageable                   → Proceed.

OUTCOME: You have a genuine Graph RAG use case.
─────────────────────────────────────────────────────────────────
 Start with LazyGraphRAG, not full GraphRAG.
 Build query classification from day one.
 Evaluate with query-type-stratified benchmarks.

💡 Real-World Example: A compliance team building a regulatory cross-reference tool found that 70–80% of user queries were simple lookups ("What does regulation X say about Y?") and only 20–30% required chaining across documents ("Which regulations conflict on topic Z?"). Routing the first category to vector RAG and only the second to Graph RAG cut infrastructure costs substantially while preserving accuracy on the queries that actually needed graph traversal.


The Five Principles, Stated Plainly

1. Default to Vector RAG; Add Graph RAG for Identified Gaps

Vector RAG is the baseline. It is cheaper to build, cheaper to run, easier to debug, and sufficient for the majority of retrieval tasks. Graph RAG should enter the picture only after you have identified specific query types where vector RAG is failing, and measured that failure. The temptation to reach for Graph RAG preemptively — because the data feels complex or relational — is one of the most common sources of unnecessary cost and complexity.

Wrong thinking: "My data has relationships, so I should use Graph RAG."

Correct thinking: "My users are asking questions that require chaining facts across documents, and my current system answers those questions poorly. Graph RAG may help with that specific problem."

2. Start with LazyGraphRAG, Not Full GraphRAG

When you do have a genuine Graph RAG use case, LazyGraphRAG is the correct starting point. Full GraphRAG performs expensive LLM-based summarization across every community at index time, which made it impractical at scale for many teams. LazyGraphRAG defers that summarization to query time and only runs it for the communities relevant to a given query. The result is indexing costs comparable to vector RAG, while global-query quality matches full GraphRAG.

The practical implication: there is now no good reason to start with full GraphRAG for a new system. If LazyGraphRAG's quality proves insufficient for your use case after measurement, you can graduate to full GraphRAG with a clear, evidence-based justification. Running in the other direction — starting expensive and trying to optimize down — is much harder.

⚠️ Common Mistake 1: Defaulting to full GraphRAG because older tutorials and blog posts described it as the standard approach. The default has shifted; check whether the implementation guide you're following predates the LazyGraphRAG paradigm.

3. Build the Query Classifier Early

Hybrid routing is not an optimization you add later — it's a design decision you make upfront. The query classifier that routes simple lookups to vector RAG, multi-hop queries to Graph RAG, and global synthesis queries to the LazyGraphRAG path is the component that makes the whole architecture cost-effective. Adding it as a retrofit after you've built a single-path system means re-engineering evaluation, latency budgets, and observability.

In practice, the classifier doesn't need to be complex. A well-prompted LLM router that categorizes queries into two or three buckets before retrieval is sufficient for most production systems. The important thing is that it exists from day one, so your evaluation data, cost modeling, and latency targets are built around the reality that different queries follow different paths.

💡 Pro Tip: Instrument your query classifier in production and review its routing decisions regularly. Query distributions shift over time, and a classifier that was well-calibrated at launch may misroute an increasing share of queries as user behavior evolves.
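
A minimal version of such a router is sketched below, assuming the OpenAI Python client (any OpenAI-compatible endpoint works the same way). The model name and the three route labels are illustrative; production routers usually add few-shot examples and log every decision for the review loop described in the tip above.

```python
# Sketch: a prompt-based query router with three buckets.
# Assumes the OpenAI Python client (>=1.0); model and labels are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ROUTES = {"LOOKUP": "vector", "MULTI_HOP": "graph_local", "SYNTHESIS": "graph_global"}

ROUTER_PROMPT = """Classify the user query into exactly one label:
LOOKUP     - answerable from a single document or passage
MULTI_HOP  - requires chaining facts across two or more documents
SYNTHESIS  - requires summarizing themes across many documents
Reply with the label only.

Query: {query}"""

def route(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(query=query)}],
        temperature=0,
    )
    label = resp.choices[0].message.content.strip().upper()
    return ROUTES.get(label, "vector")   # unexpected label -> fall back to the cheap path

print(route("What does section 4.2 say about data retention?"))   # expected: vector
print(route("Which regulations create conflicting obligations for data processors?"))  # expected: a graph path
```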

4. Invest in Entity Resolution Quality Early

Poor entity disambiguation is Graph RAG's silent quality killer. When "Dr. Smith" in one document and "Smith, J." in another are treated as separate nodes, the graph fails to surface the relationship between them. The answer isn't obviously wrong — it's just incomplete in a way that's hard to detect without a carefully constructed multi-hop evaluation set.

This is the failure mode that most commonly surprises teams after launch. Everything looks fine in aggregate metrics; it's only when you run queries specifically designed to require cross-document entity linking that the accuracy gap becomes visible.

🎯 Key Principle: Build your multi-hop evaluation set before you tune entity resolution, not after. You need a benchmark that can detect disambiguation failures in order to measure whether your resolution strategy is working.
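
A minimal sketch of the canonical-entity-store idea, using only stdlib fuzzy matching, is below. The normalization rules and the 0.85 similarity cutoff are arbitrary illustrations; real pipelines layer domain-specific alias lists, NER, and human or LLM review of low-confidence merges on top of something like this.

```python
# Sketch: canonical entity store that merges surface-form variants into one node.
# Normalization rules and the similarity threshold are illustrative only.
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    for suffix in (" corporation", " corp", " inc", " ltd"):
        cleaned = cleaned.removesuffix(suffix)
    return cleaned.strip()

class CanonicalEntityStore:
    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.canonical: dict[str, str] = {}   # normalized surface form -> canonical id

    def resolve(self, mention: str) -> str:
        norm = normalize(mention)
        if norm in self.canonical:                          # exact normalized match
            return self.canonical[norm]
        for existing, canon_id in self.canonical.items():   # fuzzy match
            if SequenceMatcher(None, norm, existing).ratio() >= self.threshold:
                self.canonical[norm] = canon_id
                return canon_id
        self.canonical[norm] = norm                         # new canonical entity
        return norm

store = CanonicalEntityStore()
print(store.resolve("Acme Corp"))          # 'acme'
print(store.resolve("Acme Corporation"))   # 'acme' -> both mentions map to one node
```

The multi-hop evaluation set from the Key Principle above is what tells you whether a store like this is merging the right mentions, and, just as importantly, not merging distinct entities that happen to share a name.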

5. Use Query-Type-Stratified Evaluation

Overall accuracy metrics actively mislead you when Graph RAG is in the picture. If Graph RAG helps on 20% of queries and adds latency to the other 80%, a single aggregate accuracy number can look fine even though you're paying a significant latency cost on the majority of requests. The metric hides the structure of the problem.

Stratified evaluation means measuring accuracy and latency separately for each query type: factual lookup, multi-hop relational, and global synthesis. This is the only way to confirm that your routing decisions are actually working — that the queries you're sending to the graph path are the ones that benefit from it.
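
A minimal stratified evaluation loop is sketched below. The benchmark record fields, the answer function, and the scoring function are placeholders for your own harness; the point is that accuracy and latency are reported per query type rather than as one aggregate number.

```python
# Sketch: stratified evaluation -- report accuracy and latency per query type.
# `answer_fn` and `score_fn` are placeholders for your retrieval stack and
# answer-quality metric (exact match, LLM-as-judge, etc.).
import time
from collections import defaultdict
from statistics import mean, median

def evaluate_stratified(benchmark, answer_fn, score_fn):
    """benchmark: list of {'query', 'query_type', 'ground_truth'} dicts."""
    buckets = defaultdict(lambda: {"scores": [], "latencies": []})
    for item in benchmark:
        start = time.perf_counter()
        answer = answer_fn(item["query"])
        elapsed = time.perf_counter() - start
        bucket = buckets[item["query_type"]]
        bucket["scores"].append(score_fn(answer, item["ground_truth"]))
        bucket["latencies"].append(elapsed)
    return {
        qtype: {
            "n": len(b["scores"]),
            "accuracy": round(mean(b["scores"]), 3),
            "median_latency_s": round(median(b["latencies"]), 3),
        }
        for qtype, b in buckets.items()
    }

# Expected shape of the report:
# {'lookup': {...}, 'multi_hop': {...}, 'global_synthesis': {...}}
```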


Summary Reference Table

📋 Quick Reference Card: When to Use What

🔍 Query Type | 🛠️ Recommended Path | ⚠️ Watch Out For
🔒 Single-document factual lookup | Vector RAG | Over-engineering; graph adds no value here
🔗 Multi-hop cross-document reasoning | LazyGraphRAG or full GraphRAG | Entity disambiguation failures; evaluate with a multi-hop benchmark
🌐 Global synthesis (themes, summaries) | LazyGraphRAG (community traversal) | Cost if using full GraphRAG; LazyGraphRAG is the preferred default
📊 Mixed query corpus | Hybrid router + all paths | Classifier drift; instrument and review routing decisions regularly


What's Still Hard: Honest Limits

This lesson has focused on the cases where Graph RAG works. It's worth being equally explicit about where the current state of the art still struggles, so you can set realistic expectations in your own systems.

Entity disambiguation at scale remains the most persistent unsolved problem. Automated coreference resolution degrades with domain-specific naming conventions, abbreviations, and ambiguous proper nouns. Investing in domain-specific NER and a validation layer around entity merging is almost always worthwhile, but it requires ongoing maintenance.

Streaming and high-frequency document updates are still genuinely difficult. LazyGraphRAG reduces re-indexing cost significantly, but multi-index synchronization — keeping the graph consistent when documents are added, updated, or deleted — requires architectural thought that simple vector RAG doesn't. If your document corpus turns over rapidly, factor this into your architecture decision from the start.

Multimodal content (images, tables, charts) is largely outside what current Graph RAG tooling handles reliably. If your use case requires reasoning across embedded figures or structured tables in addition to prose, expect to handle that with separate extraction pipelines. Treat any claim that a Graph RAG system handles multimodal content natively with scrutiny.

⚠️ Common Mistake 2: Assuming that switching to Graph RAG will automatically surface the relationships in your data. The relationships it finds are only as good as the entity extraction and community detection running underneath. Low-quality extraction produces a noisy graph that can actually hurt accuracy compared to a well-tuned vector baseline.



Practical Next Steps

If you are building or evaluating a retrieval system now, three concrete next steps follow directly from this lesson:

🔧 1. Audit your query distribution before choosing an architecture. Sample 50–100 real or representative queries from your use case and manually classify them: single-document lookup, multi-hop relational, or global synthesis. If fewer than 20–25% are genuinely multi-hop, vector RAG is likely sufficient, and the overhead of Graph RAG is probably not justified.

📚 2. Build a multi-hop evaluation set before you build the graph. Select 20–30 queries that require chaining facts across at least two separate documents, with known ground-truth answers (a minimal schema sketch follows this list). This benchmark is your primary quality signal during development and the only reliable way to detect entity disambiguation failures before they reach production.

🎯 3. If you proceed, start with LazyGraphRAG and the Microsoft GraphRAG open-source library as a reference implementation. It gives you a production-tested starting point with documented query patterns, avoids the full GraphRAG indexing cost, and has an active development community. Once you have a working baseline, stratify your evaluation by query type and let the data tell you whether the graph path is delivering the accuracy improvement you need.
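
For step 2, the sketch below shows what a multi-hop evaluation entry can look like. Field names are illustrative; the essential properties are a ground-truth answer and an explicit list of the distinct documents the answer must chain across, so that disambiguation failures show up as missed hops instead of passing silently.

```python
# Sketch: structure of a multi-hop evaluation set. Field names and the example
# entry are illustrative; aim for 20-30 entries covering the entity names
# most likely to be ambiguous in your corpus.
multi_hop_eval_set = [
    {
        "query": "Which regulations create conflicting obligations for data processors?",
        "ground_truth": "Regulation A section 3 requires X, while Regulation B section 7 prohibits it ...",
        "required_docs": ["regulation_a.pdf", "regulation_b.pdf"],  # must span >= 2 documents
        "query_type": "multi_hop",
    },
]

def is_valid_multi_hop(item: dict) -> bool:
    """Sanity check: a multi-hop item must chain across at least two distinct documents."""
    return len(set(item["required_docs"])) >= 2

assert all(is_valid_multi_hop(item) for item in multi_hop_eval_set)
```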

🧠 Mnemonic: "Route first, graph second, measure always." Query classification before graph traversal, LazyGraphRAG before full GraphRAG, stratified metrics before conclusions.


The Headline Lesson

Graph RAG became practically viable not because it became simpler, but because the cost barrier dropped enough to make the tradeoff calculable. The architecture decisions that work in production — hybrid routing, LazyGraphRAG as default, stratified evaluation — are all responses to a single underlying truth: graph structure helps on a specific class of queries, and the goal is to apply it precisely there, not everywhere.

The teams getting the most value from Graph RAG are not the ones who committed to it most fully. They are the ones who were most disciplined about when not to use it.

⚠️ Final critical point to remember: Graph RAG's quality advantage on multi-hop queries is real and documented. But that advantage only appears in evaluation if your benchmark actually tests multi-hop queries. A benchmark composed primarily of simple lookups will show Graph RAG as slower and more expensive than vector RAG — because for that query type, it is. Build the benchmark that matches your actual use case, and let it drive the architecture decision.