
RAG Architecture & Implementation

Build Retrieval-Augmented Generation systems that ground LLM outputs in retrieved facts to reduce hallucinations.


Master Retrieval-Augmented Generation (RAG) architecture with free flashcards and hands-on implementation guidance. This lesson covers RAG system design, vector databases, embedding strategies, retrieval mechanisms, and prompt engineering: essential skills for building modern AI search applications that combine large language models with external knowledge bases.

Welcome 🎉

Retrieval-Augmented Generation represents a paradigm shift in how we build AI applications. Rather than relying solely on a language model's parametric knowledge (what it learned during training), RAG systems dynamically retrieve relevant information from external sources and incorporate it into the generation process. This approach dramatically reduces hallucinations, enables access to up-to-date information, and allows AI systems to work with proprietary or domain-specific data.

In this comprehensive lesson, you'll learn how to architect and implement production-grade RAG systems from the ground up. We'll explore the complete pipeline, from document ingestion and chunking strategies to vector search optimization and context-aware generation. Whether you're building a chatbot for customer support, an internal knowledge assistant, or a research tool, understanding RAG architecture is crucial for creating reliable, accurate AI applications in 2026.

Core Concepts 🧠

What is RAG?

Retrieval-Augmented Generation (RAG) is an architectural pattern that enhances large language models by retrieving relevant information from external knowledge sources before generating a response. Think of it as giving your AI a reference library it can consult before answering questions.

TRADITIONAL LLM vs RAG

┌─────────────────────┐       ┌─────────────────────┐
│  Traditional LLM    │       │    RAG System       │
├─────────────────────┤       ├─────────────────────┤
│                     │       │                     │
│  User Query         │       │  User Query         │
│       ↓             │       │       ↓             │
│   LLM Only          │       │  Retrieval Step     │
│       ↓             │       │       ↓             │
│  Response           │       │  Relevant Docs      │
│                     │       │       ↓             │
│  ⚠️ Limited to      │       │  LLM + Context      │
│  training data      │       │       ↓             │
│                     │       │  Response           │
│                     │       │                     │
│                     │       │  ✅ Up-to-date      │
│                     │       │  ✅ Grounded        │
└─────────────────────┘       └─────────────────────┘

The RAG Pipeline Architecture

A complete RAG system consists of two main phases: indexing (offline) and retrieval-generation (online).

┌────────────────────────────────────────────────────────┐
│                   RAG ARCHITECTURE                     │
└────────────────────────────────────────────────────────┘

📋 INDEXING PHASE (Offline)
┌─────────────────────────────────────────────────────┐
│                                                     │
│  📄 Documents → 🔪 Chunking → 🧮 Embedding →        │
│  💾 Vector Store                                    │
│                                                     │
└─────────────────────────────────────────────────────┘
         ↓ (creates searchable index)

🔍 RETRIEVAL-GENERATION PHASE (Online)
┌─────────────────────────────────────────────────────┐
│                                                     │
│  ❓ User Query → 🧮 Query Embedding →               │
│  🔎 Vector Search → 📚 Retrieved Chunks →           │
│  🤖 LLM (query + context) → ✅ Response             │
│                                                     │
└─────────────────────────────────────────────────────┘

Document Chunking Strategies

Chunking is the process of breaking documents into smaller, semantically meaningful pieces. This is critical because:

  1. Embedding models have token limits (typically 512-8192 tokens)
  2. Retrieval precision improves with focused chunks
  3. Context windows are limited in LLMs

| Strategy | Method | Best For | Considerations |
|----------|--------|----------|----------------|
| Fixed-Size | Split every N tokens/characters | Uniform content, technical docs | May break semantic units |
| Sentence-Based | Split on sentence boundaries | Narrative text, articles | Preserves meaning but varies in size |
| Paragraph-Based | Split on paragraph breaks | Well-structured documents | Chunks may be too large |
| Semantic | Use embeddings to find natural breaks | Complex documents, mixed content | Computationally expensive |
| Recursive | Try multiple delimiters hierarchically | Code, structured data | Most flexible, widely used |

💡 Pro Tip: Use overlapping chunks (50-200 token overlap) to prevent information loss at boundaries. If a key concept is split across chunks, the overlap ensures it appears complete in at least one chunk.
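The fixed-size-with-overlap idea can be sketched in a few lines of plain Python. This version splits by characters as a stand-in for tokens; a real pipeline would count tokens with the model's tokenizer:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk reached the end of the text
    return chunks

doc = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(doc)
# the tail of each chunk reappears at the head of the next one
assert chunks[0][-200:] == chunks[1][:200]
```

Because every boundary region appears in two chunks, a sentence cut at position 1000 still shows up whole in the chunk starting at 800.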

Embeddings and Vector Representations

Embeddings transform text into high-dimensional vectors (typically 384-1536 dimensions) where semantically similar text appears closer together in vector space.

EMBEDDING TRANSFORMATION

"cat"     →  [0.2, 0.8, 0.1, ...] ┐
                                   ├─ Close in vector space
"kitten"  →  [0.3, 0.7, 0.2, ...] ┘  (similar meaning)

"dog"     →  [0.4, 0.6, 0.3, ...] ← Moderately close

"car"     →  [0.9, 0.1, 0.8, ...] ← Far apart
                                     (different meaning)
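The "closer in vector space" intuition is usually quantified with cosine similarity. A toy check using the illustrative 3-dimensional vectors above (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

cat, kitten, car = [0.2, 0.8, 0.1], [0.3, 0.7, 0.2], [0.9, 0.1, 0.8]

# semantically similar words score higher
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, car)
print(round(cosine_similarity(cat, kitten), 3))  # ≈ 0.978
print(round(cosine_similarity(cat, car), 3))     # ≈ 0.339
```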

Popular embedding models in 2026:

| Model | Dimensions | Max Tokens | Best Use Case |
|-------|------------|------------|---------------|
| text-embedding-3-small | 1536 | 8191 | Cost-effective, general purpose |
| text-embedding-3-large | 3072 | 8191 | Highest accuracy, complex queries |
| voyage-2 | 1024 | 4000 | Domain-specific retrieval |
| cohere-embed-v3 | 1024 | 512 | Multilingual support |
| BGE-large-en-v1.5 | 1024 | 512 | Open-source, self-hosted |

Vector Databases

Vector databases store embeddings and enable fast similarity search. Unlike traditional databases that search for exact matches, vector DBs find semantically similar content using distance metrics.

VECTOR SIMILARITY SEARCH

      Query Vector
           ⭐
          ╱│╲
         ╱ │ ╲
        ╱  │  ╲
       🔵  🔵  🔵  ← Top-k nearest neighbors
     ╱     │     ╲   (most relevant chunks)
    🔵     🔵     🔵
   ╱       │       ╲
  🔵       🔵       🔵

Distance metrics:
• Cosine Similarity (most common)
• Euclidean Distance
• Dot Product

Leading vector database options:

| Database | Type | Strengths | Ideal For |
|----------|------|-----------|-----------|
| Pinecone | Managed cloud | Fully managed, scalable, easy setup | Production apps, startups |
| Weaviate | Open-source/cloud | GraphQL API, hybrid search, modules | Complex data relationships |
| Qdrant | Open-source | Rust-based, fast, filtering support | Self-hosted, performance-critical |
| Chroma | Open-source | Simple API, Python-first, lightweight | Development, prototyping |
| pgvector | PostgreSQL extension | Integrates with existing PostgreSQL | Projects already using Postgres |
| Milvus | Open-source | Highly scalable, distributed | Large-scale enterprise apps |

Retrieval Strategies

Beyond basic vector similarity search, advanced RAG systems employ sophisticated retrieval techniques:

1. Hybrid Search

Combines dense retrieval (vector similarity) with sparse retrieval (keyword/BM25) for better precision:

┌─────────────────────────────────────┐
│         HYBRID SEARCH               │
├─────────────────────────────────────┤
│                                     │
│  Query: "How to optimize React?"    │
│           │                         │
│      ┌────┴────┐                    │
│      ↓         ↓                    │
│  Vector     Keyword                 │
│  Search     Search (BM25)           │
│      │         │                    │
│      ↓         ↓                    │
│  Results A  Results B               │
│      │         │                    │
│      └────┬────┘                    │
│           ↓                         │
│    Reciprocal Rank Fusion           │
│           ↓                         │
│    Final Ranked Results             │
│                                     │
└─────────────────────────────────────┘

2. Contextual Compression

Retrieve large chunks but extract only relevant portions before sending to the LLM:

# Pseudo-code: retrieve broadly, then compress before generation
initial_docs = vector_store.similarity_search(query, k=10)
compressed_docs = compressor.compress(initial_docs, query)
response = llm.generate(query, context=compressed_docs)

3. Multi-Query Retrieval

Generate multiple variations of the user's query to capture different phrasings:

  • Original: "How do I improve my code?"
  • Variant 1: "What are code optimization techniques?"
  • Variant 2: "Best practices for clean code"
  • Variant 3: "How to refactor legacy code"
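A minimal way to merge the variants' results is to retrieve for each one and keep every document's best rank. A sketch, where `retrieve` is a stand-in for your actual retriever (the variants themselves are typically produced by one LLM call):

```python
def multi_query_retrieve(query_variants, retrieve, k=5):
    """Run retrieval per variant; deduplicate, keeping each doc's best rank."""
    best_rank = {}
    for variant in query_variants:
        for rank, doc_id in enumerate(retrieve(variant)):
            if rank < best_rank.get(doc_id, float("inf")):
                best_rank[doc_id] = rank
    return sorted(best_rank, key=best_rank.get)[:k]

# toy retriever: fixed rankings per query (stand-in for a vector search)
fake_results = {
    "How do I improve my code?": ["d1", "d2", "d3"],
    "What are code optimization techniques?": ["d2", "d4", "d1"],
}
merged = multi_query_retrieve(list(fake_results), fake_results.get)
print(merged)  # d1 and d2 surface first: each hit rank 0 in some variant
```

Documents that rank highly under any phrasing float to the top, which is the point of issuing multiple queries.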

4. Parent-Child Chunking

Store small chunks for retrieval but include surrounding context when found:

PARENT-CHILD STRATEGY

┌─────────────────────────────────────┐
│        Parent Document              │ ← Stored separately
├─────────────────────────────────────┤
│  [Chunk 1] [Chunk 2] [Chunk 3]      │
│              ↑                      │
│         Retrieved chunk             │
└─────────────────────────────────────┘
              ↓
Return entire parent document or
surrounding N chunks for full context
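In code this is just a child-to-parent lookup after the hit. A sketch, where `search_children` stands in for the vector search over the small chunks:

```python
# Small chunks are indexed for precise retrieval; the parent is returned as context.
parents = {"doc1": "Full parent text: intro, the key policy detail, and conclusion."}
children = [
    {"id": "c1", "parent": "doc1", "text": "intro"},
    {"id": "c2", "parent": "doc1", "text": "the key policy detail"},
]

def retrieve_with_parent(query, search_children):
    hit = search_children(query)      # best-matching small chunk
    return parents[hit["parent"]]     # swap in the full parent document

# toy child search: substring match stands in for vector similarity
context = retrieve_with_parent(
    "policy", lambda q: next(c for c in children if q in c["text"])
)
assert context == parents["doc1"]
```

The retrieval stays precise because matching happens on small chunks, while the LLM still sees the surrounding context it needs to answer well.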

Prompt Engineering for RAG

The prompt template determines how retrieved context is presented to the LLM. A well-designed prompt includes:

  1. System instructions (role, behavior guidelines)
  2. Retrieved context (formatted clearly)
  3. User query (the actual question)
  4. Output format (structured response expectations)

Example RAG prompt template:

You are a helpful assistant answering questions based on provided context.

RULES:
- Answer only using information from the context below
- If the answer isn't in the context, say "I don't have enough information"
- Cite specific parts of the context in your answer
- Be concise and accurate

CONTEXT:
{retrieved_chunks}

QUESTION:
{user_query}

ANSWER:

💡 Pro Tip: Include attribution markers in your chunks (e.g., source document, page number) so the LLM can cite sources in its response.
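Filling a template like the one above is plain string formatting. A sketch that also prepends the attribution marker the tip describes (the chunk field names are illustrative):

```python
RAG_TEMPLATE = """You are a helpful assistant answering questions based on provided context.

RULES:
- Answer only using information from the context below
- If the answer isn't in the context, say "I don't have enough information"

CONTEXT:
{retrieved_chunks}

QUESTION:
{user_query}

ANSWER:"""

def build_prompt(chunks, query):
    # attach an attribution marker to each chunk so the LLM can cite it
    context = "\n\n".join(f"[Source: {c['source']}] {c['text']}" for c in chunks)
    return RAG_TEMPLATE.format(retrieved_chunks=context, user_query=query)

prompt = build_prompt(
    [{"source": "handbook.pdf, p.12",
      "text": "Employees may work remotely three days a week."}],
    "What is the remote work policy?",
)
assert "[Source: handbook.pdf, p.12]" in prompt
```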

Metadata Filtering

Metadata enhances retrieval by allowing pre-filtering before vector search:

| Metadata Type | Example | Use Case |
|---------------|---------|----------|
| Temporal | date, timestamp | "Show results from last 6 months" |
| Source | document_id, url, author | "Search only in technical docs" |
| Categorical | department, topic, language | "Filter by HR department" |
| User-specific | user_id, permissions | "Show only documents I can access" |
| Content-based | doc_type, file_format | "Search only PDF files" |

Filtered vector search flow:

┌─────────────────────────────────────┐
│  Query: "2024 sales strategy"       │
│                                     │
│  Filters:                           │
│  • year = 2024                      │
│  • department = "Sales"             │
│  • type = "Strategy Doc"            │
└──────────────┬──────────────────────┘
               ↓
     Filter metadata FIRST
     (reduces search space)
               ↓
     Vector similarity search
     (on filtered subset)
               ↓
     Top-k relevant results

Evaluation Metrics

Measuring RAG system performance requires evaluating both retrieval quality and generation quality:

Retrieval Metrics:

  • Precision@k: What fraction of top-k results are relevant?
  • Recall@k: What fraction of all relevant docs are in top-k?
  • MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant result
  • NDCG (Normalized Discounted Cumulative Gain): Considers ranking quality
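The retrieval metrics are short enough to implement directly. A sketch where `retrieved` is a ranked list of doc ids and `relevant` the ground-truth set:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs found in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(ranked_lists):
    """Mean of 1/rank of the first relevant hit, over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in ranked_lists:
        for rank, d in enumerate(retrieved, start=1):
            if d in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

retrieved, relevant = ["a", "b", "c", "d"], {"b", "d"}
assert precision_at_k(retrieved, relevant, 2) == 0.5  # one of top-2 is relevant
assert recall_at_k(retrieved, relevant, 2) == 0.5     # found 1 of 2 relevant docs
assert mrr([(retrieved, relevant)]) == 0.5            # first hit at rank 2
```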

Generation Metrics:

  • Faithfulness: Does the answer align with retrieved context?
  • Answer Relevancy: Does it address the user's question?
  • Context Precision: Are retrieved chunks actually relevant?
  • Context Recall: Was all necessary information retrieved?

🔧 Try this: Use frameworks like RAGAS (RAG Assessment) or TruLens to automatically evaluate your system:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

result = evaluate(
    dataset=test_questions,
    metrics=[faithfulness, answer_relevancy]
)
print(result.scores)

Real-World Implementation Examples 🔧

Example 1: Basic RAG System with LangChain

Let's build a minimal RAG system for a company knowledge base:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader

# Step 1: Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt")
documents = loader.load()

# Step 2: Chunk documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)

# Step 3: Create embeddings and store in vector DB
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Step 4: Create retrieval chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # All context in one prompt
    retriever=vector_store.as_retriever(
        search_kwargs={"k": 4}  # Retrieve top 4 chunks
    ),
    return_source_documents=True
)

# Step 5: Query the system
query = "What is our remote work policy?"
result = qa_chain({"query": query})

print(f"Answer: {result['result']}")
print(f"\nSources: {result['source_documents']}")

What's happening:

  1. Document loading: Reads all .txt files from a directory
  2. Chunking: Splits into 1000-character chunks with 200-char overlap
  3. Embedding: Converts chunks to vectors using OpenAI's model
  4. Storage: Persists vectors in Chroma (local SQLite-based vector DB)
  5. Retrieval: Finds 4 most similar chunks to the query
  6. Generation: LLM generates answer using retrieved context

Example 2: Advanced RAG with Metadata Filtering

Adding temporal and source filtering for a news search system:

import chromadb
from sentence_transformers import SentenceTransformer

# Embedding model (assumption: a local sentence-transformers model; swap in your own)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize a persistent Chroma client with metadata support
client = chromadb.PersistentClient(path="./news_db")

collection = client.get_or_create_collection(
    name="news_articles",
    metadata={"description": "News articles with temporal metadata"}
)

# Add documents with rich metadata
articles = [
    {
        "text": "New AI regulations announced in EU...",
        "metadata": {
            "source": "TechNews",
            "date": "2024-03-15",
            "category": "regulation",
            "author": "Jane Smith"
        }
    },
    # ... more articles
]

for idx, article in enumerate(articles):
    embedding = embedding_model.encode(article["text"])
    collection.add(
        embeddings=[embedding.tolist()],
        documents=[article["text"]],
        metadatas=[article["metadata"]],
        ids=[f"article_{idx}"]
    )

# Query with metadata filtering
query = "AI regulations"
query_embedding = embedding_model.encode(query)

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
    where={
        "$and": [
            {"date": {"$gte": "2024-01-01"}},  # After Jan 1, 2024
            {"category": "regulation"},         # Only regulation news
            {"source": {"$in": ["TechNews", "AIDaily"]}}  # Trusted sources
        ]
    }
)

print(f"Found {len(results['documents'][0])} relevant articles")
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
    print(f"\n{meta['date']} - {meta['source']}")
    print(doc[:200] + "...")

Key improvements:

  • Structured metadata: Each chunk has date, source, category, author
  • Complex filtering: Combine multiple conditions with $and, $or
  • Pre-filtering: Reduces search space before vector comparison
  • Provenance tracking: Users see where information comes from

Example 3: Hybrid Search with Reciprocal Rank Fusion

Combining dense and sparse retrieval for optimal results:

import weaviate
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRAG:
    def __init__(self, documents):
        self.documents = documents
        
        # BM25 for keyword search
        tokenized_docs = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized_docs)
        
        # Weaviate for vector search
        self.client = weaviate.Client("http://localhost:8080")
        self.setup_weaviate_schema()   # schema and indexing helpers elided for brevity
        self.index_documents()
    
    def reciprocal_rank_fusion(self, rankings_list, k=60):
        """Combine multiple rankings using RRF"""
        fused_scores = {}
        
        for rankings in rankings_list:
            for rank, doc_id in enumerate(rankings, start=1):
                if doc_id not in fused_scores:
                    fused_scores[doc_id] = 0
                fused_scores[doc_id] += 1 / (k + rank)
        
        # Sort by fused score
        sorted_docs = sorted(
            fused_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return [doc_id for doc_id, score in sorted_docs]
    
    def search(self, query, top_k=5):
        # Sparse retrieval (BM25)
        tokenized_query = query.split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        bm25_rankings = np.argsort(bm25_scores)[::-1][:20]
        
        # Dense retrieval (Vector)
        vector_results = self.client.query.get(
            "Document",
            ["content", "doc_id"]
        ).with_near_text({
            "concepts": [query]
        }).with_limit(20).do()
        
        vector_rankings = [
            int(doc["doc_id"]) 
            for doc in vector_results["data"]["Get"]["Document"]
        ]
        
        # Fuse results
        final_rankings = self.reciprocal_rank_fusion(
            [bm25_rankings.tolist(), vector_rankings]
        )
        
        return [
            self.documents[doc_id] 
            for doc_id in final_rankings[:top_k]
        ]

# Usage
rag = HybridRAG(my_documents)
results = rag.search("machine learning best practices")

Why this works better:

  • BM25 catches exact term matches and rare keywords
  • Vector search captures semantic similarity and synonyms
  • RRF gives balanced weight to both approaches
  • Result: Higher precision and recall than either method alone

Example 4: Production RAG with Monitoring

A production-ready system with observability:

from datetime import datetime

from langchain.embeddings import OpenAIEmbeddings
from trulens_eval import Tru
from trulens_eval.feedback import Groundedness
import openai
import pinecone

class ProductionRAG:
    def __init__(self):
        # Initialize components
        self.embeddings = OpenAIEmbeddings()
        
        pinecone.init(
            api_key="your-key",
            environment="us-west1-gcp"
        )
        self.vector_index = pinecone.Index("prod-knowledge-base")
        
        # Set up monitoring
        self.tru = Tru()
        self.groundedness = Groundedness(
            groundedness_provider=openai
        )
        
    def retrieve(self, query, filters=None, k=5):
        """Retrieve with optional metadata filtering"""
        query_vector = self.embeddings.embed_query(query)
        
        results = self.vector_index.query(
            vector=query_vector,
            top_k=k,
            include_metadata=True,
            filter=filters
        )
        
        return [
            {
                "text": match["metadata"]["text"],
                "source": match["metadata"]["source"],
                "score": match["score"]
            }
            for match in results["matches"]
        ]
    
    def generate(self, query, context_chunks):
        """Generate with monitoring"""
        context = "\n\n".join([
            f"[Source: {chunk['source']}]\n{chunk['text']}"
            for chunk in context_chunks
        ])
        
        prompt = f"""Answer based only on the context below.
        
Context:
{context}

Question: {query}

Answer:"""
        
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1
        )
        
        answer = response.choices[0].message.content
        
        # Log for monitoring
        self.log_interaction(query, context_chunks, answer)
        
        return answer
    
    def log_interaction(self, query, context, answer):
        """Log to monitoring system (TruLens calls sketched; check the current API)"""
        # Check groundedness (does answer match context?)
        groundedness_score = self.groundedness.score(
            source=context,
            statement=answer
        )
        
        # Log to TruLens for dashboard visualization
        self.tru.log({
            "query": query,
            "context_count": len(context),
            "answer_length": len(answer),
            "groundedness": groundedness_score,
            "timestamp": datetime.now().isoformat()
        })
        
        # Alert if quality metrics drop (send_alert: your alerting hook, not shown)
        if groundedness_score < 0.7:
            self.send_alert(
                f"Low groundedness detected: {groundedness_score}"
            )

# Deployment
rag = ProductionRAG()

# Serve via API
from fastapi import FastAPI
app = FastAPI()

@app.post("/query")
async def query_rag(query: str, filters: dict = None):
    chunks = rag.retrieve(query, filters)
    answer = rag.generate(query, chunks)
    
    return {
        "answer": answer,
        "sources": [c["source"] for c in chunks]
    }

Production features:

  • ✅ Managed vector database (Pinecone) for reliability
  • ✅ Monitoring and observability (TruLens dashboard)
  • ✅ Quality metrics (groundedness, relevance)
  • ✅ Alerting for degraded performance
  • ✅ API interface for integration
  • ✅ Source attribution in responses

Common Mistakes ⚠️

1. Chunks Too Large or Too Small

❌ Mistake: Using 5000-token chunks that exceed context windows, or 50-token chunks that lack context.

✅ Solution: Aim for 500-1000 tokens per chunk with 10-20% overlap. Test different sizes with your specific content.

2. Ignoring Metadata

❌ Mistake: Storing only text without source, date, or category information.

✅ Solution: Always include metadata for filtering, attribution, and debugging. Store at minimum: source, timestamp, document_id.

3. Not Evaluating Retrieval Quality

❌ Mistake: Assuming vector search always returns relevant results.

✅ Solution: Create a test set of query-document pairs. Measure precision@k and recall@k regularly. A/B test different retrieval strategies.

4. Single Retrieval Strategy

❌ Mistake: Relying only on vector similarity without considering keywords.

✅ Solution: Use hybrid search combining dense (vector) and sparse (BM25) retrieval, especially for domain-specific terminology.

5. Poor Prompt Engineering

❌ Mistake: Passing raw retrieved chunks without clear instructions.

✅ Solution: Structure prompts with:

  • Clear role definition
  • Explicit rules ("only use context," "cite sources")
  • Formatted context sections
  • Output format specifications

6. No Context Compression

❌ Mistake: Sending all retrieved chunks verbatim, wasting context window.

✅ Solution: Use contextual compression to extract only relevant sentences from each chunk, or implement re-ranking to prioritize best chunks.

7. Ignoring Token Costs

❌ Mistake: Retrieving 10,000 tokens of context for every query.

✅ Solution: Balance retrieval depth with cost. Start with k=3-5 chunks. Use semantic caching for repeated queries.

8. Static Chunking Without Document Structure

❌ Mistake: Splitting a structured document (with headers, tables) using fixed character counts.

✅ Solution: Use document-aware chunking that respects structure. For code, split by functions. For articles, by sections.
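For markdown-style documents, structure-aware splitting can be as simple as breaking on headers so each section stays intact. A minimal sketch:

```python
def split_by_headers(markdown_text):
    """Split on '#' headers, keeping each header together with its body."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))  # flush the previous section
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

doc = "# Remote Work\nThree days a week.\n# Expenses\nSubmit within 30 days."
assert split_by_headers(doc) == [
    "# Remote Work\nThree days a week.",
    "# Expenses\nSubmit within 30 days.",
]
```

Each chunk now carries its own heading, which also gives the embedding model useful topical context.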

Key Takeaways 🎯

  1. RAG enhances LLMs by retrieving relevant external information before generation, reducing hallucinations and enabling access to current data.

  2. The pipeline has two phases: offline indexing (chunk → embed → store) and online retrieval-generation (query → search → generate).

  3. Chunking strategy matters: Balance semantic completeness with token limits. Use overlap to prevent information loss at boundaries.

  4. Embeddings capture semantics: Similar concepts cluster in vector space. Choose embedding models based on your language, domain, and performance needs.

  5. Vector databases enable fast similarity search: Pinecone, Weaviate, Qdrant, and others provide scalable infrastructure for production RAG.

  6. Advanced retrieval improves results: Hybrid search, multi-query, parent-child chunking, and contextual compression all boost accuracy.

  7. Metadata enables filtering: Add temporal, source, and categorical metadata to narrow search before vector comparison.

  8. Evaluation is essential: Measure both retrieval quality (precision, recall) and generation quality (faithfulness, relevance).

  9. Prompt engineering controls behavior: Structure prompts clearly with role, rules, context, and output format sections.

  10. Production requires monitoring: Track groundedness, latency, token usage, and user satisfaction to maintain quality over time.

📋 Quick Reference Card: RAG Architecture

| Component | Purpose | Key Choices |
|-----------|---------|-------------|
| Chunking | Break docs into searchable pieces | 500-1000 tokens, 10-20% overlap |
| Embeddings | Convert text to vectors | text-embedding-3-small, voyage-2, BGE |
| Vector DB | Store and search vectors | Pinecone (managed), Qdrant (self-hosted) |
| Retrieval | Find relevant chunks | Hybrid search, k=3-5, metadata filters |
| Generation | Create answer from context | Structured prompt, temp=0-0.3 |
| Evaluation | Measure quality | Precision@k, faithfulness, RAGAS |

Common Pattern:

Documents → Chunk (1000 tok) → Embed (384-1536 dim) →
Store (Vector DB) → Query → Retrieve (top-5) →
Prompt (context + query) → LLM → Answer

Cost Optimization:

  • Cache embeddings and responses
  • Use smaller embedding models for dev
  • Compress context before generation
  • Implement semantic caching
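Semantic caching reuses an earlier answer when a new query embeds close to a cached one. A minimal sketch, where the `embed` callable is a stand-in for your embedding model:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # query -> vector
        self.threshold = threshold
        self.entries = []           # (vector, answer) pairs

    def get(self, query):
        q = self.embed(query)
        for vec, answer in self.entries:
            if _cos(q, vec) >= self.threshold:
                return answer       # near-duplicate query: skip the LLM call
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))

# toy embedder: fixed 2-d vectors (a real one returns model embeddings)
vectors = {"hi": [1.0, 0.0], "hello": [0.99, 0.14], "bye": [0.0, 1.0]}
cache = SemanticCache(vectors.get)
cache.put("hi", "greeting answer")
assert cache.get("hello") == "greeting answer"  # close enough to reuse
assert cache.get("bye") is None                 # unrelated query misses
```

The threshold trades freshness for savings: higher values reuse answers only for near-identical queries; lower values save more tokens but risk stale or mismatched responses.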

📚 Further Study

  1. LangChain RAG Documentation - https://python.langchain.com/docs/use_cases/question_answering/ - Comprehensive tutorials and patterns for building RAG systems with LangChain framework.

  2. Pinecone Learning Center - https://www.pinecone.io/learn/vector-database/ - Deep dives into vector database concepts, similarity search algorithms, and production deployment strategies.

  3. RAG Papers Collection - https://github.com/microsoft/RAG-Survey - Microsoft's curated list of academic papers covering RAG architectures, evaluation methods, and advanced techniques.


💡 Remember: RAG is not a single technique but an architectural pattern. Start simple with basic retrieval, measure performance, then add complexity (hybrid search, re-ranking, compression) only where needed. The best RAG system is one that reliably answers your users' questions with verifiable, accurate information.