
Document Processing

Handle diverse data sources (PDFs, databases, APIs, logs) with appropriate parsers and extractors.

Document Processing in AI Search & RAG Systems

Master document processing techniques with free flashcards and practical examples that prepare you for building production-ready AI search systems. This lesson covers text extraction, document parsing, chunking strategies, and metadata enrichment: essential skills for creating effective Retrieval-Augmented Generation (RAG) applications.

Welcome to Document Processing 📄

Document processing is the critical first stage of any AI search or RAG pipeline. Before you can search, retrieve, or generate answers from your data, you must transform raw documents into a structured, searchable format. This lesson will guide you through the entire document processing workflow, from ingesting PDFs and Word files to preparing clean, semantically meaningful chunks ready for embedding and indexing.

Whether you're building a knowledge base search engine, a chatbot that answers from company documents, or a research assistant that synthesizes information from multiple sources, robust document processing determines the quality of your entire system. Poor processing leads to garbage-in-garbage-out scenarios where your AI provides inaccurate or incomplete answers.

💡 Key insight: Document processing typically consumes 40-60% of initial RAG development time, but investing effort here pays massive dividends in retrieval accuracy and user satisfaction.

Core Concepts in Document Processing 🔍

1. Document Ingestion & Format Handling

Document ingestion is the process of loading files from various sources (local storage, cloud buckets, databases, APIs) and extracting their content. Different formats require specialized parsing approaches:

| Format          | Challenges                                | Common Tools                |
|-----------------|-------------------------------------------|-----------------------------|
| 📄 PDF          | Complex layouts, scanned images, tables   | PyPDF2, pdfplumber, PyMuPDF |
| 📝 DOCX         | Embedded objects, formatting preservation | python-docx, mammoth        |
| 🌐 HTML         | Tag noise, JavaScript content, ads        | BeautifulSoup, Trafilatura  |
| 📊 Spreadsheets | Multiple sheets, formulas, merged cells   | openpyxl, pandas            |
| 📷 Images       | Text extraction from visual content       | Tesseract OCR, AWS Textract |

Text Extraction involves pulling readable text from these formats while handling:

  • Encoding issues: UTF-8, Latin-1, and other character sets
  • Layout preservation: Maintaining paragraph structure, headings, lists
  • Special elements: Tables, captions, footnotes, headers/footers
  • OCR requirements: Scanned PDFs need Optical Character Recognition

💡 Pro tip: Always validate extracted text quality on a sample before processing your entire corpus. A 5-document test can reveal systematic extraction errors.
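
For scanned PDFs with no embedded text layer, plain extraction returns nothing and you need OCR. Below is a minimal sketch, assuming the pdf2image and pytesseract packages (plus local Poppler and Tesseract installs) are available; adapt it to whatever OCR tooling your stack uses.

## Minimal OCR sketch for scanned PDFs (assumes pdf2image + pytesseract,
## plus local Poppler and Tesseract binaries, are installed)
from pdf2image import convert_from_path
import pytesseract

def ocr_scanned_pdf(pdf_path, dpi=300):
    """Rasterize each page, then run Tesseract OCR over the page image."""
    page_images = convert_from_path(pdf_path, dpi=dpi)
    page_texts = [pytesseract.image_to_string(image) for image in page_images]
    return "\n\n".join(page_texts)

## text = ocr_scanned_pdf("scanned_contract.pdf")  # illustrative filename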

2. Text Cleaning & Normalization

Raw extracted text contains noise that degrades search quality. Text cleaning removes or standardizes:

Common cleaning operations:

  • Remove boilerplate content: Headers, footers, page numbers, copyright notices
  • Strip formatting artifacts: Extra whitespace, line breaks, control characters
  • Normalize unicode: Convert similar characters (smart quotes → straight quotes)
  • Handle special symbols: Mathematical notation, currency symbols, emojis
  • Fix encoding errors: Mojibake (garbled text from encoding mismatches)

Normalization techniques:

  • Case folding: Convert to lowercase for case-insensitive matching
  • Punctuation handling: Remove or standardize based on use case
  • Whitespace normalization: Multiple spaces/tabs → single space
  • Line break standardization: \r\n, \n, \r → consistent format

⚠️ Warning: Over-aggressive cleaning can remove important context! Preserve domain-specific terminology, code snippets, and structured data.

## Example: Basic text cleaning pipeline
import re
import unicodedata

def clean_text(text):
    # Normalize unicode (NFKC form)
    text = unicodedata.normalize('NFKC', text)
    
    # Remove control characters except newlines/tabs
    text = ''.join(ch for ch in text if ch == '\n' or ch == '\t' or not unicodedata.category(ch).startswith('C'))
    
    # Normalize whitespace
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Remove common boilerplate patterns
    text = re.sub(r'Page \d+ of \d+', '', text)
    
    return text.strip()

3. Document Chunking Strategies

Chunking divides documents into smaller segments for embedding and retrieval. This is crucial because:

  • Embedding models have token limits (e.g., 512-8192 tokens)
  • Smaller chunks provide more precise retrieval (return relevant paragraphs, not entire documents)
  • Chunks must be semantically complete (contain enough context to be understood independently)

🔺 Chunking Strategies Comparison

| Strategy                      | Pros                                | Cons                               | Best For                     |
|-------------------------------|-------------------------------------|------------------------------------|------------------------------|
| Fixed-size (e.g., 512 tokens) | Simple, predictable, fast           | Breaks mid-sentence, loses context | Large homogeneous corpora    |
| Sentence-based                | Natural boundaries, coherent        | Variable sizes, some too short     | News articles, blogs         |
| Paragraph-based               | Semantic completeness               | High size variance                 | Well-structured documents    |
| Recursive                     | Respects structure, balanced sizes  | More complex logic                 | Technical docs, books        |
| Semantic                      | Topic coherence, best retrieval     | Computationally expensive          | Research papers, legal docs  |

Overlap strategy: Include overlap between chunks (e.g., 50-100 tokens) to preserve context across boundaries. This prevents information loss when key concepts span chunk edges.

CHUNKING WITH OVERLAP

┌──────────────────────────────┐
│   Chunk 1 (500 tokens)       │
│  "...machine learning is...  │
│   ...neural networks can..." │
└──────────────┬───────────────┘
               │ Overlap (50 tokens)
               │ "neural networks can"
┌──────────────┴───────────────┐
│   Chunk 2 (500 tokens)       │
│  "neural networks can...     │
│   ...transformers use..."    │
└──────────────┬───────────────┘
               │ Overlap (50 tokens)
┌──────────────┴───────────────┐
│   Chunk 3 (500 tokens)       │
└──────────────────────────────┘
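
As a concrete illustration of the overlap strategy above, here is a minimal sliding-window chunker. It counts whitespace-separated words as a stand-in for tokens, an assumption made for brevity; production code would normally count model tokens instead.

## Minimal sliding-window chunker (word-based for brevity)
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap  # assumes chunk_size > overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks

## chunks = chunk_with_overlap(document_text, chunk_size=500, overlap=50)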

Recursive chunking attempts to split on natural boundaries in order:

  1. Document sections (headings like #, ##, ###)
  2. Paragraphs (double line breaks)
  3. Sentences (periods, question marks)
  4. Words (if still too large)

This maintains semantic coherence better than naive fixed-size splitting.

💡 Chunking rule of thumb: Aim for 200-512 tokens per chunk for most RAG applications. Smaller chunks (100-200) work better for precise fact retrieval; larger chunks (512-1000) better for context-heavy generation.
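
To hit those token targets you need to measure chunk length in tokens rather than characters. A minimal sketch, assuming the tiktoken and langchain packages are installed, plugs a token counter into the recursive splitter via its length_function parameter:

## Token-aware recursive splitting (assumes tiktoken + langchain are installed)
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

encoding = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    """Measure length in tokens rather than characters."""
    return len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,        # target ~400 tokens per chunk
    chunk_overlap=50,      # ~50-token overlap across boundaries
    length_function=token_count,
    separators=["\n\n", "\n", ". ", " ", ""]
)

## chunks = splitter.split_text(document_text)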

4. Metadata Extraction & Enrichment

Metadata provides critical context for filtering, ranking, and organizing search results. Beyond the text content itself, extract and store:

Document-level metadata:

  • Source information: File path, URL, database ID
  • Temporal data: Creation date, modification date, publication date
  • Authorship: Author names, organization, department
  • Document type: Report, email, presentation, contract
  • Version/revision: Track document evolution
  • Access control: Permissions, security classifications

Content metadata:

  • Language: Detected language code (en, es, fr, etc.)
  • Topic/category: Manual or auto-classified subject matter
  • Keywords/tags: Extracted or assigned descriptors
  • Entities: People, organizations, locations, dates (NER)
  • Statistics: Word count, reading level, sentiment score

Chunk-level metadata:

  • Position: Chunk index, page number, section heading
  • Parent document ID: Link back to source document
  • Structural role: Introduction, methodology, conclusion, etc.
  • Embedding metadata: Model used, embedding timestamp

METADATA ENRICHMENT PIPELINE

  📄 Raw Document
       │
       ↓
  ┌──────────────────┐
  │ Extract Basic    │
  │ Metadata         │ ← Filename, dates, format
  └────────┬─────────┘
           ↓
  ┌──────────────────┐
  │ Language         │
  │ Detection        │ ← langdetect, fastText
  └────────┬─────────┘
           ↓
  ┌──────────────────┐
  │ Named Entity     │
  │ Recognition      │ ← spaCy, Flair, LLMs
  └────────┬─────────┘
           ↓
  ┌──────────────────┐
  │ Topic/Category   │
  │ Classification   │ ← Zero-shot or trained
  └────────┬─────────┘
           ↓
  ┌──────────────────┐
  │ Custom Business  │
  │ Logic            │ ← Domain rules
  └────────┬─────────┘
           ↓
  📊 Enriched Document + Metadata
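
A minimal sketch of the language-detection and NER stages above, assuming the langdetect and spaCy packages (with the en_core_web_sm model) are installed; swap in whichever detectors and classifiers your pipeline standardizes on:

## Minimal enrichment sketch (assumes: pip install langdetect spacy
## and: python -m spacy download en_core_web_sm)
from langdetect import detect
import spacy

nlp = spacy.load("en_core_web_sm")

def enrich_chunk(chunk: dict) -> dict:
    """Attach language and named-entity metadata to a chunk dict with a 'text' field."""
    text = chunk["text"]
    chunk["language"] = detect(text)
    chunk["entities"] = [
        {"text": ent.text, "label": ent.label_}
        for ent in nlp(text).ents
    ]
    return chunk

## enriched = enrich_chunk({"text": "OpenAI released GPT-4 in March 2023."})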

Metadata enables powerful filtering:

## Query: "machine learning papers from 2023"
results = vector_db.search(
    query_embedding=embed("machine learning"),
    filter={
        "doc_type": "research_paper",
        "year": 2023
    },
    top_k=10
)

5. Handling Special Content Types

Tables & Structured Data

Tables pose unique challenges because their meaning depends on spatial relationships (rows, columns, headers):

Strategies:

  • Linearization: Convert to text: "Row 1: Product=Laptop, Price=$999, Stock=45"
  • Markdown tables: Preserve structure: | Product | Price | Stock |
  • Separate indexing: Treat tables as distinct searchable entities
  • Caption inclusion: Always include table captions/titles for context
  • Column header repetition: Repeat headers for each row to maintain context
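
Here is a minimal sketch of the linearization strategy, assuming the table has already been loaded into a pandas DataFrame; each row becomes a self-contained string that repeats the column headers so it still makes sense after chunking.

## Minimal table linearization sketch (assumes pandas is available)
import pandas as pd

def linearize_table(df: pd.DataFrame, caption: str = "") -> list:
    """Turn each row into a 'Header=Value' string that stands on its own after chunking."""
    rows = []
    for _, row in df.iterrows():
        pairs = ", ".join(f"{col}={row[col]}" for col in df.columns)
        prefix = f"{caption} - " if caption else ""
        rows.append(f"{prefix}Row: {pairs}")
    return rows

df = pd.DataFrame({"Product": ["Laptop"], "Price": ["$999"], "Stock": [45]})
print(linearize_table(df, caption="Inventory")[0])
## Inventory - Row: Product=Laptop, Price=$999, Stock=45
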
Code Blocks

Source code in documentation requires special handling:

  • Preserve formatting: Maintain indentation, line breaks
  • Language detection: Identify programming language for syntax-aware processing
  • Comment extraction: Separate code from explanatory comments
  • Function-level chunking: Split on function/class boundaries
  • Syntax validation: Verify code blocks are complete and valid
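
One simple way to approximate function-level chunking is to split Python source on top-level def/class boundaries with a regular expression, as in the sketch below; this is a heuristic rather than a parser, and other languages (or Python's ast module) would need their own handling.

## Heuristic function-level chunking for Python source
import re

def chunk_python_source(source: str) -> list:
    """Split Python code on top-level def/class boundaries (heuristic, not a parser)."""
    boundaries = [m.start() for m in re.finditer(r"^(def |class )", source, flags=re.MULTILINE)]
    if not boundaries:
        return [source]
    # Anything before the first definition (imports, constants) becomes its own chunk
    starts = ([0] if boundaries[0] > 0 else []) + boundaries
    chunks = [source[a:b] for a, b in zip(starts, starts[1:] + [len(source)])]
    return [c.strip() for c in chunks if c.strip()]
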
Mathematical Notation

LaTeX and equations:

  • Convert LaTeX to readable format: $\frac{a}{b}$ → "a divided by b"
  • Preserve for technical users who understand notation
  • Include text descriptions alongside equations
  • Extract variable definitions and meanings
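
A minimal sketch of the LaTeX-to-readable-text conversion, assuming the pylatexenc package is installed; keep the original notation alongside the converted text for technical users.

## Minimal LaTeX-to-text sketch (assumes the pylatexenc package is installed)
from pylatexenc.latex2text import LatexNodes2Text

converter = LatexNodes2Text()
latex_snippet = r"The mean squared error is $\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$."
readable = converter.latex_to_text(latex_snippet)
print(readable)  # plain-text rendering; store it alongside the original LaTeX
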
Images & Diagrams

Visual content strategies:

  • OCR for embedded text: Extract text from diagrams, screenshots
  • Image captioning: Use vision-language models (CLIP, BLIP) to generate descriptions
  • Alt-text: Use existing alt-text from HTML, DOCX accessibility tags
  • Reference in text: Link image descriptions to surrounding text context

6. Quality Validation & Error Handling

Document processing failures are common. Implement robust validation:

Pre-processing validation:

  • ✅ File exists and is readable
  • ✅ Format matches expected type (not renamed .doc as .pdf)
  • ✅ File size is reasonable (not corrupted or truncated)
  • ✅ Encoding is detectable

Post-extraction validation:

  • ✅ Text length > minimum threshold (not empty)
  • ✅ Character diversity (not all special characters)
  • ✅ Language matches expected (for multi-lingual systems)
  • ✅ No excessive repetition (OCR artifacts)

Chunk validation:

  • ✅ Chunk size within bounds
  • ✅ Chunks contain complete sentences
  • ✅ No excessive overlap
  • ✅ Metadata fields populated
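
A minimal sketch that turns the post-extraction checks into code; the thresholds are illustrative and should be tuned on your own corpus.

## Minimal post-extraction validation sketch (thresholds are illustrative)
def validate_extraction(text: str) -> list:
    """Return a list of validation problems; an empty list means the text passed."""
    problems = []
    if len(text.strip()) < 50:
        problems.append("text too short (possible extraction failure)")
    # Character diversity: mostly letters, digits, and whitespace expected
    clean_ratio = sum(ch.isalnum() or ch.isspace() for ch in text) / max(len(text), 1)
    if clean_ratio < 0.7:
        problems.append("too many special characters (possible OCR or encoding noise)")
    # Excessive repetition is a common OCR artifact
    words = text.lower().split()
    if words and len(set(words)) / len(words) < 0.2:
        problems.append("excessive word repetition")
    return problems

## problems = validate_extraction(extracted_text)
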
ERROR HANDLING WORKFLOW

  📄 Document → Process
                   │
              ┌────┴────┐
              ↓         ↓
           ✅ Success  ❌ Error
              │         │
              │    ┌────┴────┐
              │    ↓         ↓
              │  Retry   Log & Skip
              │    │         │
              │    ↓         ↓
              │  Success?  Manual
              │    │       Review
              │    ↓       Queue
              └────┴────────┘
                   ↓
              Store Result
Logging best practices:

  • Record document ID, processing timestamp, errors encountered
  • Track processing time (identify bottlenecks)
  • Store sample failures for debugging
  • Monitor success rate metrics
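
A minimal sketch of these logging practices using Python's standard logging module; the wrapper and field names are illustrative.

## Minimal processing-log sketch using the standard logging module
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("doc_processing")

def process_with_logging(doc_id: str, process_fn, *args, **kwargs):
    """Record duration and failures for every processing attempt."""
    start = time.perf_counter()
    try:
        result = process_fn(*args, **kwargs)
        logger.info("processed doc_id=%s in %.2fs", doc_id, time.perf_counter() - start)
        return result
    except Exception as exc:
        logger.error("failed doc_id=%s after %.2fs: %s", doc_id, time.perf_counter() - start, exc)
        raise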

Practical Examples 🛠️

Example 1: Building a PDF Processing Pipeline

Let's build a complete pipeline for processing technical documentation PDFs:

import pymupdf  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
import hashlib
from datetime import datetime

class PDFProcessor:
    def __init__(self, chunk_size=500, chunk_overlap=50):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ". ", " ", ""]
        )
    
    def extract_text_with_metadata(self, pdf_path):
        """Extract text and metadata from PDF"""
        doc = pymupdf.open(pdf_path)
        
        # Extract document metadata
        doc_metadata = {
            "source": pdf_path,
            "title": doc.metadata.get("title", ""),
            "author": doc.metadata.get("author", ""),
            "page_count": len(doc),
            "creation_date": doc.metadata.get("creationDate", ""),
            "processed_at": datetime.utcnow().isoformat()
        }
        
        # Extract text page by page
        pages = []
        for page_num, page in enumerate(doc, start=1):
            text = page.get_text("text")
            # Clean extracted text
            text = self.clean_text(text)
            
            pages.append({
                "page_number": page_num,
                "text": text,
                "char_count": len(text)
            })
        
        doc.close()
        return doc_metadata, pages
    
    def clean_text(self, text):
        """Clean and normalize extracted text"""
        import re
        
        # Collapse runs of spaces/tabs, but keep newlines so the
        # header/footer heuristic below can still work line by line
        text = re.sub(r'[ \t]+', ' ', text)
        
        # Remove page headers/footers (simple heuristic)
        lines = text.split('\n')
        cleaned_lines = []
        for line in lines:
            # Skip very short lines (likely headers/footers)
            if len(line.strip()) > 10:
                cleaned_lines.append(line)
        
        return '\n'.join(cleaned_lines).strip()
    
    def chunk_document(self, pages, doc_metadata):
        """Split document into chunks with metadata"""
        # Combine all pages
        full_text = "\n\n".join([p["text"] for p in pages])
        
        # Create chunks
        chunks = self.splitter.split_text(full_text)
        
        # Enrich each chunk with metadata
        enriched_chunks = []
        for idx, chunk_text in enumerate(chunks):
            chunk_id = hashlib.md5(
                f"{doc_metadata['source']}_{idx}".encode()
            ).hexdigest()[:16]
            
            enriched_chunks.append({
                "chunk_id": chunk_id,
                "text": chunk_text,
                "chunk_index": idx,
                "doc_metadata": doc_metadata,
                "char_count": len(chunk_text),
                "word_count": len(chunk_text.split())
            })
        
        return enriched_chunks

## Usage
processor = PDFProcessor(chunk_size=500, chunk_overlap=50)
doc_meta, pages = processor.extract_text_with_metadata("technical_manual.pdf")
chunks = processor.chunk_document(pages, doc_meta)

print(f"Processed {doc_meta['page_count']} pages into {len(chunks)} chunks")
print(f"First chunk: {chunks[0]['text'][:200]}...")

Key features:

  • ✅ Extracts document-level and page-level metadata
  • ✅ Cleans text while preserving structure
  • ✅ Uses recursive splitting for natural boundaries
  • ✅ Generates unique chunk IDs for tracking
  • ✅ Maintains parent document reference

Example 2: Web Page Scraping with Noise Removal

Web pages contain navigation, ads, and boilerplate. Here's how to extract clean article text:

import trafilatura
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

class WebPageProcessor:
    def extract_article(self, url):
        """Extract main article content from web page"""
        # Download page
        response = requests.get(url, timeout=10)
        html = response.text
        
        # Use trafilatura for main content extraction
        # (removes nav, ads, footers automatically)
        text = trafilatura.extract(
            html,
            include_comments=False,
            include_tables=True,
            output_format="txt"
        )
        
        # Extract metadata using BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        
        metadata = {
            "url": url,
            "domain": urlparse(url).netloc,
            "title": self._get_title(soup),
            "description": self._get_meta_description(soup),
            "author": self._get_author(soup),
            "publish_date": self._get_publish_date(soup),
            "lang": self._get_language(soup)
        }
        
        return text, metadata
    
    def _get_title(self, soup):
        # Try multiple methods
        if soup.title:
            return soup.title.string
        og_title = soup.find("meta", property="og:title")
        if og_title:
            return og_title.get("content")
        return ""
    
    def _get_meta_description(self, soup):
        desc = soup.find("meta", {"name": "description"})
        if desc:
            return desc.get("content")
        og_desc = soup.find("meta", property="og:description")
        if og_desc:
            return og_desc.get("content")
        return ""
    
    def _get_author(self, soup):
        # Common author meta tags
        author = soup.find("meta", {"name": "author"})
        if author:
            return author.get("content")
        return ""
    
    def _get_publish_date(self, soup):
        # Look for common date patterns
        date_meta = soup.find("meta", property="article:published_time")
        if date_meta:
            return date_meta.get("content")
        return ""
    
    def _get_language(self, soup):
        html_tag = soup.find("html")
        if html_tag and html_tag.get("lang"):
            return html_tag.get("lang")
        return "en"  # default

## Usage
processor = WebPageProcessor()
text, metadata = processor.extract_article("https://example.com/article")
print(f"Title: {metadata['title']}")
print(f"Author: {metadata['author']}")
print(f"Content length: {len(text)} chars")

Why this works:

  • πŸ“ trafilatura uses ML-based content extraction (not just CSS selectors)
  • πŸ“ Handles diverse site layouts without custom rules
  • πŸ“ Extracts structured metadata from common meta tags
  • πŸ“ Falls back gracefully when metadata is missing

Example 3: Multi-Format Document Loader

A unified loader that handles multiple document types:

from pathlib import Path
import mimetypes
from typing import Dict, List
import docx
import pandas as pd
import pymupdf

class UniversalDocumentLoader:
    def __init__(self):
        self.handlers = {
            'application/pdf': self._load_pdf,
            'application/vnd.openxmlformats-officedocument.wordprocessingml.document': self._load_docx,
            'text/plain': self._load_txt,
            'text/html': self._load_html,
            'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': self._load_xlsx
        }
    
    def load(self, file_path: str) -> Dict:
        """Load document and return text + metadata"""
        path = Path(file_path)
        
        # Detect MIME type
        mime_type, _ = mimetypes.guess_type(file_path)
        
        if mime_type not in self.handlers:
            raise ValueError(f"Unsupported file type: {mime_type}")
        
        # Call appropriate handler
        handler = self.handlers[mime_type]
        text, format_specific_meta = handler(path)
        
        # Add common metadata
        metadata = {
            "source": str(path),
            "filename": path.name,
            "format": mime_type,
            "size_bytes": path.stat().st_size,
            **format_specific_meta
        }
        
        return {"text": text, "metadata": metadata}
    
    def _load_pdf(self, path: Path) -> tuple:
        doc = pymupdf.open(path)
        text = "\n\n".join([page.get_text() for page in doc])
        meta = {
            "page_count": len(doc),
            "title": doc.metadata.get("title", "")
        }
        doc.close()
        return text, meta
    
    def _load_docx(self, path: Path) -> tuple:
        doc = docx.Document(path)
        text = "\n\n".join([para.text for para in doc.paragraphs])
        meta = {
            "paragraph_count": len(doc.paragraphs)
        }
        return text, meta
    
    def _load_txt(self, path: Path) -> tuple:
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read()
        return text, {}
    
    def _load_html(self, path: Path) -> tuple:
        from bs4 import BeautifulSoup
        with open(path, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # Remove script and style tags
        for script in soup(["script", "style"]):
            script.decompose()
        text = soup.get_text(separator='\n')
        return text, {"title": soup.title.string if soup.title else ""}
    
    def _load_xlsx(self, path: Path) -> tuple:
        # Load all sheets
        xlsx = pd.ExcelFile(path)
        texts = []
        for sheet_name in xlsx.sheet_names:
            df = pd.read_excel(xlsx, sheet_name=sheet_name)
            # Convert to markdown table
            texts.append(f"## Sheet: {sheet_name}\n\n")
            texts.append(df.to_markdown(index=False))
        return "\n\n".join(texts), {"sheet_count": len(xlsx.sheet_names)}

## Usage
loader = UniversalDocumentLoader()

for file in ["report.pdf", "notes.docx", "data.xlsx"]:
    result = loader.load(file)
    print(f"Loaded {result['metadata']['filename']}: {len(result['text'])} chars")

Design benefits:

  • 🎯 Single interface for multiple formats
  • 🎯 Automatic format detection via MIME types
  • 🎯 Extensible: Add new handlers easily
  • 🎯 Consistent output: All handlers return same structure

Example 4: Semantic Chunking with Embeddings

Advanced chunking that groups semantically related sentences:

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from typing import List
import nltk

nltk.download('punkt', quiet=True)

class SemanticChunker:
    def __init__(self, model_name='all-MiniLM-L6-v2', similarity_threshold=0.5):
        self.model = SentenceTransformer(model_name)
        self.threshold = similarity_threshold
    
    def chunk_text(self, text: str, max_chunk_size: int = 500) -> List[str]:
        """Split text into semantically coherent chunks"""
        # Split into sentences
        sentences = nltk.sent_tokenize(text)
        
        # Compute embeddings for all sentences
        embeddings = self.model.encode(sentences)
        
        # Build chunks by grouping similar adjacent sentences
        chunks = []
        current_chunk = [sentences[0]]
        current_embedding = embeddings[0]
        
        for i in range(1, len(sentences)):
            sentence = sentences[i]
            sentence_embedding = embeddings[i]
            
            # Calculate similarity to current chunk
            similarity = cosine_similarity(
                [current_embedding],
                [sentence_embedding]
            )[0][0]
            
            # Check if chunk would exceed size
            current_text = " ".join(current_chunk)
            would_exceed = len(current_text) + len(sentence) > max_chunk_size
            
            if similarity >= self.threshold and not would_exceed:
                # Add to current chunk
                current_chunk.append(sentence)
                # Update chunk embedding (running average)
                current_embedding = np.mean([current_embedding, sentence_embedding], axis=0)
            else:
                # Start new chunk
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentence]
                current_embedding = sentence_embedding
        
        # Add final chunk
        if current_chunk:
            chunks.append(" ".join(current_chunk))
        
        return chunks

## Usage
chunker = SemanticChunker(similarity_threshold=0.6)

text = """
Artificial intelligence is transforming healthcare. AI systems can analyze medical images.
Doctors use AI to detect diseases early. Machine learning models predict patient outcomes.

Climate change poses serious risks. Rising temperatures affect ecosystems.
Extreme weather events are becoming more frequent. Scientists urge immediate action.
"""

chunks = chunker.chunk_text(text, max_chunk_size=200)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}\n")

Output:

Chunk 1: Artificial intelligence is transforming healthcare. AI systems can analyze medical images. Doctors use AI to detect diseases early. Machine learning models predict patient outcomes.

Chunk 2: Climate change poses serious risks. Rising temperatures affect ecosystems. Extreme weather events are becoming more frequent. Scientists urge immediate action.

Notice: Sentences about AI/healthcare stayed together, separate from climate change sentences, because semantic chunking detected the topic shift.

Common Mistakes to Avoid ⚠️

1. Ignoring Character Encoding

Problem: Assuming all text is UTF-8 leads to garbled output.

## ❌ WRONG: Assumes UTF-8
with open('document.txt', 'r') as f:
    text = f.read()  # UnicodeDecodeError!

## ✅ RIGHT: Detect encoding first
import chardet

with open('document.txt', 'rb') as f:
    raw_data = f.read()
    detected = chardet.detect(raw_data)
    encoding = detected['encoding']

with open('document.txt', 'r', encoding=encoding) as f:
    text = f.read()

2. Losing Document Structure

Problem: Treating all text as a flat blob loses hierarchical information.

## ❌ WRONG: All sections mixed together
text = " ".join([p.text for p in doc.paragraphs])

## ✅ RIGHT: Preserve headings and structure
structured_text = []
for para in doc.paragraphs:
    if para.style.name.startswith('Heading'):
        level = para.style.name[-1]  # Heading 1, 2, 3...
        structured_text.append(f"\n{'#' * int(level)} {para.text}\n")
    else:
        structured_text.append(para.text)

3. Over-Chunking or Under-Chunking

Problem: Chunks too small lack context; too large hurt retrieval precision.

## ❌ WRONG: Chunks of 50 tokens (too small)
## Each chunk: "The model uses attention. It processes sequences."
## Missing context!

## ❌ WRONG: Chunks of 5000 tokens (too large)
## Returns entire chapter when user asks specific question

## ✅ RIGHT: 200-512 tokens, with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)

4. Neglecting Metadata

Problem: Storing only text makes filtering and debugging impossible.

## ❌ WRONG: Just text
chunks = ["text1", "text2", "text3"]

## ✅ RIGHT: Rich metadata
chunks = [
    {
        "text": "text1",
        "source": "doc123.pdf",
        "page": 5,
        "chunk_id": "abc123",
        "doc_type": "technical_manual",
        "created_at": "2024-01-15"
    },
    # ...
]

5. Not Validating Extraction Quality

Problem: Processing continues with corrupted/empty text.

## ❌ WRONG: No validation
text = extract_text(file)
chunks = split(text)  # What if text is empty or garbled?

## ✅ RIGHT: Validate before proceeding
text = extract_text(file)

if len(text) < 50:
    raise ValueError(f"Extracted text too short: {len(text)} chars")

if text.count('\ufffd') > len(text) * 0.01:  # >1% Unicode replacement chars
    raise ValueError("Text contains too many invalid characters")

## Check language if expected
if detect_language(text) != 'en':
    logger.warning(f"Expected English, got {detect_language(text)}")

6. Hardcoding File Paths

Problem: Non-portable code that breaks across environments.

## ❌ WRONG: Absolute Windows path
file = "C:\\Users\\John\\Documents\\data.pdf"

## ✅ RIGHT: Relative paths or environment variables
from pathlib import Path
import os

data_dir = Path(os.getenv('DATA_DIR', './data'))
file = data_dir / 'documents' / 'data.pdf'

7. Forgetting Error Recovery

Problem: One bad document crashes entire pipeline.

## ❌ WRONG: No error handling
for file in files:
    process(file)  # Crash on first error!

## ✅ RIGHT: Graceful error handling
for file in files:
    try:
        process(file)
    except Exception as e:
        logger.error(f"Failed to process {file}: {e}")
        failed_files.append((file, str(e)))
        continue  # Process remaining files

## Report failures at end
if failed_files:
    print(f"Failed to process {len(failed_files)} files")

Key Takeaways 🎯

📋 Document Processing Quick Reference

| Concept          | Key Points                                                                                                                          |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| Format Handling  | Use specialized libraries per format (PyMuPDF for PDF, python-docx for DOCX). Detect MIME types automatically. Always validate extraction quality. |
| Text Cleaning    | Remove boilerplate, normalize whitespace and unicode. Preserve domain-specific terms. Balance cleaning vs. information loss.        |
| Chunking         | Target 200-512 tokens per chunk. Use recursive splitting for natural boundaries. Add 50-100 token overlap. Validate chunk completeness. |
| Metadata         | Extract source, temporal, authorship data. Enrich with NER, topics, categories. Enable filtering and traceability.                  |
| Special Content  | Linearize tables with headers. Preserve code formatting. OCR images. Caption all visual elements.                                   |
| Quality Control  | Validate encoding, length, language. Log errors and failures. Monitor processing metrics. Handle errors gracefully.                 |

Golden Rules:

  • πŸ† Always preserve context: Chunks must be understandable independently
  • πŸ† Metadata is not optional: It enables filtering, debugging, and audit trails
  • πŸ† Validate early and often: Catch extraction errors before they propagate
  • πŸ† Design for failure: One bad document shouldn't break your entire pipeline
  • πŸ† Test on real data: Sample documents reveal edge cases synthetic data misses

Document Processing Pipeline Architecture

END-TO-END DOCUMENT PROCESSING PIPELINE

┌────────────────────────────────────────────────────────────┐
│                      INPUT SOURCES                         │
│   📁 File System   ☁️ Cloud Storage   🌐 APIs   📧 Email   │
└────────────────────────────┬───────────────────────────────┘
                             ↓
┌────────────────────────────────────────────────────────────┐
│                    1. INGESTION LAYER                      │
│  • Format detection (MIME types)                           │
│  • File validation (size, readability)                     │
│  • Queue management (async processing)                     │
└────────────────────────────┬───────────────────────────────┘
                             ↓
┌────────────────────────────────────────────────────────────┐
│                   2. EXTRACTION LAYER                      │
│  • PDF → PyMuPDF / pdfplumber                              │
│  • DOCX → python-docx                                      │
│  • HTML → trafilatura / BeautifulSoup                      │
│  • Images → Tesseract OCR                                  │
└────────────────────────────┬───────────────────────────────┘
                             ↓
┌────────────────────────────────────────────────────────────┐
│                    3. CLEANING LAYER                       │
│  • Unicode normalization (NFKC)                            │
│  • Boilerplate removal (headers/footers)                   │
│  • Whitespace normalization                                │
│  • Encoding error fixing                                   │
└────────────────────────────┬───────────────────────────────┘
                             ↓
┌────────────────────────────────────────────────────────────┐
│                    4. CHUNKING LAYER                       │
│  • Strategy selection (fixed/recursive/semantic)           │
│  • Size validation (200-512 tokens)                        │
│  • Overlap application (50-100 tokens)                     │
│  • Boundary detection (sentences/paragraphs)               │
└────────────────────────────┬───────────────────────────────┘
                             ↓
┌────────────────────────────────────────────────────────────┐
│                   5. ENRICHMENT LAYER                      │
│  • Metadata extraction (dates, authors)                    │
│  • Language detection                                      │
│  • Named entity recognition (NER)                          │
│  • Topic classification                                    │
│  • Unique ID generation                                    │
└────────────────────────────┬───────────────────────────────┘
                             ↓
┌────────────────────────────────────────────────────────────┐
│                   6. VALIDATION LAYER                      │
│  • Length checks (min/max)                                 │
│  • Language validation                                     │
│  • Character diversity checks                              │
│  • Metadata completeness                                   │
└────────────────────────────┬───────────────────────────────┘
                             ↓
                       ┌─────┴─────┐
                       ↓           ↓
                 ✅ VALID      ❌ INVALID
                       │           │
                       │           ↓
                       │    📋 Error Queue
                       │    (Manual Review)
                       ↓
┌────────────────────────────────────────────────────────────┐
│                     OUTPUT STORAGE                         │
│   💾 Vector DB   📊 Document Store   🗄️ Metadata DB        │
└────────────────────────────────────────────────────────────┘

💡 Pro tip: Implement this pipeline with message queues (e.g., Kafka, RabbitMQ) for scalability. Each layer can scale independently based on load.

Performance Optimization Tips 🚀

  1. Parallel processing: Process multiple documents concurrently
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    results = executor.map(process_document, file_list)
  2. Batch operations: Group similar operations (e.g., embed multiple chunks at once)
## ✅ GOOD: Batch embedding
embeddings = model.encode(chunk_texts, batch_size=32)

## ❌ SLOW: Individual embedding
embeddings = [model.encode(text) for text in chunk_texts]
  3. Caching: Store extraction results to avoid reprocessing
import hashlib
import pickle

def get_cache_key(file_path):
    with open(file_path, 'rb') as f:
        file_hash = hashlib.md5(f.read()).hexdigest()
    return f"extracted_{file_hash}"

## Check cache before processing
if cache_key in cache:
    return cache[cache_key]
  4. Stream large files: Don't load entire file into memory
## ✅ GOOD: Stream processing
for page_num in range(len(pdf_doc)):
    page = pdf_doc[page_num]
    process_page(page)
    # Page released from memory

## ❌ BAD: Load all at once
all_pages = [page.get_text() for page in pdf_doc]  # Memory spike!

Further Study 📚

Deepen your document processing expertise with these resources:

  1. LangChain Documentation - Document Loaders: Comprehensive guide to document loaders and text splitters with code examples
    https://python.langchain.com/docs/modules/data_connection/document_loaders/

  2. Unstructured.io Documentation: Open-source library for preprocessing diverse document types (PDFs, images, HTML) for LLM applications
    https://unstructured-io.github.io/unstructured/

  3. Apache Tika: Powerful toolkit for detecting and extracting metadata/text from 1000+ file types, with Python bindings
    https://tika.apache.org/


Next Steps: Now that you've mastered document processing, the next node in the roadmap covers Embedding Generation: transforming your processed text chunks into vector representations for semantic search. You'll learn about embedding models, dimensionality considerations, and batch processing strategies! 🎯