Document Processing
Handle diverse data sources: PDFs, databases, APIs, logs with appropriate parsers and extractors.
Document Processing in AI Search & RAG Systems
Master document processing techniques with free flashcards and practical examples that prepare you for building production-ready AI search systems. This lesson covers text extraction, document parsing, chunking strategies, and metadata enrichment: essential skills for creating effective Retrieval-Augmented Generation (RAG) applications.
Welcome to Document Processing
Document processing is the critical first stage of any AI search or RAG pipeline. Before you can search, retrieve, or generate answers from your data, you must transform raw documents into a structured, searchable format. This lesson will guide you through the entire document processing workflow, from ingesting PDFs and Word files to preparing clean, semantically meaningful chunks ready for embedding and indexing.
Whether you're building a knowledge base search engine, a chatbot that answers from company documents, or a research assistant that synthesizes information from multiple sources, robust document processing determines the quality of your entire system. Poor processing leads to garbage-in-garbage-out scenarios where your AI provides inaccurate or incomplete answers.
💡 Key insight: Document processing typically consumes 40-60% of initial RAG development time, but investing effort here pays massive dividends in retrieval accuracy and user satisfaction.
Core Concepts in Document Processing
1. Document Ingestion & Format Handling
Document ingestion is the process of loading files from various sources (local storage, cloud buckets, databases, APIs) and extracting their content. Different formats require specialized parsing approaches:
| Format | Challenges | Common Tools |
|---|---|---|
| PDF | Complex layouts, scanned images, tables | PyPDF2, pdfplumber, PyMuPDF |
| DOCX | Embedded objects, formatting preservation | python-docx, mammoth |
| HTML | Tag noise, JavaScript content, ads | BeautifulSoup, Trafilatura |
| Spreadsheets | Multiple sheets, formulas, merged cells | openpyxl, pandas |
| Images | Text extraction from visual content | Tesseract OCR, AWS Textract |
Text Extraction involves pulling readable text from these formats while handling:
- Encoding issues: UTF-8, Latin-1, and other character sets
- Layout preservation: Maintaining paragraph structure, headings, lists
- Special elements: Tables, captions, footnotes, headers/footers
- OCR requirements: Scanned PDFs need Optical Character Recognition
💡 Pro tip: Always validate extracted text quality on a sample before processing your entire corpus. A 5-document test can reveal systematic extraction errors.
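A minimal sketch of such a spot check, using PyMuPDF as in Example 1 further below; the sample file names and the signals printed here are illustrative starting points, not a fixed recipe.

# Spot-check extraction quality on a handful of documents before running the full corpus
import pymupdf

sample_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf", "doc5.pdf"]  # hypothetical sample

for path in sample_files:
    doc = pymupdf.open(path)
    page_count = len(doc)
    text = "\n".join(page.get_text("text") for page in doc)
    doc.close()
    # Rough signals of systematic extraction problems
    replacement_ratio = text.count("\ufffd") / max(len(text), 1)  # U+FFFD hints at encoding issues
    chars_per_page = len(text) / max(page_count, 1)
    print(f"{path}: {page_count} pages, {chars_per_page:.0f} chars/page, "
          f"{replacement_ratio:.2%} replacement characters")
    print(text[:300], "...\n")  # eyeball the opening text for layout or OCR damage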
2. Text Cleaning & Normalization
Raw extracted text contains noise that degrades search quality. Text cleaning removes or standardizes:
Common cleaning operations:
- Remove boilerplate content: Headers, footers, page numbers, copyright notices
- Strip formatting artifacts: Extra whitespace, line breaks, control characters
- Normalize unicode: Convert similar characters (smart quotes → straight quotes)
- Handle special symbols: Mathematical notation, currency symbols, emojis
- Fix encoding errors: Mojibake (garbled text from encoding mismatches)
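For the mojibake case in particular, the ftfy library can often repair the damage automatically. A minimal sketch, assuming ftfy is installed (pip install ftfy):

import ftfy

# Typical mojibake produced by decoding UTF-8 bytes as Latin-1
garbled = "Ã©tÃ© means summer, and schÃ¶n means beautiful"
print(ftfy.fix_text(garbled))  # -> "été means summer, and schön means beautiful"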
Normalization techniques:
- Case folding: Convert to lowercase for case-insensitive matching
- Punctuation handling: Remove or standardize based on use case
- Whitespace normalization: Multiple spaces/tabs → single space
- Line break standardization: \r\n, \n, \r → consistent format
⚠️ Warning: Over-aggressive cleaning can remove important context! Preserve domain-specific terminology, code snippets, and structured data.
# Example: Basic text cleaning pipeline
import re
import unicodedata

def clean_text(text):
    # Normalize unicode (NFKC form)
    text = unicodedata.normalize('NFKC', text)
    # Remove control characters except newlines/tabs
    text = ''.join(
        ch for ch in text
        if ch in ('\n', '\t') or not unicodedata.category(ch).startswith('C')
    )
    # Normalize whitespace
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Remove common boilerplate patterns
    text = re.sub(r'Page \d+ of \d+', '', text)
    return text.strip()
3. Document Chunking Strategies
Chunking divides documents into smaller segments for embedding and retrieval. This is crucial because:
- Embedding models have token limits (e.g., 512-8192 tokens)
- Smaller chunks provide more precise retrieval (return relevant paragraphs, not entire documents)
- Chunks must be semantically complete (contain enough context to be understood independently)
Chunking Strategies Comparison
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed-size (e.g., 512 tokens) | Simple, predictable, fast | Breaks mid-sentence, loses context | Large homogeneous corpora |
| Sentence-based | Natural boundaries, coherent | Variable sizes, some too short | News articles, blogs |
| Paragraph-based | Semantic completeness | High size variance | Well-structured documents |
| Recursive | Respects structure, balanced sizes | More complex logic | Technical docs, books |
| Semantic | Topic coherence, best retrieval | Computationally expensive | Research papers, legal docs |
Overlap strategy: Include overlap between chunks (e.g., 50-100 tokens) to preserve context across boundaries. This prevents information loss when key concepts span chunk edges.
CHUNKING WITH OVERLAP

┌─────────────────────────────┐
│ Chunk 1 (500 tokens)        │
│ "...machine learning is...  │
│  ...neural networks can..." │
└─────────────┬───────────────┘
              │ Overlap (50 tokens)
              │ "neural networks can"
┌─────────────┴───────────────┐
│ Chunk 2 (500 tokens)        │
│ "neural networks can...     │
│  ...transformers use..."    │
└─────────────┬───────────────┘
              │ Overlap (50 tokens)
┌─────────────┴───────────────┐
│ Chunk 3 (500 tokens)        │
└─────────────────────────────┘
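Putting the diagram above into code: a minimal sliding-window sketch that measures chunk size in tokens via tiktoken (an assumption; any tokenizer, or plain character counts, works the same way). The function name and defaults are illustrative.

import tiktoken

def chunk_with_overlap(text, chunk_size=500, overlap=50, encoding_name="cl100k_base"):
    """Fixed-size chunking: each chunk shares `overlap` tokens with the previous one."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # the final window already reached the end of the text
    return chunks

chunks = chunk_with_overlap("Neural networks can learn representations. " * 200)
print(f"{len(chunks)} chunks produced")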
Recursive chunking attempts to split on natural boundaries in order:
- Document sections (headings like `#`, `##`, `###`)
- Paragraphs (double line breaks)
- Sentences (periods, question marks)
- Words (if still too large)
This maintains semantic coherence better than naive fixed-size splitting.
💡 Chunking rule of thumb: Aim for 200-512 tokens per chunk for most RAG applications. Smaller chunks (100-200) work better for precise fact retrieval; larger chunks (512-1000) work better for context-heavy generation.
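Because this rule of thumb is stated in tokens rather than characters, it is worth measuring chunk sizes with a tokenizer. A minimal sketch, using tiktoken as a stand-in for your embedding model's own tokenizer (an assumption):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; prefer your embedding model's tokenizer
chunks = ["First example chunk of text...", "Second example chunk of text..."]  # replace with real chunks
token_counts = [len(enc.encode(chunk)) for chunk in chunks]
print(f"min={min(token_counts)}, max={max(token_counts)}, "
      f"avg={sum(token_counts) / len(token_counts):.0f} tokens per chunk")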
4. Metadata Extraction & Enrichment
Metadata provides critical context for filtering, ranking, and organizing search results. Beyond the text content itself, extract and store:
Document-level metadata:
- Source information: File path, URL, database ID
- Temporal data: Creation date, modification date, publication date
- Authorship: Author names, organization, department
- Document type: Report, email, presentation, contract
- Version/revision: Track document evolution
- Access control: Permissions, security classifications
Content metadata:
- Language: Detected language code (en, es, fr, etc.)
- Topic/category: Manual or auto-classified subject matter
- Keywords/tags: Extracted or assigned descriptors
- Entities: People, organizations, locations, dates (NER)
- Statistics: Word count, reading level, sentiment score
Chunk-level metadata:
- Position: Chunk index, page number, section heading
- Parent document ID: Link back to source document
- Structural role: Introduction, methodology, conclusion, etc.
- Embedding metadata: Model used, embedding timestamp
METADATA ENRICHMENT PIPELINE

Raw Document
      │
      ▼
┌───────────────────┐
│ Extract Basic     │
│ Metadata          │ ← Filename, dates, format
└─────────┬─────────┘
          ▼
┌───────────────────┐
│ Language          │
│ Detection         │ ← langdetect, fastText
└─────────┬─────────┘
          ▼
┌───────────────────┐
│ Named Entity      │
│ Recognition       │ ← spaCy, Flair, LLMs
└─────────┬─────────┘
          ▼
┌───────────────────┐
│ Topic/Category    │
│ Classification    │ ← Zero-shot or trained
└─────────┬─────────┘
          ▼
┌───────────────────┐
│ Custom Business   │
│ Logic             │ ← Domain rules
└─────────┬─────────┘
          ▼
Enriched Document + Metadata
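A minimal sketch of the language-detection and NER stages of this pipeline, assuming the langdetect package and spaCy with its small English model (en_core_web_sm) are installed; the enrich helper and its field names are illustrative.

from langdetect import detect
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded beforehand

def enrich(text: str, base_metadata: dict) -> dict:
    """Add language and named-entity metadata to an existing metadata dict (illustrative fields)."""
    doc = nlp(text[:5000])  # cap length to keep NER fast on long documents
    return {
        **base_metadata,
        "language": detect(text),
        "entities": sorted({(ent.text, ent.label_) for ent in doc.ents}),
    }

meta = enrich("OpenAI released GPT-4 in March 2023 in San Francisco.", {"source": "notes.txt"})
print(meta["language"], meta["entities"])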
Metadata enables powerful filtering:
# Query: "machine learning papers from 2023"
results = vector_db.search(
    query_embedding=embed("machine learning"),
    filter={
        "doc_type": "research_paper",
        "year": 2023
    },
    top_k=10
)
5. Handling Special Content Types
Tables & Structured Data
Tables pose unique challenges because their meaning depends on spatial relationships (rows, columns, headers):
Strategies:
- Linearization: Convert to text: "Row 1: Product=Laptop, Price=$999, Stock=45" (see the sketch after this list)
- Markdown tables: Preserve structure: `| Product | Price | Stock |`
- Separate indexing: Treat tables as distinct searchable entities
- Caption inclusion: Always include table captions/titles for context
- Column header repetition: Repeat headers for each row to maintain context
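As referenced above, a minimal linearization sketch using pandas (an assumption; any row iteration works), repeating the column headers so each row reads as a self-contained statement:

import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Monitor"],
    "Price": ["$999", "$249"],
    "Stock": [45, 120],
})

# One text line per row, with column headers repeated so each line stands alone
linearized = [
    f"Row {i + 1}: " + ", ".join(f"{col}={row[col]}" for col in df.columns)
    for i, row in df.iterrows()
]
print("\n".join(linearized))
# Row 1: Product=Laptop, Price=$999, Stock=45
# Row 2: Product=Monitor, Price=$249, Stock=120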
Code Blocks
Source code in documentation requires special handling:
- Preserve formatting: Maintain indentation, line breaks
- Language detection: Identify programming language for syntax-aware processing
- Comment extraction: Separate code from explanatory comments
- Function-level chunking: Split on function/class boundaries (see the sketch after this list)
- Syntax validation: Verify code blocks are complete and valid
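As referenced above, a minimal sketch of function-level chunking for Python sources using the standard-library ast module (other languages would need their own parser, e.g. tree-sitter); requires Python 3.8+ for end_lineno.

import ast
from typing import List

def chunk_python_source(source: str) -> List[str]:
    """Split Python code into one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

code = '''
def add(a, b):
    """Return the sum."""
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
for chunk in chunk_python_source(code):
    print(chunk, "\n---")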
Mathematical Notation
LaTeX and equations:
- Convert LaTeX to readable format: `$\frac{a}{b}$` → "a divided by b"
- Preserve for technical users who understand notation
- Include text descriptions alongside equations
- Extract variable definitions and meanings
Images & Diagrams
Visual content strategies:
- OCR for embedded text: Extract text from diagrams, screenshots (see the sketch after this list)
- Image captioning: Use vision-language models (CLIP, BLIP) to generate descriptions
- Alt-text: Use existing alt-text from HTML, DOCX accessibility tags
- Reference in text: Link image descriptions to surrounding text context
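As referenced above, a minimal OCR sketch assuming pytesseract and Pillow are installed and the Tesseract binary is available on the system; the file name is hypothetical.

from PIL import Image
import pytesseract

def ocr_image(path: str, lang: str = "eng") -> str:
    """Extract embedded text from a diagram or screenshot via Tesseract OCR."""
    image = Image.open(path)
    return pytesseract.image_to_string(image, lang=lang)

text = ocr_image("architecture_diagram.png")  # hypothetical file
print(text[:200])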
6. Quality Validation & Error Handling
Document processing failures are common. Implement robust validation:
Pre-processing validation:
- ✓ File exists and is readable
- ✓ Format matches expected type (not renamed .doc as .pdf)
- ✓ File size is reasonable (not corrupted or truncated)
- ✓ Encoding is detectable
Post-extraction validation:
- ✓ Text length > minimum threshold (not empty)
- ✓ Character diversity (not all special characters)
- ✓ Language matches expected (for multi-lingual systems)
- ✓ No excessive repetition (OCR artifacts)
Chunk validation:
- ✓ Chunk size within bounds
- ✓ Chunks contain complete sentences
- ✓ No excessive overlap
- ✓ Metadata fields populated
ERROR HANDLING WORKFLOW

Document → Process
              │
        ┌─────┴─────┐
        ▼           ▼
     Success      Error
        │           │
        │     ┌─────┴─────┐
        │     ▼           ▼
        │   Retry     Log & Skip
        │     │           │
        │     ▼           ▼
        │  Success?    Manual
        │     │        Review
        │     │        Queue
        └─────┘
           │
           ▼
      Store Result
Logging best practices:
- Record document ID, processing timestamp, errors encountered
- Track processing time (identify bottlenecks)
- Store sample failures for debugging
- Monitor success rate metrics
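A minimal sketch of this workflow, with one retry before a document is routed to a manual-review queue; process_document is assumed to be your own extraction/chunking entry point, and the logging setup is illustrative.

import logging
import time

logger = logging.getLogger("doc_pipeline")

def process_with_retry(file_path, process_document, max_retries=1):
    """Try to process a document, retrying on failure, then log and hand off for manual review."""
    for attempt in range(max_retries + 1):
        start = time.perf_counter()
        try:
            result = process_document(file_path)
            logger.info("processed %s in %.2fs", file_path, time.perf_counter() - start)
            return result, None
        except Exception as exc:
            logger.error("attempt %d failed for %s: %s", attempt + 1, file_path, exc)
            last_error = exc
    return None, (file_path, str(last_error))  # caller appends this to a manual-review queue

# Usage sketch: collect successes and route failures to a review queue
# results, review_queue = [], []
# for path in files:
#     result, failure = process_with_retry(path, process_document)
#     results.append(result) if failure is None else review_queue.append(failure)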
Practical Examples 🛠️
Example 1: Building a PDF Processing Pipeline
Let's build a complete pipeline for processing technical documentation PDFs:
import pymupdf # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
import hashlib
from datetime import datetime
class PDFProcessor:
def __init__(self, chunk_size=500, chunk_overlap=50):
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n\n", "\n", ". ", " ", ""]
)
def extract_text_with_metadata(self, pdf_path):
"""Extract text and metadata from PDF"""
doc = pymupdf.open(pdf_path)
# Extract document metadata
doc_metadata = {
"source": pdf_path,
"title": doc.metadata.get("title", ""),
"author": doc.metadata.get("author", ""),
"page_count": len(doc),
"creation_date": doc.metadata.get("creationDate", ""),
"processed_at": datetime.utcnow().isoformat()
}
# Extract text page by page
pages = []
for page_num, page in enumerate(doc, start=1):
text = page.get_text("text")
# Clean extracted text
text = self.clean_text(text)
pages.append({
"page_number": page_num,
"text": text,
"char_count": len(text)
})
doc.close()
return doc_metadata, pages
def clean_text(self, text):
"""Clean and normalize extracted text"""
import re
# Collapse runs of spaces/tabs but keep newlines so the
# line-based header/footer heuristic below still works
text = re.sub(r'[ \t]+', ' ', text)
# Remove page headers/footers (simple heuristic)
lines = text.split('\n')
cleaned_lines = []
for line in lines:
# Skip very short lines (likely headers/footers)
if len(line.strip()) > 10:
cleaned_lines.append(line)
return '\n'.join(cleaned_lines).strip()
def chunk_document(self, pages, doc_metadata):
"""Split document into chunks with metadata"""
# Combine all pages
full_text = "\n\n".join([p["text"] for p in pages])
# Create chunks
chunks = self.splitter.split_text(full_text)
# Enrich each chunk with metadata
enriched_chunks = []
for idx, chunk_text in enumerate(chunks):
chunk_id = hashlib.md5(
f"{doc_metadata['source']}_{idx}".encode()
).hexdigest()[:16]
enriched_chunks.append({
"chunk_id": chunk_id,
"text": chunk_text,
"chunk_index": idx,
"doc_metadata": doc_metadata,
"char_count": len(chunk_text),
"word_count": len(chunk_text.split())
})
return enriched_chunks
# Usage
processor = PDFProcessor(chunk_size=500, chunk_overlap=50)
doc_meta, pages = processor.extract_text_with_metadata("technical_manual.pdf")
chunks = processor.chunk_document(pages, doc_meta)
print(f"Processed {doc_meta['page_count']} pages into {len(chunks)} chunks")
print(f"First chunk: {chunks[0]['text'][:200]}...")
Key features:
- ✅ Extracts document-level and page-level metadata
- ✅ Cleans text while preserving structure
- ✅ Uses recursive splitting for natural boundaries
- ✅ Generates unique chunk IDs for tracking
- ✅ Maintains parent document reference
Example 2: Web Page Scraping with Noise Removal
Web pages contain navigation, ads, and boilerplate. Here's how to extract clean article text:
import trafilatura
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
class WebPageProcessor:
def extract_article(self, url):
"""Extract main article content from web page"""
# Download page
response = requests.get(url, timeout=10)
html = response.text
# Use trafilatura for main content extraction
# (removes nav, ads, footers automatically)
text = trafilatura.extract(
html,
include_comments=False,
include_tables=True,
output_format="txt"
)
# Extract metadata using BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
metadata = {
"url": url,
"domain": urlparse(url).netloc,
"title": self._get_title(soup),
"description": self._get_meta_description(soup),
"author": self._get_author(soup),
"publish_date": self._get_publish_date(soup),
"lang": self._get_language(soup)
}
return text, metadata
def _get_title(self, soup):
# Try multiple methods
if soup.title:
return soup.title.string
og_title = soup.find("meta", property="og:title")
if og_title:
return og_title.get("content")
return ""
def _get_meta_description(self, soup):
desc = soup.find("meta", {"name": "description"})
if desc:
return desc.get("content")
og_desc = soup.find("meta", property="og:description")
if og_desc:
return og_desc.get("content")
return ""
def _get_author(self, soup):
# Common author meta tags
author = soup.find("meta", {"name": "author"})
if author:
return author.get("content")
return ""
def _get_publish_date(self, soup):
# Look for common date patterns
date_meta = soup.find("meta", property="article:published_time")
if date_meta:
return date_meta.get("content")
return ""
def _get_language(self, soup):
html_tag = soup.find("html")
if html_tag and html_tag.get("lang"):
return html_tag.get("lang")
return "en" # default
# Usage
processor = WebPageProcessor()
text, metadata = processor.extract_article("https://example.com/article")
print(f"Title: {metadata['title']}")
print(f"Author: {metadata['author']}")
print(f"Content length: {len(text)} chars")
Why this works:
- trafilatura applies dedicated main-content extraction heuristics (not just CSS selectors)
- Handles diverse site layouts without custom rules
- Extracts structured metadata from common meta tags
- Falls back gracefully when metadata is missing
Example 3: Multi-Format Document Loader
A unified loader that handles multiple document types:
from pathlib import Path
import mimetypes
from typing import Dict, List
import docx
import pandas as pd
import pymupdf
class UniversalDocumentLoader:
def __init__(self):
self.handlers = {
'application/pdf': self._load_pdf,
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': self._load_docx,
'text/plain': self._load_txt,
'text/html': self._load_html,
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': self._load_xlsx
}
def load(self, file_path: str) -> Dict:
"""Load document and return text + metadata"""
path = Path(file_path)
# Detect MIME type
mime_type, _ = mimetypes.guess_type(file_path)
if mime_type not in self.handlers:
raise ValueError(f"Unsupported file type: {mime_type}")
# Call appropriate handler
handler = self.handlers[mime_type]
text, format_specific_meta = handler(path)
# Add common metadata
metadata = {
"source": str(path),
"filename": path.name,
"format": mime_type,
"size_bytes": path.stat().st_size,
**format_specific_meta
}
return {"text": text, "metadata": metadata}
def _load_pdf(self, path: Path) -> tuple:
doc = pymupdf.open(path)
text = "\n\n".join([page.get_text() for page in doc])
meta = {
"page_count": len(doc),
"title": doc.metadata.get("title", "")
}
doc.close()
return text, meta
def _load_docx(self, path: Path) -> tuple:
doc = docx.Document(path)
text = "\n\n".join([para.text for para in doc.paragraphs])
meta = {
"paragraph_count": len(doc.paragraphs)
}
return text, meta
def _load_txt(self, path: Path) -> tuple:
with open(path, 'r', encoding='utf-8') as f:
text = f.read()
return text, {}
def _load_html(self, path: Path) -> tuple:
from bs4 import BeautifulSoup
with open(path, 'r', encoding='utf-8') as f:
soup = BeautifulSoup(f.read(), 'html.parser')
# Remove script and style tags
for script in soup(["script", "style"]):
script.decompose()
text = soup.get_text(separator='\n')
return text, {"title": soup.title.string if soup.title else ""}
def _load_xlsx(self, path: Path) -> tuple:
# Load all sheets
xlsx = pd.ExcelFile(path)
texts = []
for sheet_name in xlsx.sheet_names:
df = pd.read_excel(xlsx, sheet_name=sheet_name)
# Convert to markdown table
texts.append(f"## Sheet: {sheet_name}\n\n")
texts.append(df.to_markdown(index=False))
return "\n\n".join(texts), {"sheet_count": len(xlsx.sheet_names)}
# Usage
loader = UniversalDocumentLoader()
for file in ["report.pdf", "notes.docx", "data.xlsx"]:
result = loader.load(file)
print(f"Loaded {result['metadata']['filename']}: {len(result['text'])} chars")
Design benefits:
- 🎯 Single interface for multiple formats
- 🎯 Automatic format detection via MIME types
- 🎯 Extensible: Add new handlers easily
- 🎯 Consistent output: All handlers return same structure
Example 4: Semantic Chunking with Embeddings
Advanced chunking that groups semantically related sentences:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from typing import List
import nltk
nltk.download('punkt', quiet=True)
class SemanticChunker:
def __init__(self, model_name='all-MiniLM-L6-v2', similarity_threshold=0.5):
self.model = SentenceTransformer(model_name)
self.threshold = similarity_threshold
def chunk_text(self, text: str, max_chunk_size: int = 500) -> List[str]:
"""Split text into semantically coherent chunks"""
# Split into sentences
sentences = nltk.sent_tokenize(text)
# Compute embeddings for all sentences
embeddings = self.model.encode(sentences)
# Build chunks by grouping similar adjacent sentences
chunks = []
current_chunk = [sentences[0]]
current_embedding = embeddings[0]
for i in range(1, len(sentences)):
sentence = sentences[i]
sentence_embedding = embeddings[i]
# Calculate similarity to current chunk
similarity = cosine_similarity(
[current_embedding],
[sentence_embedding]
)[0][0]
# Check if chunk would exceed size
current_text = " ".join(current_chunk)
would_exceed = len(current_text) + len(sentence) > max_chunk_size
if similarity >= self.threshold and not would_exceed:
# Add to current chunk
current_chunk.append(sentence)
# Update chunk embedding (running average)
current_embedding = np.mean([current_embedding, sentence_embedding], axis=0)
else:
# Start new chunk
chunks.append(" ".join(current_chunk))
current_chunk = [sentence]
current_embedding = sentence_embedding
# Add final chunk
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
# Usage
chunker = SemanticChunker(similarity_threshold=0.6)
text = """
Artificial intelligence is transforming healthcare. AI systems can analyze medical images.
Doctors use AI to detect diseases early. Machine learning models predict patient outcomes.
Climate change poses serious risks. Rising temperatures affect ecosystems.
Extreme weather events are becoming more frequent. Scientists urge immediate action.
"""
chunks = chunker.chunk_text(text, max_chunk_size=200)
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {chunk}\n")
Output:
Chunk 1: Artificial intelligence is transforming healthcare. AI systems can analyze medical images. Doctors use AI to detect diseases early. Machine learning models predict patient outcomes.
Chunk 2: Climate change poses serious risks. Rising temperatures affect ecosystems. Extreme weather events are becoming more frequent. Scientists urge immediate action.
Notice: Sentences about AI/healthcare stayed together, separate from climate change sentences, because semantic chunking detected the topic shift.
Common Mistakes to Avoid ⚠️
1. Ignoring Character Encoding
Problem: Assuming all text is UTF-8 leads to garbled output.
# ❌ WRONG: Assumes UTF-8
with open('document.txt', 'r') as f:
    text = f.read()  # UnicodeDecodeError!

# ✅ RIGHT: Detect encoding first
import chardet

with open('document.txt', 'rb') as f:
    raw_data = f.read()
detected = chardet.detect(raw_data)
encoding = detected['encoding']

with open('document.txt', 'r', encoding=encoding) as f:
    text = f.read()
2. Losing Document Structure
Problem: Treating all text as a flat blob loses hierarchical information.
# ❌ WRONG: All sections mixed together
text = " ".join([p.text for p in doc.paragraphs])

# ✅ RIGHT: Preserve headings and structure
structured_text = []
for para in doc.paragraphs:
    if para.style.name.startswith('Heading'):
        level = para.style.name[-1]  # Heading 1, 2, 3...
        structured_text.append(f"\n{'#' * int(level)} {para.text}\n")
    else:
        structured_text.append(para.text)
3. Over-Chunking or Under-Chunking
Problem: Chunks too small lack context; too large hurt retrieval precision.
# ❌ WRONG: Chunks of 50 tokens (too small)
# Each chunk: "The model uses attention. It processes sequences."
# Missing context!

# ❌ WRONG: Chunks of 5000 tokens (too large)
# Returns entire chapter when user asks specific question

# ✅ RIGHT: 200-512 tokens, with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50
)
4. Neglecting Metadata
Problem: Storing only text makes filtering and debugging impossible.
# ❌ WRONG: Just text
chunks = ["text1", "text2", "text3"]

# ✅ RIGHT: Rich metadata
chunks = [
    {
        "text": "text1",
        "source": "doc123.pdf",
        "page": 5,
        "chunk_id": "abc123",
        "doc_type": "technical_manual",
        "created_at": "2024-01-15"
    },
    # ...
]
5. Not Validating Extraction Quality
Problem: Processing continues with corrupted/empty text.
# ❌ WRONG: No validation
text = extract_text(file)
chunks = split(text)  # What if text is empty or garbled?

# ✅ RIGHT: Validate before proceeding
text = extract_text(file)
if len(text) < 50:
    raise ValueError(f"Extracted text too short: {len(text)} chars")
if text.count('\ufffd') > len(text) * 0.01:  # >1% U+FFFD replacement chars
    raise ValueError("Text contains too many invalid characters")
# Check language if expected
if detect_language(text) != 'en':
    logger.warning(f"Expected English, got {detect_language(text)}")
6. Hardcoding File Paths
Problem: Non-portable code that breaks across environments.
# ❌ WRONG: Absolute Windows path
file = "C:\\Users\\John\\Documents\\data.pdf"

# ✅ RIGHT: Relative paths or environment variables
from pathlib import Path
import os

data_dir = Path(os.getenv('DATA_DIR', './data'))
file = data_dir / 'documents' / 'data.pdf'
7. Forgetting Error Recovery
Problem: One bad document crashes entire pipeline.
# ❌ WRONG: No error handling
for file in files:
    process(file)  # Crash on first error!

# ✅ RIGHT: Graceful error handling
for file in files:
    try:
        process(file)
    except Exception as e:
        logger.error(f"Failed to process {file}: {e}")
        failed_files.append((file, str(e)))
        continue  # Process remaining files

# Report failures at end
if failed_files:
    print(f"Failed to process {len(failed_files)} files")
Key Takeaways 🎯
Document Processing Quick Reference
| Concept | Key Points |
|---|---|
| Format Handling | Use specialized libraries per format (PyMuPDF for PDF, python-docx for DOCX). Detect MIME types automatically. Always validate extraction quality. |
| Text Cleaning | Remove boilerplate, normalize whitespace and unicode. Preserve domain-specific terms. Balance cleaning vs. information loss. |
| Chunking | Target 200-512 tokens per chunk. Use recursive splitting for natural boundaries. Add 50-100 token overlap. Validate chunk completeness. |
| Metadata | Extract source, temporal, authorship data. Enrich with NER, topics, categories. Enable filtering and traceability. |
| Special Content | Linearize tables with headers. Preserve code formatting. OCR images. Caption all visual elements. |
| Quality Control | Validate encoding, length, language. Log errors and failures. Monitor processing metrics. Handle errors gracefully. |
Golden Rules:
- Always preserve context: Chunks must be understandable independently
- Metadata is not optional: It enables filtering, debugging, and audit trails
- Validate early and often: Catch extraction errors before they propagate
- Design for failure: One bad document shouldn't break your entire pipeline
- Test on real data: Sample documents reveal edge cases synthetic data misses
Document Processing Pipeline Architecture
END-TO-END DOCUMENT PROCESSING PIPELINE

┌──────────────────────────────────────────────────────────────┐
│                        INPUT SOURCES                         │
│    File System    Cloud Storage    APIs    Email             │
└──────────────────────────┬───────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ 1. INGESTION LAYER                                           │
│    • Format detection (MIME types)                           │
│    • File validation (size, readability)                     │
│    • Queue management (async processing)                     │
└──────────────────────────┬───────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ 2. EXTRACTION LAYER                                          │
│    • PDF    → PyMuPDF / pdfplumber                           │
│    • DOCX   → python-docx                                    │
│    • HTML   → trafilatura / BeautifulSoup                    │
│    • Images → Tesseract OCR                                  │
└──────────────────────────┬───────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ 3. CLEANING LAYER                                            │
│    • Unicode normalization (NFKC)                            │
│    • Boilerplate removal (headers/footers)                   │
│    • Whitespace normalization                                │
│    • Encoding error fixing                                   │
└──────────────────────────┬───────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ 4. CHUNKING LAYER                                            │
│    • Strategy selection (fixed/recursive/semantic)           │
│    • Size validation (200-512 tokens)                        │
│    • Overlap application (50-100 tokens)                     │
│    • Boundary detection (sentences/paragraphs)               │
└──────────────────────────┬───────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ 5. ENRICHMENT LAYER                                          │
│    • Metadata extraction (dates, authors)                    │
│    • Language detection                                      │
│    • Named entity recognition (NER)                          │
│    • Topic classification                                    │
│    • Unique ID generation                                    │
└──────────────────────────┬───────────────────────────────────┘
                           ▼
┌──────────────────────────────────────────────────────────────┐
│ 6. VALIDATION LAYER                                          │
│    • Length checks (min/max)                                 │
│    • Language validation                                     │
│    • Character diversity checks                              │
│    • Metadata completeness                                   │
└──────────────────────────┬───────────────────────────────────┘
                           ▼
                     ┌─────┴─────┐
                     ▼           ▼
                   VALID      INVALID
                     │           │
                     │           ▼
                     │      Error Queue
                     │      (Manual Review)
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                       OUTPUT STORAGE                         │
│    Vector DB    Document Store    Metadata DB                │
└──────────────────────────────────────────────────────────────┘
💡 Pro tip: Implement this pipeline with message queues (e.g., Kafka, RabbitMQ) for scalability. Each layer can scale independently based on load.
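As an in-process stand-in for a real message queue, the sketch below decouples two layers with Python's standard-library Queue and worker threads; with Kafka or RabbitMQ each worker would instead consume from a broker topic. All names and the placeholder processing steps are illustrative.

from queue import Queue
from threading import Thread

extract_q: Queue = Queue()
chunk_q: Queue = Queue()

def extraction_worker():
    while True:
        path = extract_q.get()
        if path is None:            # sentinel: shut down and propagate downstream
            chunk_q.put(None)
            break
        chunk_q.put(f"extracted text of {path}")  # placeholder for real extraction
        extract_q.task_done()

def chunking_worker():
    while True:
        text = chunk_q.get()
        if text is None:
            break
        print(f"chunked: {text}")   # placeholder for real chunking + indexing

Thread(target=extraction_worker).start()
Thread(target=chunking_worker).start()
for path in ["a.pdf", "b.pdf"]:
    extract_q.put(path)
extract_q.put(None)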
Performance Optimization Tips
- Parallel processing: Process multiple documents concurrently
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    results = executor.map(process_document, file_list)
- Batch operations: Group similar operations (e.g., embed multiple chunks at once)
# ✅ GOOD: Batch embedding
embeddings = model.encode(chunk_texts, batch_size=32)

# ❌ SLOW: Individual embedding
embeddings = [model.encode(text) for text in chunk_texts]
- Caching: Store extraction results to avoid reprocessing
import hashlib
import pickle
def get_cache_key(file_path):
    with open(file_path, 'rb') as f:
        file_hash = hashlib.md5(f.read()).hexdigest()
    return f"extracted_{file_hash}"

# Check cache before processing
if cache_key in cache:
    return cache[cache_key]
- Stream large files: Don't load entire file into memory
# ✅ GOOD: Stream processing
for page_num in range(len(pdf_doc)):
    page = pdf_doc[page_num]
    process_page(page)
    # Page released from memory

# ❌ BAD: Load all at once
all_pages = [page.get_text() for page in pdf_doc]  # Memory spike!
Further Study
Deepen your document processing expertise with these resources:
LangChain Documentation - Document Loaders: Comprehensive guide to document loaders and text splitters with code examples
https://python.langchain.com/docs/modules/data_connection/document_loaders/
Unstructured.io Documentation: Open-source library for preprocessing diverse document types (PDFs, images, HTML) for LLM applications
https://unstructured-io.github.io/unstructured/
Apache Tika: Powerful toolkit for detecting and extracting metadata/text from 1000+ file types, with Python bindings
https://tika.apache.org/
Next Steps: Now that you've mastered document processing, the next node in the roadmap covers Embedding Generation: transforming your processed text chunks into vector representations for semantic search. You'll learn about embedding models, dimensionality considerations, and batch processing strategies!