Production Deployment & Optimization
Scale RAG systems to production with monitoring, caching, security, and cost optimization strategies.
Master production deployment and optimization for AI search systems. This lesson covers containerization strategies, scalability patterns, monitoring infrastructure, and performance optimization techniques: essential concepts for deploying robust RAG applications in production environments.
Welcome to Production Deployment
Deploying an AI search or RAG system to production isn't just about copying code to a server. It requires careful consideration of scalability, reliability, observability, and cost optimization. A system that works perfectly on your laptop can fail spectacularly under real-world load without proper deployment architecture.
This lesson walks you through the complete journey from development to production, covering containerization, orchestration, monitoring, and optimization strategies that distinguish hobby projects from enterprise-grade deployments.
Core Concepts: Architecture & Infrastructure
Containerization with Docker
Containerization packages your application and all its dependencies into isolated, reproducible units called containers. For AI search systems, this is critical because:
- Dependency management: Vector databases, embedding models, and RAG pipelines have complex dependencies
- Reproducibility: "Works on my machine" becomes "works everywhere"
- Resource isolation: Prevent memory leaks in one component from crashing others
- Version control: Roll back problematic deployments instantly
A typical RAG application uses multi-stage Docker builds to optimize image size:
| Stage | Purpose | Base Image |
|---|---|---|
| Builder | Compile code, install dependencies | python:3.11-slim |
| Runtime | Run application with minimal footprint | python:3.11-slim (or a distroless image) |
Pro tip: Use .dockerignore to exclude model weights and vector databases from your image. Mount these as volumes instead; rebuilding 5GB images for code changes wastes time and storage.
Container Orchestration with Kubernetes
Kubernetes (K8s) manages container deployment, scaling, and networking across clusters. For RAG systems, key K8s resources include:
Deployments - Manage stateless components like API servers:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
      - name: api
        image: rag-api:v1.2.0
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
StatefulSets - Manage stateful components like vector databases with persistent identities and storage.
Services - Expose applications internally (ClusterIP) or externally (LoadBalancer).
Horizontal Pod Autoscaler (HPA) - Automatically scale based on CPU, memory, or custom metrics like query latency:
KUBERNETES AUTOSCALING: the Metrics Server feeds utilization data to the HPA controller, which scales the pod count up when load is high (e.g. CPU > 80%) and back down when it drops (e.g. CPU < 30%).
Managed Services vs Self-Hosted
Choosing between managed services and self-hosted infrastructure impacts cost, control, and operational burden:
| Component | Managed Option | Self-Hosted Option |
|---|---|---|
| Vector DB | Pinecone, Weaviate Cloud | Qdrant, Milvus on K8s |
| Embedding API | OpenAI, Cohere | Sentence-Transformers on GPU |
| LLM | OpenAI API, Anthropic | vLLM, TGI with Llama |
| Orchestration | AWS EKS, GKE, AKS | Self-managed K8s cluster |
Decision framework: Start with managed services for faster time-to-market (a rough cost-comparison sketch follows the list below). Self-host when:
- Monthly API costs exceed infrastructure costs by 3x+
- Data residency requires on-premises deployment
- You need sub-100ms latency (API calls add 50-200ms)
- Your team has strong DevOps capabilities
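To make the 3x rule concrete, here is a rough back-of-the-envelope comparison sketch in Python; every number in it (queries per day, tokens per query, per-token pricing, GPU hourly rate, the 1.3x ops-overhead multiplier) is a hypothetical placeholder to replace with your own quotes.

# Back-of-the-envelope comparison for the managed-vs-self-hosted decision.
# Every number below is a hypothetical placeholder; substitute your own quotes.

def monthly_api_cost(queries_per_day: int, tokens_per_query: int,
                     price_per_1k_tokens: float) -> float:
    """Estimated monthly spend on a managed LLM/embedding API."""
    return queries_per_day * 30 * tokens_per_query / 1000 * price_per_1k_tokens

def monthly_self_hosted_cost(gpu_hourly_rate: float, num_gpus: int,
                             ops_overhead: float = 1.3) -> float:
    """GPU rental plus a rough multiplier for DevOps and maintenance time."""
    return gpu_hourly_rate * num_gpus * 24 * 30 * ops_overhead

api = monthly_api_cost(queries_per_day=50_000, tokens_per_query=1_500,
                       price_per_1k_tokens=0.01)
infra = monthly_self_hosted_cost(gpu_hourly_rate=2.0, num_gpus=2)

print(f"Managed API:  ${api:,.0f}/month")        # $22,500
print(f"Self-hosted:  ${infra:,.0f}/month")      # $3,744
print(f"Ratio: {api / infra:.1f}x -> above the 3x threshold, consider self-hosting")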
Scalability Patterns
Horizontal vs Vertical Scaling
Vertical scaling (scaling up) adds resources to existing instances:
- ✅ Simple to implement
- ✅ No code changes needed
- ❌ Hardware limits (max CPU/RAM)
- ❌ Single point of failure
- ❌ Expensive at scale
Horizontal scaling (scaling out) adds more instances:
- ✅ Nearly unlimited scaling
- ✅ Built-in redundancy
- ✅ Cost-effective with commodity hardware
- ❌ Requires stateless design
- ❌ Needs load balancing
VERTICAL SCALING vs HORIZONTAL SCALING: a single server with 32 CPUs and 256GB RAM (limited by the largest machine you can buy) versus three load-balanced servers with 8 CPUs each.
Load Balancing Strategies
For RAG systems, intelligent load balancing goes beyond simple round-robin:
Layer 7 (Application) Load Balancing routes based on request content:
- Path-based: /search → query service, /embed → embedding service
- Header-based: Route premium users to GPU instances
- Content-based: Route complex queries to high-memory nodes
Least Outstanding Requests (LOR) routing sends requests to the instance with the fewest active connections, which is critical for variable-latency RAG queries where simple round-robin creates imbalance.
Sticky sessions with consistent hashing ensure follow-up questions in a conversation hit the same backend (preserving conversation cache).
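As an illustration of the idea, here is a minimal consistent-hash ring sketch; in practice you would usually rely on your load balancer's or gateway's built-in consistent hashing keyed on a conversation or session ID, and the backend names below are hypothetical.

import bisect
import hashlib

class ConsistentHashRing:
    """Maps a conversation ID to a stable backend; only ~1/N of keys move when a backend is added or removed."""

    def __init__(self, backends, replicas=100):
        self._ring = []                      # sorted list of (hash, backend)
        for backend in backends:
            for i in range(replicas):        # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{backend}#{i}"), backend))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, conversation_id: str) -> str:
        """The same conversation always lands on the same backend."""
        idx = bisect.bisect(self._keys, self._hash(conversation_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["rag-api-0", "rag-api-1", "rag-api-2"])
print(ring.route("conv_42"))   # stable across calls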
Caching Layers
Multi-tier caching dramatically reduces costs and latency:
CACHING ARCHITECTURE (each tier is checked in order; a miss falls through to the next):
- L1: Application cache (in-memory/Redis), 0.5-2ms latency. Caches embeddings and frequent queries.
- L2: Vector DB cache, 10-50ms latency. Caches recent search results.
- L3: Full vector search, 50-500ms latency. Full semantic search + rerank.
- L4: LLM generation, 1-5s latency. Generates the answer from retrieved context (new queries only).
Cache invalidation strategies:
- Time-based (TTL): Expire embeddings after 24h
- Event-based: Invalidate on document updates
- LRU (Least Recently Used): Evict cold data when cache is full
Memory device: "CLEAR caching" - Consistent hashing, LRU eviction, Event-based invalidation, Application-level cache, Redis for speed.
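A minimal in-process sketch combining the TTL and LRU strategies above, roughly what might sit in front of Redis as an L1 cache for hot query embeddings (sizes and TTLs are illustrative):

import time
from collections import OrderedDict

class TTLLRUCache:
    def __init__(self, max_items=10_000, ttl_seconds=24 * 3600):
        self.max_items = max_items
        self.ttl = ttl_seconds
        self._store = OrderedDict()          # key -> (stored_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.time() - stored_at > self.ttl:   # time-based (TTL) invalidation
            del self._store[key]
            return None
        self._store.move_to_end(key)             # mark as recently used
        return value

    def set(self, key, value):
        self._store[key] = (time.time(), value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_items:    # LRU eviction when full
            self._store.popitem(last=False)

    def invalidate(self, key):
        """Event-based invalidation, e.g. when a source document is updated."""
        self._store.pop(key, None)

embedding_cache = TTLLRUCache(max_items=50_000, ttl_seconds=24 * 3600)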
Database Optimization
Vector database indexing trades accuracy for speed:
| Index Type | Search Speed | Recall | Use Case |
|---|---|---|---|
| HNSW | Fast (10-50ms) | 95-99% | Production default |
| IVF | Medium (20-100ms) | 90-95% | Large datasets |
| Flat (exact) | Slow (100ms-1s) | 100% | Small datasets, research |
| Product Quantization | Very fast (5-20ms) | 85-92% | Memory-constrained |
Connection pooling prevents database exhaustion:
## Configure connection pool sizing
pool_size = (available_connections * 0.8) / num_app_instances
max_overflow = pool_size * 0.3
## Example: 100 DB connections, 5 app instances
## pool_size = (100 * 0.8) / 5 = 16 per instance
## max_overflow = 16 * 0.3 ≈ 5 burst connections
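Applied with SQLAlchemy's built-in connection pool, the sizing above might look like the sketch below; the DSN and server numbers are placeholders.

from sqlalchemy import create_engine

available_connections = 100      # max_connections configured on the database
num_app_instances = 5

pool_size = int((available_connections * 0.8) / num_app_instances)   # 16
max_overflow = round(pool_size * 0.3)                                 # 5 burst connections

engine = create_engine(
    "postgresql+psycopg2://user:password@db-host:5432/ragdb",  # placeholder DSN
    pool_size=pool_size,
    max_overflow=max_overflow,
    pool_pre_ping=True,      # drop dead connections before handing them out
    pool_recycle=1800,       # recycle connections every 30 minutes
)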
Read replicas distribute query load across multiple database instances while maintaining a single write primary.
Monitoring & Observability
The Three Pillars
Observability consists of three complementary data types:
1. Metrics - Aggregated numerical measurements over time:
- Request rate (requests/second)
- Latency percentiles (p50, p90, p95, p99)
- Error rate (%)
- Resource utilization (CPU, memory, GPU)
- Token consumption (for LLM APIs)
2. Logs - Discrete event records:
- Application logs (errors, warnings)
- Access logs (who queried what, when)
- Audit logs (compliance, security)
- Debug logs (troubleshooting)
3. Traces - Request journey through distributed systems:
DISTRIBUTED TRACE: RAG Query Flow
User Request [span: 3542ms]
│
├── API Gateway [span: 15ms]
│   ├── Auth Service [span: 8ms]
│   └── Rate Limiter [span: 2ms]
│
├── Embedding Service [span: 145ms]
│   └── Model Inference [span: 138ms]
│
├── Vector Search [span: 89ms]
│   ├── Index Query [span: 45ms]
│   └── Reranking [span: 38ms]
│
└── LLM Generation [span: 3280ms]
    ├── Context Assembly [span: 12ms]
    └── OpenAI API Call [span: 3265ms]
Bottleneck: LLM generation (93% of total time)
Key Metrics for RAG Systems
Monitor these golden signals specific to AI search:
| Metric | Target | Alert Threshold |
|---|---|---|
| Query latency (p95) | <2s | >5s |
| Embedding latency (p95) | <200ms | >500ms |
| Vector search recall@10 | >90% | <80% |
| Cache hit rate | >60% | <40% |
| GPU utilization | 70-90% | <30% or >95% |
| Error rate | <0.1% | >1% |
| Token cost per query | Varies | +50% spike |
Custom metrics for business intelligence:
- Average tokens per response (cost control)
- Retrieval confidence scores (answer quality)
- User satisfaction ratings (feedback loops)
- Query complexity distribution (capacity planning)
Alerting Strategy
Alert fatigue kills on-call effectiveness. Design alerts using SLO-based alerting:
- Define Service Level Objectives (SLOs): "95% of queries complete in <2s"
- Set error budgets: Allowed failure rate over time window
- Alert when the error-budget burn rate indicates an SLO violation is imminent (see the burn-rate sketch below)
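A simplified single-window sketch of that burn-rate calculation; real setups typically use multi-window, multi-burn-rate alerts, and thresholds like 14.4 for a 1-hour window come from common SRE guidance and are illustrative here.

SLO_TARGET = 0.95                 # "95% of queries complete in <2s"
ERROR_BUDGET = 1 - SLO_TARGET     # 5% of requests may miss the target per window

def burn_rate(bad_requests: int, total_requests: int) -> float:
    """How fast the error budget is being consumed relative to the allowed rate.

    1.0 means exactly on budget; a 1-hour burn rate of 14.4 would exhaust a
    30-day budget in roughly two days.
    """
    if total_requests == 0:
        return 0.0
    return (bad_requests / total_requests) / ERROR_BUDGET

# Last hour: 1,200 of 20,000 queries exceeded the 2s latency target
rate_1h = burn_rate(bad_requests=1_200, total_requests=20_000)   # 0.06 / 0.05 = 1.2

if rate_1h > 14.4:
    print("CRITICAL: page on-call, SLO violation imminent")
elif rate_1h > 3.0:
    print("WARNING: investigate during business hours")
else:
    print(f"OK: burn rate {rate_1h:.1f}, within budget")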
Alert severity levels:
- Critical: User-facing outage; page immediately
- Warning: Degraded performance; notify during business hours
- Info: Unusual pattern; log for investigation
Common mistake: Alerting on low-level causes instead of user impact. Alert on "API returning 500 errors" (impact), not "disk 80% full" (a cause).
Logging Best Practices
Structured logging enables automated analysis:
{
"timestamp": "2026-03-15T14:32:18.123Z",
"level": "INFO",
"service": "rag-api",
"trace_id": "a7b3c9d2e1f4",
"user_id": "usr_789",
"query": "What are transformer architectures?",
"latency_ms": 1847,
"retrieved_docs": 5,
"llm_tokens": 312,
"cache_hit": false
}
Log levels hierarchy:
- DEBUG → Verbose developer info (disabled in prod)
- INFO → Normal operations (successful queries)
- WARN → Degraded but functional (fallback to cached results)
- ERROR → Failed operations (query timeout)
- CRITICAL → System failure (database unreachable)
Privacy consideration: Never log PII (personally identifiable information) or full user queries without hashing/anonymization. Use a query_hash instead of the full text.
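A minimal standard-library sketch of that pattern, logging a stable query_hash in place of the raw text (field names mirror the structured-log example above):

import hashlib
import json
import logging

logger = logging.getLogger("rag-api")

def query_hash(query: str) -> str:
    """Deterministic, non-reversible identifier for grouping identical queries.
    Consider a keyed hash (HMAC) if common queries themselves are sensitive."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]

def log_query(query: str, latency_ms: int, cache_hit: bool) -> None:
    logger.info(json.dumps({
        "event": "rag_query",
        "query_hash": query_hash(query),   # never the raw query text
        "latency_ms": latency_ms,
        "cache_hit": cache_hit,
    }))

log_query("What are transformer architectures?", latency_ms=1847, cache_hit=False)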
Performance Optimization
Model Optimization Techniques
Quantization reduces model size and increases inference speed by using lower-precision numbers:
| Precision | Size | Speed | Quality Loss |
|---|---|---|---|
| FP32 (baseline) | 100% | 1x | 0% |
| FP16 | 50% | 2-3x | <1% |
| INT8 | 25% | 3-4x | 1-3% |
| INT4 | 12.5% | 4-6x | 3-8% |
For embedding models, INT8 quantization typically maintains 98%+ quality while doubling throughput.
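A hedged sketch of post-training dynamic INT8 quantization with PyTorch; the model name is an example, actual speedups depend on hardware, and you should re-measure retrieval recall on your own evaluation set before rolling this out.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"   # example embedding model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Quantize the Linear layers to INT8; activations stay in float (dynamic quantization)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer(["what is a vector database?"], return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    # Naive mean pooling over tokens (ignores the attention mask for brevity)
    embeddings = quantized(**inputs).last_hidden_state.mean(dim=1)
print(embeddings.shape)   # torch.Size([1, 384])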
Model distillation creates smaller "student" models that mimic larger "teacher" models:
- DistilBERT: 40% smaller and 60% faster than BERT-base, retaining ~97% of its performance
- MiniLM: 5x faster inference than RoBERTa base
Batch processing amortizes overhead across multiple requests:
SINGLE REQUEST vs BATCH PROCESSING: processed one at a time, each request costs roughly 50ms of model time; batched together, a small batch of requests completes in about 80ms total (~27ms average per request), roughly a 40% efficiency gain.
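For embedding workloads, batching is often a one-line change; a sketch with sentence-transformers (model name illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [f"document chunk {i}" for i in range(256)]

# One batched call instead of 256 separate model invocations
embeddings = model.encode(texts, batch_size=64, show_progress_bar=False)
print(embeddings.shape)   # (256, 384)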
GPU Optimization
For self-hosted models, GPU utilization determines ROI:
Tensor parallelism splits a single model's weights across multiple GPUs (typically required for 70B+ parameter models).
Pipeline parallelism distributes model layers across GPUs: layers 1-10 on GPU 1, layers 11-20 on GPU 2, and so on.
Dynamic batching with continuous batching (pioneered by vLLM) allows adding new requests to in-flight batches, raising GPU utilization from around 30% to 80%+.
KV-cache optimization: Transformer models cache key-value pairs during generation (a serving sketch follows this list). Optimize by:
- Paged attention: Store KV cache in non-contiguous memory pages
- Prefix caching: Reuse cached system prompts across requests
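A hedged serving sketch with vLLM, which implements continuous batching and paged attention internally; the model name is an example and the enable_prefix_caching flag is an assumption to check against your vLLM version.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # example model choice
    gpu_memory_utilization=0.90,
    enable_prefix_caching=True,   # reuse the KV cache for the shared system prompt (assumed available)
)

system_prompt = "You are a helpful assistant. Answer only from the provided context.\n\n"
prompts = [system_prompt + f"Context: ...\nQuestion: question {i}" for i in range(32)]

# vLLM schedules these as a continuously batched workload on the GPU
outputs = llm.generate(prompts, SamplingParams(temperature=0.2, max_tokens=256))
for out in outputs[:2]:
    print(out.outputs[0].text[:80])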
Network Optimization
CDN for static assets: Serve embedding model files via CloudFront/Cloudflare instead of S3 directly (20-50ms saved per model load).
gRPC over REST for service-to-service communication:
- Binary protocol (smaller payloads)
- HTTP/2 multiplexing (reduced connections)
- Built-in load balancing
- 2-5x faster than JSON over HTTP/1.1
Connection keep-alive: Reuse HTTP connections to avoid handshake overhead (saves 50-200ms per request).
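A sketch of connection reuse with httpx: one shared async client (created at startup) keeps connections alive across requests instead of re-handshaking per call; the internal service URL and endpoint are hypothetical.

import asyncio
import httpx

# Created once at startup and shared by all request handlers
client = httpx.AsyncClient(
    base_url="http://embedding-service:8080",   # hypothetical internal service
    timeout=10.0,
    limits=httpx.Limits(max_keepalive_connections=20, max_connections=100),
)

async def embed(text: str) -> list:
    resp = await client.post("/embed", json={"text": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

async def main():
    vectors = await asyncio.gather(*(embed(f"query {i}") for i in range(10)))
    print(len(vectors))

# asyncio.run(main())   # run against a real endpoint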
Cost Optimization
Balancing performance and cost requires continuous monitoring:
Right-sizing instances: Over-provisioning wastes money, under-provisioning creates bottlenecks.
| Metric | Under-provisioned | Right-sized | Over-provisioned |
|---|---|---|---|
| CPU utilization | >90% | 60-80% | <30% |
| Memory usage | >85% | 60-75% | <40% |
| GPU utilization | >95% | 70-85% | <40% |
Spot instances for non-critical workloads (embedding batch jobs) save 70-90% compared to on-demand.
Auto-scaling policies with scheduled scaling handle predictable load patterns (scale up at 9am, down at 6pm).
Token budgeting for LLM APIs:
## Set per-user monthly token limits
user_token_limit = 100_000 # ~$3/month at GPT-4 input prices ($0.03 per 1K tokens)
## Implement exponential backoff for retries
## Cache aggressively (60%+ hit rate = 60% cost savings)
## Use cheaper models for simple queries (GPT-3.5 vs GPT-4)
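A sketch tying those ideas together: per-user budget checks plus a crude complexity heuristic that routes simple queries to a cheaper model. The limits, model names, and heuristic are illustrative, and real usage counters belong in Redis or a database rather than process memory.

from collections import defaultdict

MONTHLY_TOKEN_LIMIT = 100_000
usage = defaultdict(int)          # user_id -> tokens this month; use Redis in production

def estimate_complexity(query: str) -> str:
    """Crude heuristic: long or multi-part questions go to the stronger model."""
    return "complex" if len(query.split()) > 30 or "compare" in query.lower() else "simple"

def choose_model(query: str) -> str:
    return "gpt-4" if estimate_complexity(query) == "complex" else "gpt-3.5-turbo"

def check_budget(user_id: str, estimated_tokens: int) -> None:
    if usage[user_id] + estimated_tokens > MONTHLY_TOKEN_LIMIT:
        raise RuntimeError(f"{user_id} exceeded the monthly token budget")

def record_usage(user_id: str, tokens_used: int) -> None:
    usage[user_id] += tokens_used

user, query = "usr_789", "Compare HNSW and IVF indexes for a 100M-vector corpus"
check_budget(user, estimated_tokens=2_000)
model = choose_model(query)           # -> "gpt-4" for this query
# response = call_llm(model, query)   # hypothetical call; then record_usage(user, response.tokens_used)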
Practical Examples
Example 1: Production-Ready Docker Setup
Here's a multi-stage Dockerfile for a RAG API optimized for production:
## Stage 1: Builder - install dependencies
FROM python:3.11-slim AS builder
WORKDIR /app

## Install system dependencies needed to compile wheels
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

## Copy only requirements first (caching layer)
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

## Stage 2: Runtime - minimal production image
FROM python:3.11-slim
WORKDIR /app

## Security: create the non-root user first so copied files can be owned by it
RUN useradd -m -u 1000 appuser

## Copy installed packages and application code from the builder
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser ./src ./src
COPY --chown=appuser:appuser ./config ./config

USER appuser

## Add user-level binaries to PATH (matches the non-root user's home)
ENV PATH=/home/appuser/.local/bin:$PATH

## Health check endpoint (uses the Python stdlib, so curl isn't needed in the slim image)
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

## Run with gunicorn for production
CMD ["gunicorn", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", \
     "--bind", "0.0.0.0:8000", "--timeout", "120", "src.main:app"]
Key optimizations:
- Multi-stage build: Final image 60% smaller
- Layer caching: Rebuilds take 10s instead of 5min
- Non-root user: Security best practice
- Health checks: Kubernetes auto-restarts failed containers
- Gunicorn: Production-grade process manager (here running Uvicorn ASGI workers)
Example 2: Kubernetes Deployment with Autoscaling
Complete K8s configuration for a scalable RAG API:
## Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: rag-api
labels:
app: rag-api
spec:
replicas: 3
selector:
matchLabels:
app: rag-api
template:
metadata:
labels:
app: rag-api
spec:
containers:
- name: api
image: myregistry/rag-api:v2.1.0
ports:
- containerPort: 8000
env:
- name: VECTOR_DB_URL
valueFrom:
secretKeyRef:
name: db-secrets
key: connection_url
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: api-keys
key: openai_key
resources:
requests:
memory: "2Gi"
cpu: "1"
limits:
memory: "4Gi"
cpu: "2"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 10
periodSeconds: 5
---
## Service
apiVersion: v1
kind: Service
metadata:
name: rag-api-service
spec:
type: LoadBalancer
selector:
app: rag-api
ports:
- protocol: TCP
port: 80
targetPort: 8000
---
## Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: rag-api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: rag-api
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 75
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5min before scaling down
policies:
- type: Percent
value: 50 # Scale down max 50% at once
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100 # Scale up max 100% at once
periodSeconds: 30
Why this works: The HPA monitors CPU and memory, scaling from 3 to 20 pods automatically. Conservative scale-down prevents thrashing during traffic fluctuations.
Example 3: Monitoring with Prometheus & Grafana
Instrument your application with metrics:
from prometheus_client import Counter, Histogram, Gauge
import time
## Define metrics
query_counter = Counter(
'rag_queries_total',
'Total number of RAG queries',
['status', 'user_tier']
)
query_latency = Histogram(
'rag_query_duration_seconds',
'Query latency in seconds',
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)
active_queries = Gauge(
'rag_active_queries',
'Number of queries currently processing'
)
llm_tokens = Counter(
'rag_llm_tokens_total',
'Total LLM tokens consumed',
['model']
)
## Instrument your endpoint
@app.post("/query")
async def query_endpoint(query: str, user_tier: str):
active_queries.inc() # Increment active queries
start_time = time.time()
try:
# Perform RAG query
result = await perform_rag_query(query)
# Track token usage
llm_tokens.labels(model='gpt-4').inc(result['tokens_used'])
# Record success
query_counter.labels(status='success', user_tier=user_tier).inc()
return result
except Exception as e:
# Record failure
query_counter.labels(status='error', user_tier=user_tier).inc()
raise
finally:
# Always record latency and decrement active queries
query_latency.observe(time.time() - start_time)
active_queries.dec()
Grafana dashboard queries:
- Query rate: rate(rag_queries_total[5m])
- P95 latency: histogram_quantile(0.95, sum(rate(rag_query_duration_seconds_bucket[5m])) by (le))
- Error rate: rate(rag_queries_total{status="error"}[5m]) / rate(rag_queries_total[5m])
- Cost estimate: increase(rag_llm_tokens_total[1h]) * 0.00003 (approximate GPT-4 input pricing per token)
Example 4: Implementing Circuit Breakers
Prevent cascading failures when downstream services (like LLM APIs) fail:
from circuitbreaker import circuit
import logging
logger = logging.getLogger(__name__)
@circuit(
failure_threshold=5, # Open after 5 failures
recovery_timeout=60, # Try again after 60s
expected_exception=Exception
)
async def call_llm_api(prompt: str, context: str):
"""
Circuit breaker wrapper for LLM API calls.
States:
- CLOSED: Normal operation, requests pass through
- OPEN: Failures exceeded threshold, requests fail fast
- HALF_OPEN: Testing if service recovered
"""
try:
response = await openai_client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": context},
{"role": "user", "content": prompt}
],
timeout=30
)
return response.choices[0].message.content
except Exception as e:
logger.error(f"LLM API call failed: {e}")
raise
## Use with fallback
async def generate_answer(query: str, context: str):
try:
return await call_llm_api(query, context)
except Exception:
# Circuit open - return cached/degraded response
logger.warning("Circuit breaker open, using fallback")
return await get_cached_similar_answer(query)
Benefits: When OpenAI API goes down, your system fails fast (no 30s timeouts) and serves cached responses instead of cascading failures.
CIRCUIT BREAKER STATE MACHINE
CLOSED (normal operation)
   |  5 failures
   v
OPEN (fail fast)
   |  60s timeout
   v
HALF-OPEN (testing)
   |-- success --> back to CLOSED
   '-- failure --> back to OPEN
Common Mistakes
Mistake 1: No Resource Limits
Problem: Deploying containers without CPU/memory limits allows a single misbehaving container to consume all host resources, crashing other services.
Solution: Always set both requests (reserved) and limits (maximum) in Kubernetes. Monitor actual usage and adjust over time.
Mistake 2: Synchronous Everything
Problem: Making synchronous API calls to embedding services and LLMs blocks workers, reducing throughput by 10-20x.
Solution: Use async/await patterns:
## BAD: Synchronous - blocks worker
def process_query(query):
embedding = embedding_api.create(query) # 100ms blocked
results = vector_db.search(embedding) # 50ms blocked
answer = llm_api.generate(results) # 2000ms blocked
return answer # Total: 2150ms per worker
## GOOD: Asynchronous - non-blocking
async def process_query(query):
embedding = await embedding_api.create(query) # Other requests process
results = await vector_db.search(embedding)
answer = await llm_api.generate(results)
return answer # Same latency, 10x throughput
Mistake 3: Logging Everything
Problem: Verbose logging in production creates multi-GB log files daily, overwhelming storage and making debugging harder (finding needles in haystacks).
Solution:
- Use INFO level in production (not DEBUG)
- Sample high-frequency events (log 1% of successful queries); see the sampling sketch after this list
- Aggregate metrics instead of individual logs
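A small sampling-filter sketch for Python's standard logging module: warnings and errors always pass, while routine INFO events are sampled at ~1%.

import logging
import random

class SampleInfoFilter(logging.Filter):
    """Pass every WARN/ERROR/CRITICAL record, but only ~1% of INFO records."""

    def __init__(self, sample_rate: float = 0.01):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.INFO:    # always keep warnings and above
            return True
        return random.random() < self.sample_rate

logger = logging.getLogger("rag-api")
handler = logging.StreamHandler()
handler.addFilter(SampleInfoFilter(sample_rate=0.01))
logger.addHandler(handler)
logger.setLevel(logging.INFO)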
Mistake 4: No Health Checks
Problem: Load balancers route traffic to failed instances, causing 50% error rates until manual intervention.
Solution: Implement /health (liveness) and /ready (readiness) endpoints:
@app.get("/health")
async def health_check():
"""Returns 200 if service is alive (basic)"""
return {"status": "healthy"}
@app.get("/ready")
async def readiness_check():
"""Returns 200 only if service can handle requests"""
checks = {
"database": await check_db_connection(),
"vector_db": await check_vector_db(),
"embedding_model": await check_model_loaded()
}
if all(checks.values()):
return {"status": "ready", "checks": checks}
else:
raise HTTPException(status_code=503, detail="Not ready")
Mistake 5: Ignoring Cold Start Times
Problem: First requests after deployment take 30-60s while models load, causing timeouts and poor user experience.
Solution:
- Pre-warm instances: Make dummy requests during initialization (see the sketch after this list)
- Use readiness probes with a sufficient initialDelaySeconds
- Implement rolling deployments (new pods ready before old ones terminate)
- Consider keeping model in memory (not loading per-request)
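A pre-warming sketch for a FastAPI service using the lifespan hook: the embedding model is loaded and exercised once before the readiness probe reports ready. The model name and state handling are illustrative.

from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from sentence_transformers import SentenceTransformer

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load once at startup and keep in memory (not per-request)
    state["embedder"] = SentenceTransformer("all-MiniLM-L6-v2")   # example model
    state["embedder"].encode(["warm-up query"])   # trigger lazy initialization
    state["ready"] = True
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

@app.get("/ready")
async def ready():
    if not state.get("ready"):
        raise HTTPException(status_code=503, detail="Still warming up")
    return {"status": "ready"}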
Mistake 6: No Rate Limiting
Problem: A single user makes 1,000 req/sec, overwhelming the system and causing an outage for everyone. Or worse, runaway retry loops amplify failures.
Solution: Implement tiered rate limiting:
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter   # register on your existing FastAPI app
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query")
@limiter.limit("10/minute")   # Free tier
async def query_free_tier(request: Request):
    ...

@app.post("/query/premium")
@limiter.limit("100/minute")  # Paid tier
async def query_premium(request: Request):
    ...
Mistake 7: Deploying Without Testing Under Load
Problem: System works perfectly in testing, then crashes under production load due to connection pool exhaustion, memory leaks, or database locks.
Solution: Load test before production:
## Use tools like Locust or k6
k6 run --vus 100 --duration 10m load-test.js
## Monitor during test:
## - Response times (watch for degradation)
## - Error rates (should stay <0.1%)
## - Resource usage (CPU, memory, connections)
## - Database performance (query times, locks)
Gradually increase load: 10 users → 50 → 100 → 500, identifying breaking points.
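If you prefer Locust (mentioned above) over k6, a minimal locustfile sketch might look like this; the endpoint paths, payload, and task weighting are assumptions about your API.

from locust import HttpUser, task, between

class RagUser(HttpUser):
    wait_time = between(1, 3)   # simulated think time between requests

    @task(9)
    def search_query(self):
        self.client.post("/query", json={"query": "What are transformer architectures?"})

    @task(1)
    def health_check(self):
        self.client.get("/health")

Run it headless with something like locust -f locustfile.py --headless --users 100 --spawn-rate 10 --run-time 10m --host http://localhost:8000, increasing --users in stages as described above.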
Key Takeaways
✅ Containerization is essential - Docker ensures reproducible deployments across environments
✅ Kubernetes provides orchestration - Automated scaling, self-healing, and resource management
✅ Choose managed vs self-hosted strategically - Start managed, self-host when cost/latency/control justify operational complexity
✅ Scale horizontally - Add more instances rather than bigger instances for better reliability and cost
✅ Implement multi-tier caching - 60%+ cache hit rates reduce costs and latency dramatically
✅ Monitor the three pillars - Metrics, logs, and traces provide complete system observability
✅ Optimize models carefully - Quantization and distillation improve performance with minimal quality loss
✅ Set resource limits - Prevent resource exhaustion and enable efficient bin-packing
✅ Use circuit breakers - Fail fast when downstream services go down instead of letting failures cascade
✅ Load test thoroughly - Discover breaking points in staging, not production
Quick Reference Card: Production Deployment Checklist
| Checklist item | Details |
|---|---|
| ✅ Containerization | Multi-stage Dockerfile, .dockerignore, health checks |
| ✅ Orchestration | K8s Deployment, Service, HPA configured |
| ✅ Resource Management | CPU/memory requests & limits set |
| ✅ Scaling | Horizontal autoscaling, load balancing configured |
| ✅ Caching | Redis/in-memory cache with 60%+ hit rate |
| ✅ Monitoring | Metrics, logs, traces with alerts |
| ✅ Optimization | Model quantization, batching, connection pooling |
| ✅ Resilience | Circuit breakers, retries, timeouts |
| ✅ Security | Non-root containers, secrets management, rate limiting |
| ✅ Testing | Load tests passing at 2x expected traffic |
Further Study
Kubernetes Official Documentation - https://kubernetes.io/docs/home/ - Comprehensive K8s guides and best practices
Prometheus & Grafana Tutorials - https://prometheus.io/docs/introduction/overview/ - Learn metrics collection and visualization
Vector Database Performance Benchmarks - https://github.com/erikbern/ann-benchmarks - Compare vector index performance across databases
You now have the knowledge to deploy production-grade AI search and RAG systems. Focus on reliability first, then optimize for performance and cost as you scale!