Data Freshness & Lifecycle
Re-embedding strategies · TTLs · Stale chunk detection
Master data freshness and lifecycle management with free flashcards and spaced repetition practice. This lesson covers data staleness detection, incremental updates, and Time-To-Live (TTL) strategies: essential concepts for building production-grade AI search and RAG systems that deliver accurate, timely results.
Welcome to Data Freshness & Lifecycle
In modern AI search and RAG (Retrieval-Augmented Generation) systems, data freshness isn't just a nice-to-have feature; it's a critical requirement. Imagine a customer service chatbot providing outdated product information, or a legal research assistant citing superseded regulations. The consequences range from user frustration to serious compliance issues.
Data freshness refers to how current and up-to-date your indexed data is compared to the source of truth. Data lifecycle management encompasses the entire journey of data from ingestion through updates, archival, and eventual deletion. Together, these concepts ensure your AI system remains accurate, relevant, and trustworthy over time.
Why does this matter? Consider these scenarios:
- News aggregation: Articles from last week might be worthless for breaking news queries
- E-commerce: Outdated pricing or inventory data leads to failed transactions
- Healthcare: Stale patient records could impact treatment decisions
- Documentation: Users finding deprecated API references waste hours debugging
Core Concepts
Understanding Data Staleness
Data staleness is the time gap between when source data changes and when those changes reflect in your search index. Every RAG system has some staleness; the question is whether it's acceptable for your use case.
| Use Case | Acceptable Staleness | Impact of Stale Data |
|---|---|---|
| Real-time stock trading | Milliseconds | Critical - financial losses |
| Social media feeds | Seconds to minutes | Moderate - user experience |
| Product catalogs | Hours | Moderate - conversion rates |
| Legal documents | Days | Significant - compliance risk |
| Historical archives | Weeks to months | Low - reference material |
Staleness metrics you should track:
- Index lag: Time between source update and index availability
- Query staleness: Age of the oldest document returned in results
- Coverage freshness: Percentage of documents updated within SLA
Pro tip: Don't just measure average staleness; track the 99th percentile. A few extremely stale documents can severely impact user trust.
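As a rough sketch of how these metrics can be computed from per-document data (the source_updated_at and indexed_at field names are assumptions, not part of any particular index schema):

def staleness_report(documents, percentile=0.99):
    # Per-document lag between the source update and the moment it became searchable
    lags = sorted(
        (doc["indexed_at"] - doc["source_updated_at"]).total_seconds()
        for doc in documents
    )
    p99_position = min(int(len(lags) * percentile), len(lags) - 1)
    return {
        "avg_index_lag_s": sum(lags) / len(lags),
        "p99_index_lag_s": lags[p99_position],
        "worst_index_lag_s": lags[-1],
    }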
Time-To-Live (TTL) Strategies
TTL defines how long data remains valid before requiring refresh or removal. Think of it as an expiration date for your indexed content.
TTL LIFECYCLE: Ingest (t=0) → Valid → Stale (t=TTL) → Delete (t=TTL+grace)
TTL implementation approaches:
Document-level TTL: Each document carries its own expiration timestamp
- Flexible for mixed content types
- Stored as a metadata field, e.g. expires_at: 2026-06-01T00:00:00Z
- Index queries filter out expired docs automatically
Collection-level TTL: All documents in a collection share the same TTL
- Simpler to manage
- Good for uniform content (e.g., session data, cache entries)
- Example: "All chat history expires after 30 days"
Sliding window TTL: TTL resets on each access
- Keeps frequently used data fresh
- Automatically purges unused content
- Common in cache systems
Conditional TTL: Expiration depends on document properties
- Premium content: 90 days, Free content: 7 days
- Active users: no expiry, Inactive users: 180 days
Implementation example (pseudocode):
from datetime import datetime, timedelta

class Document:
    def __init__(self, content, ttl_seconds):
        self.content = content
        self.created_at = datetime.now()
        self.expires_at = self.created_at + timedelta(seconds=ttl_seconds)

    def is_valid(self):
        # Still fresh while the expiry lies in the future
        return datetime.now() < self.expires_at

    def time_until_expiry(self):
        # Negative values mean the document has already expired
        return (self.expires_at - datetime.now()).total_seconds()
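The conditional variant listed above can be layered on top of the same class. A minimal sketch, with purely illustrative tiers and durations:

def conditional_ttl_seconds(doc):
    # Illustrative policy mirroring the examples above; not a recommendation
    if doc.type == "content":
        return 90 * 86400 if doc.is_premium else 7 * 86400
    if doc.type == "user_profile":
        return None if doc.user_active else 180 * 86400  # None = never expires
    return 30 * 86400  # default fallback for everything else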
Common pitfall: Setting TTL too aggressively can cause unnecessary reindexing load. Too conservative, and you serve stale data. Balance is key!
Incremental vs Full Reindexing
When source data changes, you have two fundamental approaches to update your index:
Full reindexing:
- Rebuilds the entire index from scratch
- Guarantees consistency
- Resource-intensive and slow
- Suitable for: major schema changes, quarterly refreshes, small datasets
Incremental updates:
- Only processes changed documents
- Fast and efficient
- Requires change detection mechanism
- Suitable for: real-time systems, large datasets, frequent updates
| Aspect | Full Reindex | Incremental Update |
|---|---|---|
| Speed | Slow (hours/days) | Fast (seconds/minutes) |
| Complexity | Simple | Complex (change tracking) |
| Resource cost | High | Low |
| Consistency | Guaranteed | Eventual |
| Downtime | Possible | None (zero-downtime) |
Change detection mechanisms:
Timestamps: Track a last_modified field
SELECT * FROM documents WHERE last_modified > '2026-05-01'
- Simple and effective
- Requires source system to maintain timestamps
Version numbers: Monotonically increasing integers
SELECT * FROM documents WHERE version > :last_indexed_version
- Atomic and reliable
- Good for database systems with built-in versioning
Change Data Capture (CDC): Stream of changes from database logs
- Real-time updates
- Captures all operations (insert/update/delete)
- Requires CDC infrastructure (Debezium, AWS DMS)
Checksums/Hashes: Compare content fingerprints
current_hash = hashlib.sha256(document.encode()).hexdigest()
if current_hash != indexed_hash:
    update_index(document)
- Detects any content change
- CPU-intensive for large documents
Event-driven updates: Source system publishes change events
- Push-based (no polling)
- Requires event streaming (Kafka, RabbitMQ)
Hybrid approach: Use incremental updates for routine changes, schedule full reindex weekly/monthly to catch any inconsistencies.
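To make this concrete, here is a minimal incremental-sync sketch combining timestamp-based change detection with content hashes to skip no-op writes; source_db and index are placeholders for your own database client and search index, not a specific library:

import hashlib
from datetime import datetime, timezone

def incremental_sync(source_db, index, last_sync_time):
    # 1. Timestamp-based change detection: only rows touched since the last run
    changed = source_db.query(
        "SELECT id, content FROM documents WHERE last_modified > ?",
        (last_sync_time,),
    )
    for row in changed:
        # 2. Checksum comparison: skip documents whose content is unchanged
        new_hash = hashlib.sha256(row["content"].encode()).hexdigest()
        if index.get_hash(row["id"]) != new_hash:
            index.upsert(row["id"], row["content"], content_hash=new_hash)
    # The return value becomes last_sync_time for the next run
    return datetime.now(timezone.utc)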
Update Propagation Patterns
Update propagation is how changes flow from source systems through your data pipeline to the search index:
UPDATE PROPAGATION PATTERNS

1. BATCH PROCESSING (scheduled intervals)
   Source DB → [Wait (hours)] → Extract → Transform → Index
                                (Queue)    (ETL)      (Bulk API)
   Latency: Hours to days
   Cost: Low
   Complexity: Simple

2. MICRO-BATCH (small frequent batches)
   Source DB → [Wait (minutes)] → Extract → Transform → Index
                                  (Stream)   (Lambda)    (Batch)
   Latency: Minutes
   Cost: Medium
   Complexity: Medium

3. STREAMING (near real-time)
   Source DB → CDC → Kafka → Processor → Index
              (Log)  (Stream) (Transform) (Single)
   Latency: Seconds
   Cost: High
   Complexity: High

4. SYNCHRONOUS (inline updates)
   App → Database → Webhook → Index
   (writes)         (Trigger)  (Update)
   Latency: Milliseconds
   Cost: Highest
   Complexity: Highest
Choosing the right pattern:
- Batch: Analytics, historical archives, non-critical updates
- Micro-batch: News feeds, social media, moderate freshness needs
- Streaming: Financial data, inventory, high-value real-time apps
- Synchronous: Critical operations, small scale, strong consistency
Try this: Map your use cases to update patterns, as in the routing sketch below. If you have mixed requirements, implement multiple patterns for different data types.
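One lightweight way to encode that mapping is a routing table consulted at ingestion time; the data types, pattern assignments, and dispatch helper below are purely illustrative:

# Illustrative mapping of data types to update patterns
UPDATE_PATTERNS = {
    "inventory": "streaming",     # high-value, needs seconds-level freshness
    "news_feed": "micro_batch",   # minutes of staleness is acceptable
    "archives":  "batch",         # non-critical, refreshed on a schedule
    "payments":  "synchronous",   # strong consistency required
}

def route_update(doc_type, change):
    pattern = UPDATE_PATTERNS.get(doc_type, "batch")  # cheapest path by default
    dispatch(pattern, change)  # hypothetical dispatcher, one pipeline per pattern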
Data Versioning & Rollback
Data versioning maintains historical snapshots of your index, enabling rollback when issues arise.
Why version your index?
- Bug recovery: Roll back if pipeline introduces corrupted data
- A/B testing: Compare search quality across index versions
- Compliance: Maintain audit trail of data changes
- Safe deployments: Test new indexing logic against production snapshot
Versioning strategies:
Index aliasing: Point alias to active version
products_v1   ← products (alias)
products_v2
products_v3
# Switch traffic: products (alias) → products_v3
- Zero-downtime switching (see the alias-swap sketch below)
- Instant rollback
- Doubles storage temporarily
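The switch itself can be expressed as one atomic call. This sketch mimics the shape of Elasticsearch's alias actions, but the client.update_aliases method is a hypothetical stand-in rather than a specific library API:

def promote_index(client, alias, old_index, new_index):
    # Repoint the alias in a single operation so queries never see a gap
    client.update_aliases(actions=[
        {"remove": {"index": old_index, "alias": alias}},
        {"add":    {"index": new_index, "alias": alias}},
    ])
    # Rollback is the same call with old_index and new_index swapped

# e.g. promote_index(search_client, "products", "products_v2", "products_v3")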
Snapshot-based: Periodic backups
daily_snapshot_2026_05_20
daily_snapshot_2026_05_21
daily_snapshot_2026_05_22
- Point-in-time recovery
- Storage-efficient (incremental)
- Restore takes time
Document versioning: Store version history per document
{ "id": "doc123", "current_version": 5, "versions": [ {"v": 1, "content": "...", "timestamp": "..."}, {"v": 2, "content": "...", "timestamp": "..."}, {"v": 5, "content": "...", "timestamp": "..."} ] }- Granular history
- High storage cost
- Enables temporal queries ("What did this document say last month?")
Retention policies (a pruning sketch follows this list):
- Keep last N versions
- Retain versions for X days
- Compress old versions
- Archive to cold storage
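A pruning pass for the first two rules might look like this sketch; store is a placeholder for whatever holds per-document version history, and the defaults are arbitrary:

from datetime import datetime, timedelta, timezone

def prune_versions(store, doc_id, keep_last=5, max_age_days=365):
    versions = store.get_versions(doc_id)  # assumed newest-first
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    for position, version in enumerate(versions):
        # Drop a version only if it is beyond the keep-last window AND older than the cutoff
        if position >= keep_last and version.timestamp < cutoff:
            store.delete_version(doc_id, version.id)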
Warning: Versioning multiplies storage costs. Budget accordingly and implement aggressive retention policies.
Deletion & Archival Strategies
Hard deletion removes data permanently. Soft deletion marks data as deleted but retains it. Archival moves data to cheaper storage.
| Strategy | Use Case | Pros | Cons |
|---|---|---|---|
| Hard delete | GDPR compliance, sensitive data | Frees storage, meets regulations | Irreversible, no recovery |
| Soft delete | User content, accidental deletion | Recoverable, audit trail | Index bloat, query complexity |
| Archival | Old data, compliance retention | Cost-efficient, accessible | Slower retrieval |
Soft deletion implementation:
{
"id": "doc456",
"content": "Some content...",
"deleted": true,
"deleted_at": "2026-05-15T10:30:00Z",
"deleted_by": "user789"
}
Query filtering:
// Exclude soft-deleted by default
query = {
bool: {
must: { match: { content: searchTerms } },
filter: { term: { deleted: false } }
}
}
Archival tiers (hot → warm → cold):
| Tier | Age | Storage | Cost | Query latency |
|---|---|---|---|---|
| Hot | 0-30 days | Fast SSDs, high IOPS | $$$$ per GB | <50ms |
| Warm | 31-90 days | Standard storage, moderate IOPS | $$ per GB | <500ms |
| Cold | 90+ days | Archive storage, low IOPS | $ per GB | Seconds |
| Glacier | 1+ years | Deep archive, rare access | ¢ per GB | Hours (retrieval) |
Lifecycle automation:
lifecycle_policy:
- rule: move_to_warm
condition: age > 30_days
action: transition_to_warm_storage
- rule: move_to_cold
condition: age > 90_days AND access_count < 10
action: transition_to_cold_storage
- rule: archive
condition: age > 365_days
action: move_to_glacier
- rule: purge
condition: age > 2555_days # 7 years
action: permanent_delete
Cost optimization: Moving data from hot to cold storage can reduce costs by 90%+. Automate transitions based on access patterns.
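A scheduled job applying the policy above could be sketched as follows; index.scan, storage.transition, and the document fields are placeholders for your platform's own APIs:

from datetime import datetime, timezone

def apply_lifecycle(index, storage):
    now = datetime.now(timezone.utc)
    for doc in index.scan():
        age_days = (now - doc.created_at).days
        if age_days > 2555:                            # ~7 years: purge permanently
            index.delete(doc.id)
        elif age_days > 365:                           # archive to glacier
            storage.transition(doc.id, tier="glacier")
        elif age_days > 90 and doc.access_count < 10:  # cold: old and rarely accessed
            storage.transition(doc.id, tier="cold")
        elif age_days > 30:                            # warm after the hot window
            storage.transition(doc.id, tier="warm")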
Real-World Examples
Example 1: E-commerce Product Catalog
Scenario: An online retailer with 10 million products needs to keep pricing and inventory accurate.
Challenges:
- Prices change thousands of times per hour (dynamic pricing)
- Inventory updates in real-time as orders process
- Product descriptions updated by merchants irregularly
- Seasonal products need automatic archival
Solution architecture:
Multi-tier update strategy:
- Critical data (price, stock): Streaming updates via CDC
- Semi-critical (ratings, reviews): Micro-batch every 5 minutes
- Static content (descriptions, specs): Batch overnight
TTL implementation:
{ "product_id": "SHOE-123", "price": 89.99, "price_ttl": 300, // 5 minutes for dynamic pricing "inventory": 47, "inventory_ttl": 60, // 1 minute - critical "description": "Comfortable running shoes...", "description_ttl": 86400 // 24 hours - rarely changes }Change detection:
- Database triggers on price and inventory columns
- Events published to Kafka topic
- Stream processor updates index in < 2 seconds
Archival policy:
- Discontinued products → soft delete after 30 days
- Soft deleted → move to cold storage after 90 days
- Seasonal items (e.g., Halloween costumes) → auto-archive post-season
Results:
- 99.9% of searches return accurate pricing
- Reduced "out of stock" complaints by 85%
- Storage costs reduced 40% via archival
Example 2: News Aggregation Platform
Scenario: A news aggregator ingesting 100,000+ articles daily from various sources.
Challenges:
- Breaking news must appear instantly
- Old news loses value rapidly ("yesterday's news")
- Different content types age differently
- Storage costs for historical archive
Solution architecture:
Content-aware TTL:
def calculate_ttl(article):
    base_ttl = 7 * 24 * 3600  # 7 days
    if article.category == 'breaking':
        return 24 * 3600           # 1 day (high churn)
    elif article.category == 'analysis':
        return 30 * 24 * 3600      # 30 days (evergreen)
    elif article.category == 'sports':
        return 3 * 24 * 3600       # 3 days (event-driven)
    # Adjust based on engagement
    if article.views > 10000:
        return base_ttl * 2        # Popular content stays longer
    return base_ttl

Time-decay scoring:
// Boost recent articles in search ranking
score = relevance_score * time_decay_factor
time_decay_factor = exp(-lambda * age_hours)
// lambda = 0.1 means half-life of ~7 hours

Tiered storage:
- Today (hot): Full-text search, fast retrieval
- This week (warm): Searchable, acceptable latency
- This month (cold): Compressed, slower retrieval
- Archive (glacier): Compliance only, no search
Automatic lifecycle:
- Age 0-24h: HOT tier, prominent in feeds
- Age 1-7d: WARM tier, normal search results
- Age 7-30d: COLD tier, "older results" section
- Age 30d+: ARCHIVE tier, excluded from search (available via URL)
Results:
- Breaking news indexed in < 5 seconds
- 70% storage reduction vs. keeping all content hot
- User satisfaction up 25% (more relevant results)
Example 3: Enterprise Knowledge Base
Scenario: A company wiki with 500,000 documents, frequent edits, and strict compliance requirements.
Challenges:
- Must maintain edit history for audit (7 years)
- Outdated documentation causes support tickets
- Different departments have different freshness needs
- GDPR requires ability to purge employee data
Solution architecture:
Document versioning:
{ "doc_id": "KB-4567", "title": "Password Reset Procedure", "current_version": 12, "content": "<current content>", "versions": [ {"v": 1, "author": "alice", "date": "2019-03-15", "hash": "abc123"}, {"v": 12, "author": "bob", "date": "2026-05-10", "hash": "def456"} ], "last_reviewed": "2026-05-10", "review_due": "2026-08-10" // Quarterly review }Freshness indicators:
- Flag documents not reviewed in 90 days with a warning
- Auto-email doc owners 30 days before review due
- Surface "potentially outdated" banner in search results
Departmental policies:
- Engineering docs: Review every 3 months (fast-moving)
- HR policies: Review annually (stable)
- Legal: Review only on regulation change
Compliance-ready deletion:
def gdpr_delete(user_id):
    # Find all documents authored/edited by user
    docs = index.query({"author": user_id})
    for doc in docs:
        # Keep document structure, anonymize author
        doc.author = "[REDACTED]"
        doc.author_email = None
        # Anonymize the author in version history as well
        for version in doc.versions:
            if version.author == user_id:
                version.author = "[REDACTED]"
        index.update(doc)
    # Hard delete personal profile
    index.delete({"type": "user_profile", "id": user_id})

Incremental updates with validation:
- Authors edit via web interface
- On save, webhook triggers index update
- Validation rules check:
- No broken internal links
- Required metadata present
- Proper approval for policy docs
- Failed validation β don't index, notify author
Results:
- Support tickets for "incorrect documentation" down 60%
- 100% compliance with data retention policies
- Average document age reduced from 18 months to 4 months
Example 4: Medical Research Database
Scenario: A RAG system answering questions from published medical literature (millions of papers).
Challenges:
- Papers are sometimes retracted (must remove immediately)
- Preprints vs. peer-reviewed have different authority levels
- New research can contradict older findings
- Citations and impact evolve over time
Solution architecture:
Multi-version indexing:
{ "paper_id": "doi:10.1234/example", "versions": [ { "version": "preprint_v1", "status": "preprint", "date": "2025-01-15", "indexed": true, "boost": 0.5 // Lower ranking for preprints }, { "version": "published", "status": "peer_reviewed", "date": "2025-06-20", "indexed": true, "boost": 1.0 } ], "retracted": false, "citation_count": 47, "last_citation_update": "2026-05-15" }Retraction handling:
def handle_retraction(paper_id, retraction_notice):
    # Immediate hard delete from active index
    index.delete(paper_id)
    # Move to "retracted" collection with notice
    retracted_index.add({
        "original_id": paper_id,
        "retraction_date": datetime.now(),
        "retraction_reason": retraction_notice,
        "original_content": get_snapshot(paper_id)
    })
    # Update citing papers with warning
    citing_papers = find_citations_to(paper_id)
    for citing in citing_papers:
        citing.add_warning(f"References retracted paper {paper_id}")
        index.update(citing)

Dynamic relevance scoring:
- Recent papers get recency boost (last 2 years)
- High citation count increases authority
- Peer-reviewed > preprints
- Retractions immediately zeroed out (see the scoring sketch below)
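A scoring function combining those signals might look like the sketch below; the weights, the two-year recency window, and the published_at field are illustrative assumptions layered on the multi-version document shown earlier:

import math
from datetime import datetime, timezone

def rank_score(paper, relevance):
    if paper["retracted"]:
        return 0.0  # retracted papers are zeroed out immediately
    # Peer-reviewed versions carry a higher boost than preprints
    score = relevance * max(v["boost"] for v in paper["versions"] if v["indexed"])
    # Recency boost for papers from roughly the last two years
    age_years = (datetime.now(timezone.utc) - paper["published_at"]).days / 365
    if age_years < 2:
        score *= 1.2
    # Citation count as a mild, log-scaled authority signal
    score *= 1 + 0.1 * math.log1p(paper["citation_count"])
    return score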
Citation freshness:
- Weekly batch job updates citation counts from external APIs
- Triggers reindexing of papers with significant citation change (>20%)
- Enables "trending" queries (papers gaining attention)
Compliance and ethics:
- Patient data: immediate hard delete on request (HIPAA)
- Author opt-out: soft delete + anonymize
- Clinical trial data: retain for legal minimum (10 years)
Results:
- Zero incidents of citing retracted papers
- 95% of results from last 5 years (appropriate for fast-moving field)
- Compliance with medical data regulations
Common Mistakes to Avoid
1. Ignoring Partial Update Failures
Wrong approach: Assume all updates succeed
for doc in updated_docs:
index.update(doc) # What if this fails halfway?
Correct approach: Track failures and implement retry logic
failed_updates = []
for doc in updated_docs:
try:
index.update(doc)
except Exception as e:
failed_updates.append({"doc": doc, "error": e})
log_failure(doc.id, e)
if failed_updates:
retry_with_backoff(failed_updates)
Why it matters: A partial batch failure can leave your index in an inconsistent state. Users see some updated content, some stale, undermining trust.
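The retry_with_backoff helper referenced above isn't defined in this lesson; a minimal sketch with exponential backoff and a dead-letter list could look like this:

import time

def retry_with_backoff(failed_updates, max_attempts=5, base_delay=1.0):
    dead_letter = []
    for item in failed_updates:
        for attempt in range(max_attempts):
            try:
                index.update(item["doc"])
                break
            except Exception:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        else:
            dead_letter.append(item)  # retries exhausted; hand off to a dead-letter queue
    return dead_letter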
2. Not Accounting for Clock Skew
Wrong approach: Compare timestamps across distributed systems
if source_timestamp > indexed_timestamp:
update_index()
Correct approach: Use version numbers or vector clocks
if source_version > indexed_version:
update_index()
Why it matters: Servers have clock drift. A source server 5 minutes ahead could make fresh data appear stale, causing unnecessary reindexing.
3. Aggressive TTL Without Grace Period
Wrong approach: Hard cutoff at TTL
if age > ttl:
delete_from_index(doc)
Correct approach: Grace period + soft warning
if age > ttl:
doc.mark_as_stale()
doc.reduce_ranking_boost()
if age > ttl + grace_period:
delete_from_index(doc)
Why it matters: If your refresh pipeline has a hiccup, aggressive TTL could delete all your data. Grace periods prevent catastrophic data loss.
4. Forgetting to Propagate Deletes
Wrong approach: Only sync additions and updates
for doc in get_changed_docs():
if doc.exists():
index.upsert(doc)
# Deletes never reach the index!
Correct approach: Explicitly handle deletions
for change in get_change_stream():
if change.type == 'INSERT' or change.type == 'UPDATE':
index.upsert(change.doc)
elif change.type == 'DELETE':
index.delete(change.doc_id)
Why it matters: Deleted source documents lingering in search results frustrate users and waste compute resources.
5. No Monitoring of Data Freshness
Wrong approach: Assume pipeline is working
Correct approach: Active monitoring and alerting
metrics = {
"avg_index_lag": calculate_avg_lag(),
"p99_staleness": get_percentile(99, doc_ages),
"failed_updates_count": count_failures_last_hour(),
"oldest_document_age": max(doc_ages)
}
if metrics["p99_staleness"] > SLA_THRESHOLD:
alert_oncall("Data freshness SLA breach")
Why it matters: You can't fix what you don't measure. Silent failures mean users get stale data without anyone knowing.
6. Overindexing Volatile Data
Wrong approach: Reindex on every tiny change
# Stock price updates every 100ms
for price_update in price_stream:
index.update_document(ticker, new_price)
# Thrashing the index!
Correct approach: Batch rapid updates or use cache
# Update index every 5 seconds with latest price
price_buffer = {}

def buffer_price_update(ticker, price):
    price_buffer[ticker] = price

def flush_prices():  # run this on a 5-second schedule
    index.bulk_update(price_buffer)
    price_buffer.clear()
Why it matters: Excessive indexing wastes CPU/IO and can actually increase query latency due to lock contention.
7. Inconsistent TTL Across Pipelines
Wrong approach: Different TTL settings in different systems
# Cache layer
ttl: 3600  # 1 hour
# Index layer
ttl: 7200  # 2 hours
# API response cache
ttl: 1800  # 30 minutes
Correct approach: Centralized TTL configuration
# config.yaml - single source of truth
ttl_policies:
product_data: 3600
user_profiles: 7200
session_data: 1800
# All systems read from same config
Why it matters: Inconsistent TTLs create confusion because users see different data freshness depending on which path serves their request.
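A small loader keeps every layer reading the same values; this sketch assumes PyYAML and the config.yaml layout shown above:

import yaml  # PyYAML

def load_ttl(policy_name, config_path="config.yaml"):
    # Single source of truth consulted by the cache, index, and API layers alike
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return config["ttl_policies"][policy_name]

# e.g. cache.set(key, value, ttl=load_ttl("product_data"))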
Key Takeaways
Data freshness is a spectrum, not binary. Define acceptable staleness for each data type based on business requirements.
TTL policies prevent index bloat and ensure users see relevant content. Implement document-level TTL for flexibility.
Incremental updates are essential for large-scale systems. Use CDC, timestamps, or version numbers to track changes efficiently.
Update propagation latency ranges from milliseconds (synchronous) to days (batch). Choose based on freshness requirements and resource constraints.
Version your indexes to enable safe rollbacks and A/B testing. Use index aliasing for zero-downtime switches.
Implement tiered storage (hot/warm/cold) to optimize costs. Automate lifecycle transitions based on age and access patterns.
Soft deletes provide safety nets for accidental deletions. Hard deletes are for compliance (GDPR) and sensitive data.
Monitor freshness metrics actively: index lag, staleness percentiles, failed updates. Alert when SLAs breach.
Plan for partial failures. Distributed systems fail in creative waysβimplement retries, idempotency, and dead-letter queues.
Balance freshness and cost. Real-time updates are expensive. Not every data type needs millisecond freshness.
Quick Reference Card: Data Freshness Checklist
| Checklist item | What to do |
|---|---|
| Define Staleness SLAs | Set acceptable lag for each data type (seconds/hours/days) |
| Implement Change Detection | Timestamps, versions, CDC, or checksums |
| Choose Update Pattern | Batch, micro-batch, streaming, or synchronous |
| Set TTL Policies | Document-level expiration based on content type |
| Enable Versioning | Index aliasing or snapshots for rollback capability |
| Configure Archival | Hot → warm → cold → glacier transitions |
| Handle Deletes | Soft delete by default, hard delete for compliance |
| Monitor Metrics | Index lag, staleness p99, failed updates, oldest doc age |
| Plan for Failures | Retries, dead-letter queues, alerting |
| Test Rollback | Regularly verify you can restore from backups/versions |
Further Study
- Elasticsearch Data Lifecycle Management: https://www.elastic.co/guide/en/elasticsearch/reference/current/data-lifecycle.html
- Change Data Capture Patterns: https://debezium.io/documentation/reference/architecture.html
- AWS Data Lifecycle Best Practices: https://aws.amazon.com/blogs/storage/optimizing-your-storage-costs-with-amazon-s3/
Mnemonic for update patterns: "Bob Makes Silly Sandwiches" = Batch, Micro-batch, Streaming, Synchronous
Next steps: Practice implementing a TTL policy in your chosen vector database. Start with a simple document-level expiration field and expand to automated archival as you gain confidence.