Data Freshness & Lifecycle
Re-embedding strategies · TTLs · Stale chunk detection
Master data freshness and lifecycle management with free flashcards and spaced repetition practice. This lesson covers data staleness detection, incremental updates, and Time-To-Live (TTL) strategies: essential concepts for building production-grade AI search and RAG systems that deliver accurate, timely results.
Welcome to Data Freshness & Lifecycle
In modern AI search and RAG (Retrieval-Augmented Generation) systems, data freshness isn't just a nice-to-have feature; it's a critical requirement. Imagine a customer service chatbot providing outdated product information, or a legal research assistant citing superseded regulations. The consequences range from user frustration to serious compliance issues.
Data freshness refers to how current and up-to-date your indexed data is compared to the source of truth. Data lifecycle management encompasses the entire journey of data from ingestion through updates, archival, and eventual deletion. Together, these concepts ensure your AI system remains accurate, relevant, and trustworthy over time.
Why does this matter? Consider these scenarios:
- News aggregation: Articles from last week might be worthless for breaking news queries
- E-commerce: Outdated pricing or inventory data leads to failed transactions
- Healthcare: Stale patient records could impact treatment decisions
- Documentation: Users finding deprecated API references waste hours debugging
Core Concepts
Understanding Data Staleness
Data staleness is the time gap between when source data changes and when those changes reflect in your search index. Every RAG system has some staleness; the question is whether it's acceptable for your use case.
| Use Case | Acceptable Staleness | Impact of Stale Data |
|---|---|---|
| Real-time stock trading | Milliseconds | Critical - financial losses |
| Social media feeds | Seconds to minutes | Moderate - user experience |
| Product catalogs | Hours | Moderate - conversion rates |
| Legal documents | Days | Significant - compliance risk |
| Historical archives | Weeks to months | Low - reference material |
Staleness metrics you should track:
- Index lag: Time between source update and index availability
- Query staleness: Age of the oldest document returned in results
- Coverage freshness: Percentage of documents updated within SLA
Pro tip: Don't just measure average staleness; track the 99th percentile. A few extremely stale documents can severely impact user trust.
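As a rough sketch of how these metrics can be computed from per-document data (the source_updated_at and indexed_at field names are assumptions, not part of any particular index schema):

def staleness_report(documents, percentile=0.99):
    # Per-document lag between the source update and the moment it became searchable
    lags = sorted(
        (doc["indexed_at"] - doc["source_updated_at"]).total_seconds()
        for doc in documents
    )
    p99_position = min(int(len(lags) * percentile), len(lags) - 1)
    return {
        "avg_index_lag_s": sum(lags) / len(lags),
        "p99_index_lag_s": lags[p99_position],
        "worst_index_lag_s": lags[-1],
    }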
Time-To-Live (TTL) Strategies
TTL defines how long data remains valid before requiring refresh or removal. Think of it as an expiration date for your indexed content.
TTL LIFECYCLE: Ingest (t=0) → Valid → Stale (t=TTL) → Delete (t=TTL+grace)
TTL implementation approaches:
Document-level TTL: Each document carries its own expiration timestamp
- Flexible for mixed content types
- Stored as a metadata field, e.g. expires_at: 2026-06-01T00:00:00Z
- Index queries filter out expired docs automatically
Collection-level TTL: All documents in a collection share the same TTL
- Simpler to manage
- Good for uniform content (e.g., session data, cache entries)
- Example: "All chat history expires after 30 days"
Sliding window TTL: TTL resets on each access
- Keeps frequently used data fresh
- Automatically purges unused content
- Common in cache systems
Conditional TTL: Expiration depends on document properties
- Premium content: 90 days, Free content: 7 days
- Active users: no expiry, Inactive users: 180 days
Implementation example (pseudocode):
from datetime import datetime, timedelta

class Document:
    def __init__(self, content, ttl_seconds):
        self.content = content
        self.created_at = datetime.now()
        self.expires_at = self.created_at + timedelta(seconds=ttl_seconds)

    def is_valid(self):
        # Still fresh while the expiry lies in the future
        return datetime.now() < self.expires_at

    def time_until_expiry(self):
        # Negative values mean the document has already expired
        return (self.expires_at - datetime.now()).total_seconds()
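The conditional variant listed above can be layered on top of the same class. A minimal sketch, with purely illustrative tiers and durations:

def conditional_ttl_seconds(doc):
    # Illustrative policy mirroring the examples above; not a recommendation
    if doc.type == "content":
        return 90 * 86400 if doc.is_premium else 7 * 86400
    if doc.type == "user_profile":
        return None if doc.user_active else 180 * 86400  # None = never expires
    return 30 * 86400  # default fallback for everything else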
Common pitfall: Setting TTL too aggressively can cause unnecessary reindexing load. Too conservative, and you serve stale data. Balance is key!
Incremental vs Full Reindexing
When source data changes, you have two fundamental approaches to update your index:
Full reindexing:
- Rebuilds the entire index from scratch
- Guarantees consistency
- Resource-intensive and slow
- Suitable for: major schema changes, quarterly refreshes, small datasets
Incremental updates:
- Only processes changed documents
- Fast and efficient
- Requires change detection mechanism
- Suitable for: real-time systems, large datasets, frequent updates
| Aspect | Full Reindex | Incremental Update |
|---|---|---|
| Speed | Slow (hours/days) | Fast (seconds/minutes) |
| Complexity | Simple | Complex (change tracking) |
| Resource cost | High | Low |
| Consistency | Guaranteed | Eventual |
| Downtime | Possible | None (zero-downtime) |
Change detection mechanisms:
Timestamps: Track a last_modified field
SELECT * FROM documents WHERE last_modified > '2026-05-01'
- Simple and effective
- Requires source system to maintain timestamps
Version numbers: Monotonically increasing integers
SELECT * FROM documents WHERE version > :last_indexed_version
- Atomic and reliable
- Good for database systems with built-in versioning
Change Data Capture (CDC): Stream of changes from database logs
- Real-time updates
- Captures all operations (insert/update/delete)
- Requires CDC infrastructure (Debezium, AWS DMS)
Checksums/Hashes: Compare content fingerprints
current_hash = hashlib.sha256(document.encode()).hexdigest()
if current_hash != indexed_hash:
    update_index(document)
- Detects any content change
- CPU-intensive for large documents
Event-driven updates: Source system publishes change events
- Push-based (no polling)
- Requires event streaming (Kafka, RabbitMQ)
Hybrid approach: Use incremental updates for routine changes, schedule full reindex weekly/monthly to catch any inconsistencies.
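To make this concrete, here is a minimal incremental-sync sketch combining timestamp-based change detection with content hashes to skip no-op writes; source_db and index are placeholders for your own database client and search index, not a specific library:

import hashlib
from datetime import datetime, timezone

def incremental_sync(source_db, index, last_sync_time):
    # 1. Timestamp-based change detection: only rows touched since the last run
    changed = source_db.query(
        "SELECT id, content FROM documents WHERE last_modified > ?",
        (last_sync_time,),
    )
    for row in changed:
        # 2. Checksum comparison: skip documents whose content is unchanged
        new_hash = hashlib.sha256(row["content"].encode()).hexdigest()
        if index.get_hash(row["id"]) != new_hash:
            index.upsert(row["id"], row["content"], content_hash=new_hash)
    # The return value becomes last_sync_time for the next run
    return datetime.now(timezone.utc)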
Update Propagation Patterns
Update propagation is how changes flow from source systems through your data pipeline to the search index:
UPDATE PROPAGATION PATTERNS

1. BATCH PROCESSING (scheduled intervals)
   Source DB → [Wait (hours)] → Extract → Transform → Index
                                (Queue)    (ETL)      (Bulk API)
   Latency: Hours to days
   Cost: Low
   Complexity: Simple

2. MICRO-BATCH (small frequent batches)
   Source DB → [Wait (minutes)] → Extract → Transform → Index
                                  (Stream)   (Lambda)    (Batch)
   Latency: Minutes
   Cost: Medium
   Complexity: Medium

3. STREAMING (near real-time)
   Source DB → CDC → Kafka → Processor → Index
              (Log)  (Stream) (Transform) (Single)
   Latency: Seconds
   Cost: High
   Complexity: High

4. SYNCHRONOUS (inline updates)
   App → Database → Webhook → Index
   (writes)         (Trigger)  (Update)
   Latency: Milliseconds
   Cost: Highest
   Complexity: Highest
Choosing the right pattern:
- Batch: Analytics, historical archives, non-critical updates
- Micro-batch: News feeds, social media, moderate freshness needs
- Streaming: Financial data, inventory, high-value real-time apps
- Synchronous: Critical operations, small scale, strong consistency
Try this: Map your use cases to update patterns, as in the routing sketch below. If you have mixed requirements, implement multiple patterns for different data types.
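One lightweight way to encode that mapping is a routing table consulted at ingestion time; the data types, pattern assignments, and dispatch helper below are purely illustrative:

# Illustrative mapping of data types to update patterns
UPDATE_PATTERNS = {
    "inventory": "streaming",     # high-value, needs seconds-level freshness
    "news_feed": "micro_batch",   # minutes of staleness is acceptable
    "archives":  "batch",         # non-critical, refreshed on a schedule
    "payments":  "synchronous",   # strong consistency required
}

def route_update(doc_type, change):
    pattern = UPDATE_PATTERNS.get(doc_type, "batch")  # cheapest path by default
    dispatch(pattern, change)  # hypothetical dispatcher, one pipeline per pattern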
Data Versioning & Rollback
Data versioning maintains historical snapshots of your index, enabling rollback when issues arise.
Why version your index?
- Bug recovery: Roll back if pipeline introduces corrupted data
- A/B testing: Compare search quality across index versions
- Compliance: Maintain audit trail of data changes
- Safe deployments: Test new indexing logic against production snapshot
Versioning strategies:
Index aliasing: Point alias to active version
products_v1   ← products (alias)
products_v2
products_v3
# Switch traffic: products (alias) → products_v3
- Zero-downtime switching (see the alias-swap sketch below)
- Instant rollback
- Doubles storage temporarily
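The switch itself can be expressed as one atomic call. This sketch mimics the shape of Elasticsearch's alias actions, but the client.update_aliases method is a hypothetical stand-in rather than a specific library API:

def promote_index(client, alias, old_index, new_index):
    # Repoint the alias in a single operation so queries never see a gap
    client.update_aliases(actions=[
        {"remove": {"index": old_index, "alias": alias}},
        {"add":    {"index": new_index, "alias": alias}},
    ])
    # Rollback is the same call with old_index and new_index swapped

# e.g. promote_index(search_client, "products", "products_v2", "products_v3")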
Snapshot-based: Periodic backups
daily_snapshot_2026_05_20
daily_snapshot_2026_05_21
daily_snapshot_2026_05_22
- Point-in-time recovery
- Storage-efficient (incremental)
- Restore takes time
Document versioning: Store version history per document
{ "id": "doc123", "current_version": 5, "versions": [ {"v": 1, "content": "...", "timestamp": "..."}, {"v": 2, "content": "...", "timestamp": "..."}, {"v": 5, "content": "...", "timestamp": "..."} ] }- Granular history
- High storage cost
- Enables temporal queries ("What did this document say last month?")
Retention policies (a pruning sketch follows this list):
- Keep last N versions
- Retain versions for X days
- Compress old versions
- Archive to cold storage
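A pruning pass for the first two rules might look like this sketch; store is a placeholder for whatever holds per-document version history, and the defaults are arbitrary:

from datetime import datetime, timedelta, timezone

def prune_versions(store, doc_id, keep_last=5, max_age_days=365):
    versions = store.get_versions(doc_id)  # assumed newest-first
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    for position, version in enumerate(versions):
        # Drop a version only if it is beyond the keep-last window AND older than the cutoff
        if position >= keep_last and version.timestamp < cutoff:
            store.delete_version(doc_id, version.id)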
Warning: Versioning multiplies storage costs. Budget accordingly and implement aggressive retention policies.
Deletion & Archival Strategies
Hard deletion removes data permanently. Soft deletion marks data as deleted but retains it. Archival moves data to cheaper storage.
| Strategy | Use Case | Pros | Cons |
|---|---|---|---|
| Hard delete | GDPR compliance, sensitive data | Frees storage, meets regulations | Irreversible, no recovery |
| Soft delete | User content, accidental deletion | Recoverable, audit trail | Index bloat, query complexity |
| Archival | Old data, compliance retention | Cost-efficient, accessible | Slower retrieval |
Soft deletion implementation:
{
"id": "doc456",
"content": "Some content...",
"deleted": true,
"deleted_at": "2026-05-15T10:30:00Z",
"deleted_by": "user789"
}
Query filtering:
// Exclude soft-deleted by default
query = {
bool: {
must: { match: { content: searchTerms } },
filter: { term: { deleted: false } }
}
}
Archival tiers (hot → warm → cold):
| Tier | Age | Storage | Cost | Query latency |
|---|---|---|---|---|
| Hot | 0-30 days | Fast SSDs, high IOPS | $$$$ per GB | <50ms |
| Warm | 31-90 days | Standard storage, moderate IOPS | $$ per GB | <500ms |
| Cold | 90+ days | Archive storage, low IOPS | $ per GB | Seconds |
| Glacier | 1+ years | Deep archive, rare access | ¢ per GB | Hours (retrieval) |
Lifecycle automation:
lifecycle_policy:
- rule: move_to_warm
condition: age > 30_days
action: transition_to_warm_storage
- rule: move_to_cold
condition: age > 90_days AND access_count < 10
action: transition_to_cold_storage
- rule: archive
condition: age > 365_days
action: move_to_glacier
- rule: purge
condition: age > 2555_days # 7 years
action: permanent_delete
Cost optimization: Moving data from hot to cold storage can reduce costs by 90%+. Automate transitions based on access patterns.
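A scheduled job applying the policy above could be sketched as follows; index.scan, storage.transition, and the document fields are placeholders for your platform's own APIs:

from datetime import datetime, timezone

def apply_lifecycle(index, storage):
    now = datetime.now(timezone.utc)
    for doc in index.scan():
        age_days = (now - doc.created_at).days
        if age_days > 2555:                            # ~7 years: purge permanently
            index.delete(doc.id)
        elif age_days > 365:                           # archive to glacier
            storage.transition(doc.id, tier="glacier")
        elif age_days > 90 and doc.access_count < 10:  # cold: old and rarely accessed
            storage.transition(doc.id, tier="cold")
        elif age_days > 30:                            # warm after the hot window
            storage.transition(doc.id, tier="warm")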
Real-World Examples
Example 1: E-commerce Product Catalog
Scenario: An online retailer with 10 million products needs to keep pricing and inventory accurate.
Challenges:
- Prices change thousands of times per hour (dynamic pricing)
- Inventory updates in real-time as orders process
- Product descriptions updated by merchants irregularly
- Seasonal products need automatic archival
Solution architecture:
Multi-tier update strategy:
- Critical data (price, stock): Streaming updates via CDC
- Semi-critical (ratings, reviews): Micro-batch every 5 minutes
- Static content (descriptions, specs): Batch overnight
TTL implementation:
{ "product_id": "SHOE-123", "price": 89.99, "price_ttl": 300, // 5 minutes for dynamic pricing "inventory": 47, "inventory_ttl": 60, // 1 minute - critical "description": "Comfortable running shoes...", "description_ttl": 86400 // 24 hours - rarely changes }Change detection:
- Database triggers on price and inventory columns
- Events published to Kafka topic
- Stream processor updates index in < 2 seconds
Archival policy:
- Discontinued products → soft delete after 30 days
- Soft deleted → move to cold storage after 90 days
- Seasonal items (e.g., Halloween costumes) → auto-archive post-season
Results:
- 99.9% of searches return accurate pricing
- Reduced "out of stock" complaints by 85%
- Storage costs reduced 40% via archival
Example 2: News Aggregation Platform
Scenario: A news aggregator ingesting 100,000+ articles daily from various sources.
Challenges:
- Breaking news must appear instantly
- Old news loses value rapidly ("yesterday's news")
- Different content types age differently
- Storage costs for historical archive
Solution architecture:
Content-aware TTL:
def calculate_ttl(article):
    base_ttl = 7 * 24 * 3600  # 7 days
    if article.category == 'breaking':
        return 24 * 3600           # 1 day (high churn)
    elif article.category == 'analysis':
        return 30 * 24 * 3600      # 30 days (evergreen)
    elif article.category == 'sports':
        return 3 * 24 * 3600       # 3 days (event-driven)
    # Adjust based on engagement
    if article.views > 10000:
        return base_ttl * 2        # Popular content stays longer
    return base_ttl

Time-decay scoring:
// Boost recent articles in search ranking
score = relevance_score * time_decay_factor
time_decay_factor = exp(-lambda * age_hours)
// lambda = 0.1 means half-life of ~7 hours

Tiered storage:
- Today (hot): Full-text search, fast retrieval
- This week (warm): Searchable, acceptable latency
- This month (cold): Compressed, slower retrieval
- Archive (glacier): Compliance only, no search
Automatic lifecycle:
- Age 0-24h: HOT tier, prominent in feeds
- Age 1-7d: WARM tier, normal search results
- Age 7-30d: COLD tier, "older results" section
- Age 30d+: ARCHIVE tier, excluded from search (available via URL)
Results:
- Breaking news indexed in < 5 seconds
- 70% storage reduction vs. keeping all content hot
- User satisfaction up 25% (more relevant results)
Example 3: Enterprise Knowledge Base
Scenario: A company wiki with 500,000 documents, frequent edits, and strict compliance requirements.
Challenges:
- Must maintain edit history for audit (7 years)
- Outdated documentation causes support tickets
- Different departments have different freshness needs
- GDPR requires ability to purge employee data
Solution architecture:
Document versioning:
{ "doc_id": "KB-4567", "title": "Password Reset Procedure", "current_version": 12, "content": "<current content>", "versions": [ {"v": 1, "author": "alice", "date": "2019-03-15", "hash": "abc123"}, {"v": 12, "author": "bob", "date": "2026-05-10", "hash": "def456"} ], "last_reviewed": "2026-05-10", "review_due": "2026-08-10" // Quarterly review }Freshness indicators:
- Flag documents not reviewed in 90 days with a warning
- Auto-email doc owners 30 days before review due
- Surface "potentially outdated" banner in search results
Departmental policies:
- Engineering docs: Review every 3 months (fast-moving)
- HR policies: Review annually (stable)
- Legal: Review only on regulation change
Compliance-ready deletion:
def gdpr_delete(user_id):
    # Find all documents authored/edited by user
    docs = index.query({"author": user_id})
    for doc in docs:
        # Keep document structure, anonymize author
        doc.author = "[REDACTED]"
        doc.author_email = None
        # Anonymize the author in version history as well
        for version in doc.versions:
            if version.author == user_id:
                version.author = "[REDACTED]"
        index.update(doc)
    # Hard delete personal profile
    index.delete({"type": "user_profile", "id": user_id})

Incremental updates with validation:
- Authors edit via web interface
- On save, webhook triggers index update
- Validation rules check:
- No broken internal links
- Required metadata present
- Proper approval for policy docs
- Failed validation β don't index, notify author
Results:
- Support tickets for "incorrect documentation" down 60%
- 100% compliance with data retention policies
- Average document age reduced from 18 months to 4 months
Example 4: Medical Research Database
Scenario: A RAG system answering questions from published medical literature (millions of papers).
Challenges:
- Papers are sometimes retracted (must remove immediately)
- Preprints vs. peer-reviewed have different authority levels
- New research can contradict older findings
- Citations and impact evolve over time
Solution architecture:
Multi-version indexing:
{ "paper_id": "doi:10.1234/example", "versions": [ { "version": "preprint_v1", "status": "preprint", "date": "2025-01-15", "indexed": true, "boost": 0.5 // Lower ranking for preprints }, { "version": "published", "status": "peer_reviewed", "date": "2025-06-20", "indexed": true, "boost": 1.0 } ], "retracted": false, "citation_count": 47, "last_citation_update": "2026-05-15" }Retraction handling:
def handle_retraction(paper_id, retraction_notice):
    # Immediate hard delete from active index
    index.delete(paper_id)
    # Move to "retracted" collection with notice
    retracted_index.add({
        "original_id": paper_id,
        "retraction_date": datetime.now(),
        "retraction_reason": retraction_notice,
        "original_content": get_snapshot(paper_id)
    })
    # Update citing papers with warning
    citing_papers = find_citations_to(paper_id)
    for citing in citing_papers:
        citing.add_warning(f"References retracted paper {paper_id}")
        index.update(citing)

Dynamic relevance scoring:
- Recent papers get recency boost (last 2 years)
- High citation count increases authority
- Peer-reviewed > preprints
- Retractions immediately zeroed out (see the scoring sketch below)
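A scoring function combining those signals might look like the sketch below; the weights, the two-year recency window, and the published_at field are illustrative assumptions layered on the multi-version document shown earlier:

import math
from datetime import datetime, timezone

def rank_score(paper, relevance):
    if paper["retracted"]:
        return 0.0  # retracted papers are zeroed out immediately
    # Peer-reviewed versions carry a higher boost than preprints
    score = relevance * max(v["boost"] for v in paper["versions"] if v["indexed"])
    # Recency boost for papers from roughly the last two years
    age_years = (datetime.now(timezone.utc) - paper["published_at"]).days / 365
    if age_years < 2:
        score *= 1.2
    # Citation count as a mild, log-scaled authority signal
    score *= 1 + 0.1 * math.log1p(paper["citation_count"])
    return score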
Citation freshness:
- Weekly batch job updates citation counts from external APIs
- Triggers reindexing of papers with significant citation change (>20%)
- Enables "trending" queries (papers gaining attention)
Compliance and ethics:
- Patient data: immediate hard delete on request (HIPAA)
- Author opt-out: soft delete + anonymize
- Clinical trial data: retain for legal minimum (10 years)
Results:
- Zero incidents of citing retracted papers
- 95% of results from last 5 years (appropriate for fast-moving field)
- Compliance with medical data regulations
Common Mistakes to Avoid
1. Ignoring Partial Update Failures
Wrong approach: Assume all updates succeed
for doc in updated_docs:
index.update(doc) # What if this fails halfway?
Correct approach: Track failures and implement retry logic
failed_updates = []
for doc in updated_docs:
try:
index.update(doc)
except Exception as e:
failed_updates.append({"doc": doc, "error": e})
log_failure(doc.id, e)
if failed_updates:
retry_with_backoff(failed_updates)
Why it matters: A partial batch failure can leave your index in an inconsistent state. Users see some updated content, some stale, undermining trust.
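The retry_with_backoff helper referenced above isn't defined in this lesson; a minimal sketch with exponential backoff and a dead-letter list could look like this:

import time

def retry_with_backoff(failed_updates, max_attempts=5, base_delay=1.0):
    dead_letter = []
    for item in failed_updates:
        for attempt in range(max_attempts):
            try:
                index.update(item["doc"])
                break
            except Exception:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        else:
            dead_letter.append(item)  # retries exhausted; hand off to a dead-letter queue
    return dead_letter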
2. Not Accounting for Clock Skew
Wrong approach: Compare timestamps across distributed systems
if source_timestamp > indexed_timestamp:
update_index()
Correct approach: Use version numbers or vector clocks
if source_version > indexed_version:
update_index()
Why it matters: Servers have clock drift. A source server 5 minutes ahead could make fresh data appear stale, causing unnecessary reindexing.
3. Aggressive TTL Without Grace Period
Wrong approach: Hard cutoff at TTL
if age > ttl:
delete_from_index(doc)
Correct approach: Grace period + soft warning
if age > ttl:
doc.mark_as_stale()
doc.reduce_ranking_boost()
if age > ttl + grace_period:
delete_from_index(doc)
Why it matters: If your refresh pipeline has a hiccup, aggressive TTL could delete all your data. Grace periods prevent catastrophic data loss.
4. Forgetting to Propagate Deletes
Wrong approach: Only sync additions and updates
for doc in get_changed_docs():
if doc.exists():
index.upsert(doc)
# Deletes never reach the index!
Correct approach: Explicitly handle deletions
for change in get_change_stream():
if change.type == 'INSERT' or change.type == 'UPDATE':
index.upsert(change.doc)
elif change.type == 'DELETE':
index.delete(change.doc_id)
Why it matters: Deleted source documents lingering in search results frustrate users and waste compute resources.
5. No Monitoring of Data Freshness
Wrong approach: Assume pipeline is working
Correct approach: Active monitoring and alerting
metrics = {
"avg_index_lag": calculate_avg_lag(),
"p99_staleness": get_percentile(99, doc_ages),
"failed_updates_count": count_failures_last_hour(),
"oldest_document_age": max(doc_ages)
}
if metrics["p99_staleness"] > SLA_THRESHOLD:
alert_oncall("Data freshness SLA breach")
Why it matters: You can't fix what you don't measure. Silent failures mean users get stale data without anyone knowing.
6. Overindexing Volatile Data
Wrong approach: Reindex on every tiny change
# Stock price updates every 100ms
for price_update in price_stream:
index.update_document(ticker, new_price)
# Thrashing the index!
Correct approach: Batch rapid updates or use cache
# Update index every 5 seconds with latest price
price_buffer = {}

def buffer_price_update(ticker, price):
    price_buffer[ticker] = price

def flush_prices():  # run this on a 5-second schedule
    index.bulk_update(price_buffer)
    price_buffer.clear()
Why it matters: Excessive indexing wastes CPU/IO and can actually increase query latency due to lock contention.
7. Inconsistent TTL Across Pipelines
Wrong approach: Different TTL settings in different systems
# Cache layer
ttl: 3600  # 1 hour
# Index layer
ttl: 7200  # 2 hours
# API response cache
ttl: 1800  # 30 minutes
Correct approach: Centralized TTL configuration
# config.yaml - single source of truth
ttl_policies:
product_data: 3600
user_profiles: 7200
session_data: 1800
# All systems read from same config
Why it matters: Inconsistent TTLs create confusion because users see different data freshness depending on which path serves their request.
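A small loader keeps every layer reading the same values; this sketch assumes PyYAML and the config.yaml layout shown above:

import yaml  # PyYAML

def load_ttl(policy_name, config_path="config.yaml"):
    # Single source of truth consulted by the cache, index, and API layers alike
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return config["ttl_policies"][policy_name]

# e.g. cache.set(key, value, ttl=load_ttl("product_data"))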
Key Takeaways
Data freshness is a spectrum, not binary. Define acceptable staleness for each data type based on business requirements.
TTL policies prevent index bloat and ensure users see relevant content. Implement document-level TTL for flexibility.
Incremental updates are essential for large-scale systems. Use CDC, timestamps, or version numbers to track changes efficiently.
Update propagation latency ranges from milliseconds (synchronous) to days (batch). Choose based on freshness requirements and resource constraints.
Version your indexes to enable safe rollbacks and A/B testing. Use index aliasing for zero-downtime switches.
Implement tiered storage (hot/warm/cold) to optimize costs. Automate lifecycle transitions based on age and access patterns.
Soft deletes provide safety nets for accidental deletions. Hard deletes are for compliance (GDPR) and sensitive data.
Monitor freshness metrics actively: index lag, staleness percentiles, failed updates. Alert when SLAs breach.
Plan for partial failures. Distributed systems fail in creative waysβimplement retries, idempotency, and dead-letter queues.
Balance freshness and cost. Real-time updates are expensive. Not every data type needs millisecond freshness.
Quick Reference Card: Data Freshness Checklist
| Checklist item | What to do |
|---|---|
| Define Staleness SLAs | Set acceptable lag for each data type (seconds/hours/days) |
| Implement Change Detection | Timestamps, versions, CDC, or checksums |
| Choose Update Pattern | Batch, micro-batch, streaming, or synchronous |
| Set TTL Policies | Document-level expiration based on content type |
| Enable Versioning | Index aliasing or snapshots for rollback capability |
| Configure Archival | Hot → warm → cold → glacier transitions |
| Handle Deletes | Soft delete by default, hard delete for compliance |
| Monitor Metrics | Index lag, staleness p99, failed updates, oldest doc age |
| Plan for Failures | Retries, dead-letter queues, alerting |
| Test Rollback | Regularly verify you can restore from backups/versions |
Further Study
- Elasticsearch Data Lifecycle Management: https://www.elastic.co/guide/en/elasticsearch/reference/current/data-lifecycle.html
- Change Data Capture Patterns: https://debezium.io/documentation/reference/architecture.html
- AWS Data Lifecycle Best Practices: https://aws.amazon.com/blogs/storage/optimizing-your-storage-costs-with-amazon-s3/
Mnemonic for update patterns: "Bob Makes Silly Sandwiches" = Batch, Micro-batch, Streaming, Synchronous
Next steps: Practice implementing a TTL policy in your chosen vector database. Start with a simple document-level expiration field and expand to automated archival as you gain confidence.