
Data Freshness & Lifecycle

Re-embedding strategies · TTLs · Stale chunk detection

Master data freshness and lifecycle management with free flashcards and spaced repetition practice. This lesson covers data staleness detection, incremental updates, and Time-To-Live (TTL) strategies: essential concepts for building production-grade AI search and RAG systems that deliver accurate, timely results.

Welcome to Data Freshness & Lifecycle 📊

In modern AI search and RAG (Retrieval-Augmented Generation) systems, data freshness isn't just a nice-to-have feature; it's a critical requirement. Imagine a customer service chatbot providing outdated product information, or a legal research assistant citing superseded regulations. The consequences range from user frustration to serious compliance issues.

Data freshness refers to how current and up-to-date your indexed data is compared to the source of truth. Data lifecycle management encompasses the entire journey of data from ingestion through updates, archival, and eventual deletion. Together, these concepts ensure your AI system remains accurate, relevant, and trustworthy over time.

Why does this matter? Consider these scenarios:

  • 📰 News aggregation: Articles from last week might be worthless for breaking news queries
  • 💰 E-commerce: Outdated pricing or inventory data leads to failed transactions
  • 🏥 Healthcare: Stale patient records could impact treatment decisions
  • 📚 Documentation: Users finding deprecated API references waste hours debugging

Core Concepts

Understanding Data Staleness 🕐

Data staleness is the time gap between when source data changes and when those changes reflect in your search index. Every RAG system has some staleness; the question is whether it's acceptable for your use case.

Use Case                 | Acceptable Staleness | Impact of Stale Data
Real-time stock trading  | Milliseconds         | 🔴 Critical - financial losses
Social media feeds       | Seconds to minutes   | 🟡 Moderate - user experience
Product catalogs         | Hours                | 🟡 Moderate - conversion rates
Legal documents          | Days                 | 🟠 Significant - compliance risk
Historical archives      | Weeks to months      | 🟢 Low - reference material

Staleness metrics you should track:

  • Index lag: Time between source update and index availability
  • Query staleness: Age of the oldest document returned in results
  • Coverage freshness: Percentage of documents updated within SLA

💡 Pro tip: Don't just measure average staleness; track the 99th percentile. A few extremely stale documents can severely impact user trust.
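
As a concrete illustration, here is a minimal sketch of how these metrics could be computed, assuming each indexed document records a source-update timestamp and an indexed-at timestamp (the field names are illustrative, not a specific library's API):

from datetime import datetime, timezone

def staleness_report(docs, sla_seconds):
    """Summarize freshness for documents exposing .source_updated_at and .indexed_at (illustrative fields)."""
    now = datetime.now(timezone.utc)
    lags = sorted((d.indexed_at - d.source_updated_at).total_seconds() for d in docs)
    ages = sorted((now - d.indexed_at).total_seconds() for d in docs)
    return {
        "avg_index_lag_s": sum(lags) / len(lags),
        "p99_index_lag_s": lags[int(0.99 * (len(lags) - 1))],   # track the tail, not just the mean
        "coverage_freshness": sum(a <= sla_seconds for a in ages) / len(ages),
        "oldest_doc_age_s": ages[-1],
    }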

Time-To-Live (TTL) Strategies ⏰

TTL defines how long data remains valid before requiring refresh or removal. Think of it as an expiration date for your indexed content.

┌─────────────────────────────────────────────────┐
│                 TTL LIFECYCLE                   │
├─────────────────────────────────────────────────┤
│                                                 │
│  📥 Ingest → ✅ Valid → ⚠️ Stale → 🗑️ Delete     │
│     t=0       t<TTL      t>TTL     t>TTL+grace  │
│                                                 │
└─────────────────────────────────────────────────┘

TTL implementation approaches:

  1. Document-level TTL: Each document carries its own expiration timestamp

    • Flexible for mixed content types
    • Stored as metadata field: expires_at: 2026-06-01T00:00:00Z
    • Index queries filter out expired docs automatically
  2. Collection-level TTL: All documents in a collection share the same TTL

    • Simpler to manage
    • Good for uniform content (e.g., session data, cache entries)
    • Example: "All chat history expires after 30 days"
  3. Sliding window TTL: TTL resets on each access (see the sketch after the implementation example below)

    • Keeps frequently used data fresh
    • Automatically purges unused content
    • Common in cache systems
  4. Conditional TTL: Expiration depends on document properties

    • Premium content: 90 days, Free content: 7 days
    • Active users: no expiry, Inactive users: 180 days

Implementation example (pseudocode):

from datetime import datetime, timedelta

class Document:
    def __init__(self, content, ttl_seconds):
        self.content = content
        self.created_at = datetime.now()
        self.expires_at = self.created_at + timedelta(seconds=ttl_seconds)

    def is_valid(self):
        # The document is served only while the current time is before its expiry
        return datetime.now() < self.expires_at

    def time_until_expiry(self):
        # Negative values mean the document has already expired
        return (self.expires_at - datetime.now()).total_seconds()
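
Approach 3 above (sliding-window TTL) changes only the expiry bookkeeping: each read pushes the expiration forward. A minimal sketch in the same illustrative style as the class above:

from datetime import datetime, timedelta

class SlidingTTLDocument:
    def __init__(self, content, ttl_seconds):
        self.content = content
        self.ttl = timedelta(seconds=ttl_seconds)
        self.expires_at = datetime.now() + self.ttl

    def read(self):
        # Every access resets the expiration window, keeping frequently used content alive
        self.expires_at = datetime.now() + self.ttl
        return self.content

    def is_valid(self):
        return datetime.now() < self.expires_at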

⚠️ Common pitfall: Setting TTL too aggressively can cause unnecessary reindexing load. Too conservative, and you serve stale data. Balance is key!

Incremental vs Full Reindexing 🔄

When source data changes, you have two fundamental approaches to update your index:

Full reindexing:

  • Rebuilds the entire index from scratch
  • Guarantees consistency
  • Resource-intensive and slow
  • Suitable for: major schema changes, quarterly refreshes, small datasets

Incremental updates:

  • Only processes changed documents
  • Fast and efficient
  • Requires change detection mechanism
  • Suitable for: real-time systems, large datasets, frequent updates

Aspect        | Full Reindex      | Incremental Update
Speed         | Slow (hours/days) | Fast (seconds/minutes)
Complexity    | Simple            | Complex (change tracking)
Resource cost | High              | Low
Consistency   | Guaranteed        | Eventual
Downtime      | Possible          | None (zero-downtime)

Change detection mechanisms:

  1. Timestamps: Track a last_modified field (see the polling sketch after this list)

    SELECT * FROM documents WHERE last_modified > '2026-05-01'
    
    • Simple and effective
    • Requires source system to maintain timestamps
  2. Version numbers: Monotonically increasing integers

    SELECT * FROM documents WHERE version > :last_indexed_version
    
    • Atomic and reliable
    • Good for database systems with built-in versioning
  3. Change Data Capture (CDC): Stream of changes from database logs

    • Real-time updates
    • Captures all operations (insert/update/delete)
    • Requires CDC infrastructure (Debezium, AWS DMS)
  4. Checksums/Hashes: Compare content fingerprints

    current_hash = hashlib.sha256(document.encode()).hexdigest()
    if current_hash != indexed_hash:
        update_index(document)
    
    • Detects any content change
    • CPU-intensive for large documents
  5. Event-driven updates: Source system publishes change events

    • Push-based (no polling)
    • Requires event streaming (Kafka, RabbitMQ)

💡 Hybrid approach: Use incremental updates for routine changes, and schedule a full reindex weekly or monthly to catch any inconsistencies.
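
A minimal sketch of the timestamp mechanism (option 1 above): poll for rows changed since the last checkpoint and push them to the index. Here `fetch_changed_docs`, `index`, and `checkpoint_store` are placeholders for your source database, search index, and state store:

from datetime import datetime, timezone

def incremental_sync(fetch_changed_docs, index, checkpoint_store):
    """Poll-based incremental update using a last_modified watermark (illustrative)."""
    last_sync = checkpoint_store.get("last_sync")        # e.g. "2026-05-01T00:00:00Z"
    sync_started = datetime.now(timezone.utc).isoformat()

    for doc in fetch_changed_docs(since=last_sync):       # WHERE last_modified > :last_sync
        if doc.get("deleted"):
            index.delete(doc["id"])                        # propagate deletes, not just upserts
        else:
            index.upsert(doc)

    # Advance the watermark only after the whole batch succeeds
    checkpoint_store.set("last_sync", sync_started)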

Update Propagation Patterns 🌊

Update propagation is how changes flow from source systems through your data pipeline to the search index:

┌────────────────────────────────────────────────────┐
│           UPDATE PROPAGATION PATTERNS              │
└────────────────────────────────────────────────────┘

1. BATCH PROCESSING (scheduled intervals)
   Source DB → [Wait] → Extract → Transform → Index
              (hours)     ↓         ↓         ↓
                        Queue     ETL      Bulk API
   Latency: Hours to days
   Cost: Low
   Complexity: Simple

2. MICRO-BATCH (small frequent batches)
   Source DB → [Wait] → Extract → Transform → Index
              (minutes)   ↓         ↓         ↓
                        Stream    Lambda    Batch
   Latency: Minutes
   Cost: Medium
   Complexity: Medium

3. STREAMING (near real-time)
   Source DB → CDC → Kafka → Processor → Index
                ↓      ↓        ↓         ↓
              Log   Stream   Transform  Single
   Latency: Seconds
   Cost: High
   Complexity: High

4. SYNCHRONOUS (inline updates)
   App → Database → Webhook → Index
         (writes)     ↓         ↓
                   Trigger   Update
   Latency: Milliseconds
   Cost: Highest
   Complexity: Highest

Choosing the right pattern:

  • Batch: Analytics, historical archives, non-critical updates
  • Micro-batch: News feeds, social media, moderate freshness needs
  • Streaming: Financial data, inventory, high-value real-time apps
  • Synchronous: Critical operations, small scale, strong consistency

🔧 Try this: Map your use cases to update patterns. If you have mixed requirements, implement multiple patterns for different data types.
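
For the streaming pattern, a consumer loop might look roughly like the sketch below. It assumes the kafka-python package and CDC events shaped as {"op": ..., "doc": ...}; the topic name and the index client are illustrative placeholders, not a specific product's API:

import json
from kafka import KafkaConsumer  # assumes kafka-python is installed

def stream_updates(index, topic="db.changes.documents"):
    """Apply CDC events to the search index as they arrive (near real-time)."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for event in consumer:
        change = event.value
        if change["op"] in ("insert", "update"):
            index.upsert(change["doc"])          # hypothetical index client
        elif change["op"] == "delete":
            index.delete(change["doc_id"])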

Data Versioning & Rollback 🔙

Data versioning maintains historical snapshots of your index, enabling rollback when issues arise.

Why version your index?

  • πŸ› Bug recovery: Roll back if pipeline introduces corrupted data
  • πŸ”¬ A/B testing: Compare search quality across index versions
  • πŸ“Š Compliance: Maintain audit trail of data changes
  • βͺ Safe deployments: Test new indexing logic against production snapshot

Versioning strategies:

  1. Index aliasing: Point an alias at the active version (see the alias-switch sketch after this list)

    products_v1 ← products (alias)
    products_v2
    products_v3

    # Switch traffic:
    products (alias) → products_v3

    • Zero-downtime switching
    • Instant rollback
    • Doubles storage temporarily
  2. Snapshot-based: Periodic backups

    daily_snapshot_2026_05_20
    daily_snapshot_2026_05_21
    daily_snapshot_2026_05_22
    
    • Point-in-time recovery
    • Storage-efficient (incremental)
    • Restore takes time
  3. Document versioning: Store version history per document

    {
      "id": "doc123",
      "current_version": 5,
      "versions": [
        {"v": 1, "content": "...", "timestamp": "..."},
        {"v": 2, "content": "...", "timestamp": "..."},
        {"v": 5, "content": "...", "timestamp": "..."}
      ]
    }
    
    • Granular history
    • High storage cost
    • Enables temporal queries ("What did this document say last month?")
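
A minimal sketch of the alias switch in strategy 1, using the Elasticsearch/OpenSearch _aliases REST endpoint via requests; the cluster URL and index names are illustrative:

import requests

ES_URL = "http://localhost:9200"  # illustrative cluster address

def switch_alias(alias, old_index, new_index):
    """Atomically repoint an alias to a new index version (zero-downtime cutover)."""
    actions = {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
    resp = requests.post(f"{ES_URL}/_aliases", json=actions)
    resp.raise_for_status()

# Example: cut search traffic over to the freshly built index
# switch_alias("products", "products_v2", "products_v3")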

Retention policies:

  • Keep last N versions
  • Retain versions for X days
  • Compress old versions
  • Archive to cold storage

⚠️ Warning: Versioning multiplies storage costs. Budget accordingly and implement aggressive retention policies.

Deletion & Archival Strategies 🗄️

Hard deletion removes data permanently. Soft deletion marks data as deleted but retains it. Archival moves data to cheaper storage.

Strategy    | Use Case                           | Pros                              | Cons
Hard delete | GDPR compliance, sensitive data    | Frees storage, meets regulations  | Irreversible, no recovery
Soft delete | User content, accidental deletion  | Recoverable, audit trail          | Index bloat, query complexity
Archival    | Old data, compliance retention     | Cost-efficient, accessible        | Slower retrieval

Soft deletion implementation:

{
  "id": "doc456",
  "content": "Some content...",
  "deleted": true,
  "deleted_at": "2026-05-15T10:30:00Z",
  "deleted_by": "user789"
}

Query filtering:

// Exclude soft-deleted by default
query = {
  bool: {
    must: { match: { content: searchTerms } },
    filter: { term: { deleted: false } }
  }
}

Archival tiers (hot → warm → cold):

┌──────────────────────────────────────────────┐
│  DATA TEMPERATURE TIERS                      │
├──────────────────────────────────────────────┤
│                                              │
│  🔥 HOT (0-30 days)                          │
│  Fast SSDs, high IOPS                        │
│  $$$$ per GB                                 │
│  Query latency: <50ms                        │
│            ↓                                 │
│  🌡️ WARM (31-90 days)                        │
│  Standard storage, moderate IOPS             │
│  $$ per GB                                   │
│  Query latency: <500ms                       │
│            ↓                                 │
│  ❄️ COLD (90+ days)                          │
│  Archive storage, low IOPS                   │
│  $ per GB                                    │
│  Query latency: seconds                      │
│            ↓                                 │
│  🧊 GLACIER (1+ years)                       │
│  Deep archive, rare access                   │
│  ¢ per GB                                    │
│  Retrieval time: hours                       │
│                                              │
└──────────────────────────────────────────────┘

Lifecycle automation:

lifecycle_policy:
  - rule: move_to_warm
    condition: age > 30_days
    action: transition_to_warm_storage
  
  - rule: move_to_cold
    condition: age > 90_days AND access_count < 10
    action: transition_to_cold_storage
  
  - rule: archive
    condition: age > 365_days
    action: move_to_glacier
  
  - rule: purge
    condition: age > 2555_days  # 7 years
    action: permanent_delete

💡 Cost optimization: Moving data from hot to cold storage can reduce costs by 90%+. Automate transitions based on access patterns.
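
The lifecycle policy above could be evaluated by a small scheduled job; the sketch below mirrors the YAML thresholds, with age measured in days:

def lifecycle_action(age_days, access_count):
    """Return the tier transition implied by the policy above, or None to stay in the hot tier."""
    if age_days > 2555:                           # ~7 years
        return "permanent_delete"
    if age_days > 365:
        return "move_to_glacier"
    if age_days > 90 and access_count < 10:
        return "transition_to_cold_storage"
    if age_days > 30:
        return "transition_to_warm_storage"
    return None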

Real-World Examples

Example 1: E-commerce Product Catalog 🛒

Scenario: An online retailer with 10 million products needs to keep pricing and inventory accurate.

Challenges:

  • Prices change thousands of times per hour (dynamic pricing)
  • Inventory updates in real-time as orders process
  • Product descriptions updated by merchants irregularly
  • Seasonal products need automatic archival

Solution architecture:

  1. Multi-tier update strategy:

    • Critical data (price, stock): Streaming updates via CDC
    • Semi-critical (ratings, reviews): Micro-batch every 5 minutes
    • Static content (descriptions, specs): Batch overnight
  2. TTL implementation (see the field-level refresh sketch after this list):

    {
      "product_id": "SHOE-123",
      "price": 89.99,
      "price_ttl": 300,  // 5 minutes for dynamic pricing
      "inventory": 47,
      "inventory_ttl": 60,  // 1 minute - critical
      "description": "Comfortable running shoes...",
      "description_ttl": 86400  // 24 hours - rarely changes
    }
    
  3. Change detection:

    • Database triggers on price and inventory columns
    • Events published to Kafka topic
    • Stream processor updates index in < 2 seconds
  4. Archival policy:

    • Discontinued products → soft delete after 30 days
    • Soft deleted → move to cold storage after 90 days
    • Seasonal items (e.g., Halloween costumes) → auto-archive post-season
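
The per-field TTLs from item 2 imply a refresh check like this sketch; the field names mirror the JSON above, while `fetch_field` and the `*_fetched_at` bookkeeping fields are hypothetical:

import time

def refresh_stale_fields(product, fetch_field, now=None):
    """Re-fetch only the fields whose TTL has elapsed, instead of reindexing the whole document."""
    now = now or time.time()
    for field in ("price", "inventory", "description"):
        ttl = product[f"{field}_ttl"]
        fetched_at = product.get(f"{field}_fetched_at", 0)     # hypothetical bookkeeping field
        if now - fetched_at > ttl:
            product[field] = fetch_field(product["product_id"], field)  # caller-supplied loader
            product[f"{field}_fetched_at"] = now
    return product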

Results:

  • 99.9% of searches return accurate pricing
  • Reduced "out of stock" complaints by 85%
  • Storage costs reduced 40% via archival

Example 2: News Aggregation Platform 📰

Scenario: A news aggregator ingesting 100,000+ articles daily from various sources.

Challenges:

  • Breaking news must appear instantly
  • Old news loses value rapidly ("yesterday's news")
  • Different content types age differently
  • Storage costs for historical archive

Solution architecture:

  1. Content-aware TTL:

    def calculate_ttl(article):
        # Category sets the baseline TTL
        if article.category == 'breaking':
            ttl = 24 * 3600  # 1 day (high churn)
        elif article.category == 'analysis':
            ttl = 30 * 24 * 3600  # 30 days (evergreen)
        elif article.category == 'sports':
            ttl = 3 * 24 * 3600  # 3 days (event-driven)
        else:
            ttl = 7 * 24 * 3600  # 7-day default

        # Adjust based on engagement: popular content stays longer
        if article.views > 10000:
            ttl *= 2

        return ttl
    
  2. Time-decay scoring (see the sketch after this list):

    // Boost recent articles in search ranking
    score = relevance_score * time_decay_factor
    
    time_decay_factor = exp(-lambda * age_hours)
    // lambda = 0.1 means half-life of ~7 hours
    
  3. Tiered storage:

    • Today (hot): Full-text search, fast retrieval
    • This week (warm): Searchable, acceptable latency
    • This month (cold): Compressed, slower retrieval
    • Archive (glacier): Compliance only, no search
  4. Automatic lifecycle:

    - Age 0-24h: HOT tier, prominent in feeds
    - Age 1-7d: WARM tier, normal search results
    - Age 7-30d: COLD tier, "older results" section
    - Age 30d+: ARCHIVE tier, excluded from search (available via URL)
    
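A minimal Python version of the time-decay scoring in item 2, using the same illustrative decay constant as above:

import math

def decayed_score(relevance_score, age_hours, decay_lambda=0.1):
    """Exponential time decay; with lambda = 0.1 the boost halves roughly every 7 hours."""
    return relevance_score * math.exp(-decay_lambda * age_hours)

# Example: a perfectly relevant article published 24 hours ago
# decayed_score(1.0, 24)  ->  ~0.09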

Results:

  • Breaking news indexed in < 5 seconds
  • 70% storage reduction vs. keeping all content hot
  • User satisfaction up 25% (more relevant results)

Example 3: Enterprise Knowledge Base 📚

Scenario: A company wiki with 500,000 documents, frequent edits, and strict compliance requirements.

Challenges:

  • Must maintain edit history for audit (7 years)
  • Outdated documentation causes support tickets
  • Different departments have different freshness needs
  • GDPR requires ability to purge employee data

Solution architecture:

  1. Document versioning:

    {
      "doc_id": "KB-4567",
      "title": "Password Reset Procedure",
      "current_version": 12,
      "content": "<current content>",
      "versions": [
        {"v": 1, "author": "alice", "date": "2019-03-15", "hash": "abc123"},
        {"v": 12, "author": "bob", "date": "2026-05-10", "hash": "def456"}
      ],
      "last_reviewed": "2026-05-10",
      "review_due": "2026-08-10"  // Quarterly review
    }
    
  2. Freshness indicators:

    • Flag documents not reviewed in 90 days with a ⚠️ warning
    • Auto-email doc owners 30 days before review due
    • Surface "potentially outdated" banner in search results
  3. Departmental policies:

    • Engineering docs: Review every 3 months (fast-moving)
    • HR policies: Review annually (stable)
    • Legal: Review only on regulation change
  4. Compliance-ready deletion:

    def gdpr_delete(user_id):
        # Find all documents authored/edited by user
        docs = index.query({"author": user_id})
    
        for doc in docs:
            # Keep document structure, anonymize author
            doc.author = "[REDACTED]"
            doc.author_email = None
    
            # Anonymize the author in the version history entries
            for version in doc.versions:
                if version.author == user_id:
                    version.author = "[REDACTED]"
    
            index.update(doc)
    
        # Hard delete personal profile
        index.delete({"type": "user_profile", "id": user_id})
    
  5. Incremental updates with validation:

    • Authors edit via web interface
    • On save, webhook triggers index update
    • Validation rules check:
      • No broken internal links
      • Required metadata present
      • Proper approval for policy docs
    • Failed validation → don't index, notify author

Results:

  • Support tickets for "incorrect documentation" down 60%
  • 100% compliance with data retention policies
  • Average document age reduced from 18 months to 4 months

Example 4: Medical Research Database 🏥

Scenario: A RAG system answering questions from published medical literature (millions of papers).

Challenges:

  • Papers are sometimes retracted (must remove immediately)
  • Preprints vs. peer-reviewed have different authority levels
  • New research can contradict older findings
  • Citations and impact evolve over time

Solution architecture:

  1. Multi-version indexing:

    {
      "paper_id": "doi:10.1234/example",
      "versions": [
        {
          "version": "preprint_v1",
          "status": "preprint",
          "date": "2025-01-15",
          "indexed": true,
          "boost": 0.5  // Lower ranking for preprints
        },
        {
          "version": "published",
          "status": "peer_reviewed",
          "date": "2025-06-20",
          "indexed": true,
          "boost": 1.0
        }
      ],
      "retracted": false,
      "citation_count": 47,
      "last_citation_update": "2026-05-15"
    }
    
  2. Retraction handling:

    def handle_retraction(paper_id, retraction_notice):
        # Immediate hard delete from active index
        index.delete(paper_id)
    
        # Move to "retracted" collection with notice
        retracted_index.add({
            "original_id": paper_id,
            "retraction_date": datetime.now(),
            "retraction_reason": retraction_notice,
            "original_content": get_snapshot(paper_id)
        })
    
        # Update citing papers with warning
        citing_papers = find_citations_to(paper_id)
        for citing in citing_papers:
            citing.add_warning(f"References retracted paper {paper_id}")
            index.update(citing)
    
  3. Dynamic relevance scoring:

    • Recent papers get recency boost (last 2 years)
    • High citation count increases authority
    • Peer-reviewed > preprints
    • Retractions immediately zeroed out
  4. Citation freshness (see the sketch after this list):

    • Weekly batch job updates citation counts from external APIs
    • Triggers reindexing of papers with significant citation change (>20%)
    • Enables "trending" queries (papers gaining attention)
  5. Compliance and ethics:

    • Patient data: immediate hard delete on request (HIPAA)
    • Author opt-out: soft delete + anonymize
    • Clinical trial data: retain for legal minimum (10 years)
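
Item 4's weekly citation refresh might look roughly like this sketch; `get_citation_count` stands in for an external citation API, the index client is hypothetical, and the 20% threshold matches the rule above:

def refresh_citations(papers, get_citation_count, index, threshold=0.2):
    """Reindex papers whose citation count changed by more than the threshold (illustrative)."""
    for paper in papers:
        new_count = get_citation_count(paper["paper_id"])
        old_count = paper.get("citation_count") or 1           # avoid division by zero
        if abs(new_count - old_count) / old_count > threshold:
            paper["citation_count"] = new_count
            index.update(paper)                                 # hypothetical index client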

Results:

  • Zero incidents of citing retracted papers
  • 95% of results from last 5 years (appropriate for fast-moving field)
  • Compliance with medical data regulations

Common Mistakes to Avoid ⚠️

1. Ignoring Partial Update Failures

❌ Wrong approach: Assume all updates succeed

for doc in updated_docs:
    index.update(doc)  # What if this fails halfway?

✅ Correct approach: Track failures and implement retry logic

failed_updates = []
for doc in updated_docs:
    try:
        index.update(doc)
    except Exception as e:
        failed_updates.append({"doc": doc, "error": e})
        log_failure(doc.id, e)

if failed_updates:
    retry_with_backoff(failed_updates)

Why it matters: A partial batch failure can leave your index in an inconsistent state. Users see some updated content and some stale, undermining trust.

2. Not Accounting for Clock Skew

❌ Wrong approach: Compare timestamps across distributed systems

if source_timestamp > indexed_timestamp:
    update_index()

✅ Correct approach: Use version numbers or vector clocks

if source_version > indexed_version:
    update_index()

Why it matters: Servers have clock drift. A source server 5 minutes ahead could make fresh data appear stale, causing unnecessary reindexing.

3. Aggressive TTL Without Grace Period

❌ Wrong approach: Hard cutoff at TTL

if age > ttl:
    delete_from_index(doc)

✅ Correct approach: Grace period + soft warning

if age > ttl:
    doc.mark_as_stale()
    doc.reduce_ranking_boost()
    
if age > ttl + grace_period:
    delete_from_index(doc)

Why it matters: If your refresh pipeline has a hiccup, aggressive TTL could delete all your data. Grace periods prevent catastrophic data loss.

4. Forgetting to Propagate Deletes

❌ Wrong approach: Only sync additions and updates

for doc in get_changed_docs():
    if doc.exists():
        index.upsert(doc)
    # Deletes never reach the index!

✅ Correct approach: Explicitly handle deletions

for change in get_change_stream():
    if change.type == 'INSERT' or change.type == 'UPDATE':
        index.upsert(change.doc)
    elif change.type == 'DELETE':
        index.delete(change.doc_id)

Why it matters: Deleted source documents lingering in search results frustrate users and waste compute resources.

5. No Monitoring of Data Freshness

❌ Wrong approach: Assume the pipeline is working

✅ Correct approach: Active monitoring and alerting

metrics = {
    "avg_index_lag": calculate_avg_lag(),
    "p99_staleness": get_percentile(99, doc_ages),
    "failed_updates_count": count_failures_last_hour(),
    "oldest_document_age": max(doc_ages)
}

if metrics["p99_staleness"] > SLA_THRESHOLD:
    alert_oncall("Data freshness SLA breach")

Why it matters: You can't fix what you don't measure. Silent failures mean users get stale data without anyone knowing.

6. Overindexing Volatile Data

❌ Wrong approach: Reindex on every tiny change

# Stock price updates every 100ms
for price_update in price_stream:
    index.update_document(price_update.ticker, price_update.price)
    # Thrashing the index!

✅ Correct approach: Batch rapid updates or use a cache

import time

# Keep only the latest price per ticker, flush to the index every 5 seconds
price_buffer = {}

def buffer_price_update(ticker, price):
    price_buffer[ticker] = price

while True:
    time.sleep(5)
    if price_buffer:
        index.bulk_update(price_buffer)   # one bulk call instead of thousands of writes
        price_buffer.clear()

Why it matters: Excessive indexing wastes CPU/IO and can actually increase query latency due to lock contention.

7. Inconsistent TTL Across Pipelines

❌ Wrong approach: Different TTL settings in different systems

# Cache layer
ttl: 3600  # 1 hour

# Index layer
ttl: 7200  # 2 hours

# API response cache
ttl: 1800  # 30 minutes

✅ Correct approach: Centralized TTL configuration

# config.yaml - single source of truth
ttl_policies:
  product_data: 3600
  user_profiles: 7200
  session_data: 1800

# All systems read from the same config
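
Each consumer can then load the shared policy at startup; a minimal sketch assuming PyYAML and the config.yaml above:

import yaml  # PyYAML

with open("config.yaml") as f:
    TTL_POLICIES = yaml.safe_load(f)["ttl_policies"]

product_ttl = TTL_POLICIES["product_data"]   # 3600 seconds, the same value everywhere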

Why it matters: Inconsistent TTLs create confusion: users see different data freshness depending on which path serves their request.

Key Takeaways 🎯

  1. Data freshness is a spectrum, not binary. Define acceptable staleness for each data type based on business requirements.

  2. TTL policies prevent index bloat and ensure users see relevant content. Implement document-level TTL for flexibility.

  3. Incremental updates are essential for large-scale systems. Use CDC, timestamps, or version numbers to track changes efficiently.

  4. Update propagation latency ranges from milliseconds (synchronous) to days (batch). Choose based on freshness requirements and resource constraints.

  5. Version your indexes to enable safe rollbacks and A/B testing. Use index aliasing for zero-downtime switches.

  6. Implement tiered storage (hot/warm/cold) to optimize costs. Automate lifecycle transitions based on age and access patterns.

  7. Soft deletes provide safety nets for accidental deletions. Hard deletes are for compliance (GDPR) and sensitive data.

  8. Monitor freshness metrics actively: index lag, staleness percentiles, failed updates. Alert when SLAs breach.

  9. Plan for partial failures. Distributed systems fail in creative waysβ€”implement retries, idempotency, and dead-letter queues.

  10. Balance freshness and cost. Real-time updates are expensive. Not every data type needs millisecond freshness.

📋 Quick Reference Card: Data Freshness Checklist

  • Define Staleness SLAs: Set acceptable lag for each data type (seconds/hours/days)
  • Implement Change Detection: Timestamps, versions, CDC, or checksums
  • Choose Update Pattern: Batch, micro-batch, streaming, or synchronous
  • Set TTL Policies: Document-level expiration based on content type
  • Enable Versioning: Index aliasing or snapshots for rollback capability
  • Configure Archival: Hot → warm → cold → glacier transitions
  • Handle Deletes: Soft delete by default, hard delete for compliance
  • Monitor Metrics: Index lag, staleness p99, failed updates, oldest doc age
  • Plan for Failures: Retries, dead-letter queues, alerting
  • Test Rollback: Regularly verify you can restore from backups/versions

📚 Further Study

🧠 Mnemonic for update patterns: "Bob Makes Silly Sandwiches" = Batch, Micro-batch, Streaming, Synchronous

🎓 Next steps: Practice implementing a TTL policy in your chosen vector database. Start with a simple document-level expiration field and expand to automated archival as you gain confidence.