Context Engineering

Q: The `ContextOverflowError` wrapper below converts silent API truncation into an explicit failure:

```python
class ContextOverflowError(Exception):
    pass

def call_model_with_overflow_detection(messages, model_context_limit, model_client, tokenizer):
    token_count = sum(
        len(tokenizer.encode(str(m.get("content", ""))))
        for m in messages
    )
    safe_limit = int(model_context_limit * {{1}})
    if token_count > safe_limit:
        raise {{2}}(
            f"Context size {token_count} exceeds safe limit {safe_limit}."
        )
    return model_client.complete(messages=messages)
```

What two values complete this function correctly?

["0.85", "ContextOverflowError"]

Master context as a finite resource: layers, budgets, offloading strategies, and project-level context as code.

Last generated May 19, 2026 UTC

Why Context Is the Bottleneck in Agentic Systems

Imagine handing a contractor a detailed project brief, watching them work for hours, and then discovering they've quietly forgotten half of it — not because they stopped caring, but because the sheet of paper it was written on silently shrank mid-task. They're still working, still producing output, still confident in what they're doing. But the constraints you specified on page two are gone. This is, more or less, what happens when an AI agent runs out of context — and the disconcerting part isn't the failure itself, it's that the system rarely tells you it's happening.

Every large language model operates on a context window: a fixed-size buffer of tokens that represents everything the model can "see" at the moment it generates a response. Instructions, conversation history, tool outputs, retrieved documents, and reasoning traces all compete for the same finite space. In a simple chat interaction, this constraint is easy to ignore — most conversations stay well within budget. But in agentic workflows, where an AI autonomously invokes tools, interprets results, and loops across many steps toward a goal, the window fills up fast.

The Fixed-Window Problem

LLMs don't have working memory in the way humans do. There is no separate long-term store the model passively draws from during inference. There is only the context window: a flat sequence of tokens with a hard upper bound. Whether that ceiling is tens of thousands or hundreds of thousands of tokens, it is finite, it is shared by every component, and it is consumed in real time.

In a single prompt-response exchange, the context contains your question and maybe some background — clean and bounded. In an agentic loop, consider what the context must hold at step N of a multi-step task:

┌─────────────────────────────────────────────────────────────┐
│                     CONTEXT WINDOW (fixed size)             │
├─────────────────────────────────────────────────────────────┤
│  System Prompt         ████████████  (static, always present)│
│  User Goal / Task      ████          (set at task start)     │
│  Turn 1: Agent thought ████          (accumulates each step) │
│  Turn 1: Tool call     ████          (accumulates each step) │
│  Turn 1: Tool result   ████████████  (can be very large)     │
│  Turn 2: Agent thought ████                                  │
│  Turn 2: Tool call     ████                                  │
│  Turn 2: Tool result   ████████████                          │
│  Turn 3: Agent thought ████                                  │
│  Turn 3: Tool call     ████                                  │
│  Turn 3: Tool result   ████████████  ← overflow risk here   │
│  Retrieved documents   ████████████████████                  │
│  Available space for   ██            ← almost gone          │
│  next reasoning step                                         │
└─────────────────────────────────────────────────────────────┘

Every component competes for the same fixed total. The system prompt you wrote doesn't get a reserved lane — it occupies real tokens on every single call.

💡 Mental Model: Think of the context window like RAM in a computer — not disk, not cloud storage, but RAM. It's fast, it's what the processor (the model) directly operates on, and it has a hard ceiling. External memory can supplement it, but the model only reasons over what's currently loaded into that window.

Why Agentic Workflows Hit the Ceiling Faster

Single-turn chat is essentially stateless from the model's perspective: question in, answer out, window cleared. Agentic systems break this pattern completely. The three main contributors to rapid context growth are:

🔧 Tool call results — When an agent calls a web search API, a code interpreter, or an external service, the result lands in the context. A single search result can be thousands of tokens.

🧠 Intermediate reasoning — Models that use chain-of-thought reasoning generate tokens explaining their thinking before each action. A ten-step task where the agent writes two hundred tokens of reasoning per step adds two thousand tokens of "thinking" before the task is done.

📚 Multi-step action history — The agent needs to remember what tools were called, what arguments were passed, and what came back. By default, that record lives in the context window.

Here is a minimal Python sketch that makes this growth rate concrete and measurable:

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Return the token count for a string using the model's tokenizer."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def log_context_budget(turn: int, components: dict[str, str], budget: int = 128_000) -> None:
    """
    Log token usage per component and flag when budget is running low.

    components: dict mapping component name to its string content
    budget: total context window size in tokens
    """
    total = 0
    print(f"\n--- Turn {turn} context breakdown ---")
    for name, content in components.items():
        tokens = count_tokens(content)
        total += tokens
        pct = (tokens / budget) * 100
        print(f"  {name:<25} {tokens:>6} tokens  ({pct:.1f}%)")

    used_pct = (total / budget) * 100
    print(f"  {'TOTAL':<25} {total:>6} tokens  ({used_pct:.1f}% of budget)")

    if used_pct > 75:
        print(f"  ⚠️  Warning: context at {used_pct:.0f}% — consider compressing history")

# Simulating how context grows across an agent's turns
system_prompt = "You are a research agent. Your goal is to... [500 words of instructions]" * 3

turn_1_history = "User: Find recent papers on context compression.\nAgent: I'll search now.\nTool result: [abstract 1]... [abstract 2]... [abstract 3]..."
turn_2_history = turn_1_history + "\nAgent: I found three papers. Let me retrieve the full text of the most relevant one.\nTool result: [full paper text, 4000 words]..."
turn_3_history = turn_2_history + "\nAgent: Now let me extract key findings and cross-reference with another source.\nTool result: [second source, 3000 words]..."

# Turn 0 is the baseline (system prompt only); turns 1-3 add the accumulated history.
for turn_num, history in enumerate(["", turn_1_history, turn_2_history, turn_3_history]):
    log_context_budget(turn_num, {
        "system_prompt": system_prompt,
        "task_history": history,
    })

By logging token counts per component at each turn, you stop treating context as an invisible resource and start treating it like something that can run out.

Silent Degradation: The Failure Mode Nobody Told You About

When a context window fills up, the system often does not fail loudly. You cannot count on a ContextOverflowException. Some provider APIs reject an oversized request with an error, but in many agent stacks nothing of the kind surfaces: a framework or SDK layer silently truncates the oldest tokens before sending the request, or the developer's own code drops history without realizing the implications. The agent continues operating, just with incomplete information.

The symptom profile of context overflow tends to follow a predictable arc:

1. Instruction drift: The agent starts violating constraints specified early in the system prompt. It was told to never make external API calls without user confirmation; three hundred turns in, it starts making them freely. The constraint wasn't revoked; it was truncated away.
2. Repeated work: The agent re-invokes a tool it already called with the same arguments, because the record of that earlier call has been pushed out of the window.
3. Hallucinated state: The agent confidently refers to a result it "retrieved" but which has scrolled out of context. Without the actual data in the window, the model fills the gap with plausible-sounding invention.

def detect_context_overflow(messages: list[dict], model_limit: int, model: str = "gpt-4o") -> bool:
    """
    Explicitly check for overflow BEFORE sending to the API.
    Never rely on silent API-side truncation.
    """
    full_text = " ".join(m["content"] for m in messages if isinstance(m.get("content"), str))
    total_tokens = count_tokens(full_text, model)

    if total_tokens > model_limit * 0.90:  # flag at 90% to give reasoning room
        print(f"[OVERFLOW RISK] {total_tokens} tokens — {model_limit * 0.90:.0f} threshold exceeded")
        print("  → Triggering compression before next call")
        return True
    return False

# Usage in a simplified agent loop skeleton:
# messages = build_current_context(history, system_prompt, retrieved_docs)
# if detect_context_overflow(messages, model_limit=128_000):
#     messages = compress_history(messages)   # covered in the next section
# response = call_model(messages)

Checking for overflow before the API call rather than relying on downstream behavior is the minimal change that separates agents that degrade silently from agents that degrade detectably.

🤔 Did you know? Context truncation behavior varies across providers and even across API versions. Some truncate from the beginning of the message list; others may behave differently depending on configuration. Explicit overflow detection in your own code is more durable than relying on any one API behaving predictably.
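Of the three symptoms listed earlier, repeated work is the easiest to surface mechanically, because a duplicated tool call leaves an exact fingerprint in the action log. The sketch below is one possible check, not a prescribed method: `find_repeated_tool_calls` and the `name`/`args` keys are hypothetical names chosen for illustration, and it assumes your tool arguments are JSON-serializable.

```python
import json
from typing import Any, Dict, List, Tuple


def find_repeated_tool_calls(calls: List[Dict[str, Any]]) -> List[Tuple[int, int]]:
    """Return (first_index, repeat_index) pairs for calls with identical name and args."""
    seen: Dict[str, int] = {}
    repeats: List[Tuple[int, int]] = []
    for index, call in enumerate(calls):
        # Canonical JSON (sorted keys) makes argument order irrelevant.
        fingerprint = json.dumps(
            {"name": call["name"], "args": call.get("args", {})},
            sort_keys=True,
        )
        if fingerprint in seen:
            repeats.append((seen[fingerprint], index))
        else:
            seen[fingerprint] = index
    return repeats


# Hypothetical action log: the third call repeats the first exactly.
log = [
    {"name": "search", "args": {"q": "context compression"}},
    {"name": "fetch", "args": {"url": "paper-1"}},
    {"name": "search", "args": {"q": "context compression"}},
]
print(find_repeated_tool_calls(log))
```

A non-empty result does not prove that context was truncated, since some repeats are legitimate retries, but it is a cheap signal that is worth logging alongside the token counts.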

Context as a Scarce Resource: The Engineering Mindset Shift

The leap from brittle agent to reliable agent is, in large part, a mindset shift about what context is. Most developers initially treat it as a configuration detail — something you set once and stop thinking about. The engineering discipline that agentic systems actually require is closer to how embedded systems engineers think about memory: every allocation is intentional, every byte that goes in was weighed against what it displaces, and the system has explicit logic for when to reclaim space.

🎯 Key Principle: The context window is not a staging area where you accumulate everything that might be useful. It is a working surface with a fixed area, and good context engineering is the practice of keeping the highest-value information on that surface at every moment of the agent's execution.

❌ Wrong thinking: "I'll pick a model with a 200K context window — that's big enough that I don't need to worry about this."

✅ Correct thinking: "A larger window delays the problem; it doesn't eliminate it. A 200K window filled with unmanaged history degrades just as silently as a 4K one. The same discipline applies at any scale."

The practical techniques for keeping context within budget — filtering what enters, compressing what's already there, and offloading to external storage — are the subject of the sections that follow. This section's job is the foundation: establishing that context is the central constraint that every downstream architectural decision flows from.

What Context Actually Contains: A Structural Overview

Before you can manage a context window intelligently, you need a precise picture of what is actually inside it. "Context" is often treated as a monolithic blob — the stuff the model sees — but at runtime it is composed of distinct components with very different growth rates, lifetimes, and management requirements.

The Four Main Components

At runtime, an agent's context window is typically occupied by four classes of content:

┌─────────────────────────────────────────────────────────┐
│                    CONTEXT WINDOW                       │
│  ┌───────────────────────────────────────────────────┐  │
│  │  1. SYSTEM INSTRUCTIONS          (static budget)  │  │
│  │     role definition, constraints, tool schemas    │  │
│  ├───────────────────────────────────────────────────┤  │
│  │  2. CONVERSATION & ACTION HISTORY (growing fast)  │  │
│  │     user turns, assistant turns, tool calls,      │  │
│  │     tool outputs, intermediate reasoning          │  │
│  ├───────────────────────────────────────────────────┤  │
│  │  3. RETRIEVED / INJECTED KNOWLEDGE  (on demand)   │  │
│  │     doc chunks, DB results, memory reads          │  │
│  ├───────────────────────────────────────────────────┤  │
│  │  4. STRUCTURED STATE SUMMARY       (compressed)   │  │
│  │     condensed agent beliefs, task progress        │  │
│  └───────────────────────────────────────────────────┘  │
│                                                         │
│  Total used: ████████████████████░░░░  (e.g. 78k/128k)  │
└─────────────────────────────────────────────────────────┘

Each component has a distinct character. System instructions are nearly constant across calls. Conversation and action history grows monotonically with each step. Retrieved knowledge spikes at injection and then stays flat until the next retrieval. State summaries are written deliberately to replace other components that have been compressed away.

Component 1: System Instructions

System instructions establish the agent's role, behavioral constraints, output format expectations, and the schemas or descriptions of available tools. They share two properties that make them worth understanding separately from everything else. First, they are essentially static: the system prompt changes rarely if at all between calls. Second, they represent a fixed token tax on every single invocation. A system prompt for a coding agent with a detailed persona, ten tool descriptions, and a few dozen lines of behavioral rules can easily consume 2,000–5,000 tokens — before a single word of actual task context has been added.

⚠️ Common Mistake: Many developers write system prompts without ever counting their tokens, treating them as "free." On a 128k context window, a 6,000-token system prompt reserves nearly 5% of the entire budget before the conversation begins. At 32k, it reserves nearly 19%. Measure the system prompt's token cost and factor it into budget planning deliberately.
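The percentages above are simple division, and a small helper makes the check repeatable for your own prompt. This is a minimal sketch: `baseline_share` is a hypothetical name, and the token count is passed in directly so that it can come from whichever tokenizer matches your model.

```python
def baseline_share(system_tokens: int, window_size: int) -> float:
    """Return the fraction of the context window consumed by the system prompt."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return system_tokens / window_size


# The two cases from the text: a 6,000-token prompt on 128k and 32k windows.
for window in (128_000, 32_000):
    share = baseline_share(6_000, window)
    print(f"{window:>7}-token window: {share:.1%} reserved by the system prompt")
```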

Component 2: Conversation and Action History

Conversation and action history is the most dynamic and, in practice, the most dangerous component of agent context. It accumulates every exchange: user messages, assistant responses, tool invocations, and the outputs those tool calls return.

A single agent step might involve one user message, one assistant reasoning step, three tool calls with arguments and full output, and one final assistant response. If each tool returns moderately detailed results, a single step can consume thousands of tokens. Across ten steps, you may have exhausted more budget from tool outputs alone than from the rest of the context combined.

Step 1:  [system: 4k] + [history: 0.5k] = 4.5k tokens
Step 3:  [system: 4k] + [history: 6k]   = 10k tokens
Step 7:  [system: 4k] + [history: 24k]  = 28k tokens
Step 12: [system: 4k] + [history: 61k]  = 65k tokens  ← halfway!
Step 18: [system: 4k] + [history: 110k] = 114k tokens  ← critical

Conversation and action history is the primary source of context overflow in agentic systems. It is not a static cost; it compounds with every step.

Component 3: Retrieved or Injected Knowledge

Retrieved knowledge covers any content that enters the context window on demand: document chunks fetched from a vector store, rows returned by a database query, results from a web search, or memories recalled from a prior session.

The defining property of retrieved knowledge is that it is discretionary — unlike system instructions and history, you choose what to inject and when. This makes it the component you have the most direct control over. It is also the component where the most common mistakes involve excess rather than scarcity. Each chunk injected displaces space that history and reasoning will need later.

The practical rule: retrieved content must be size-bounded before insertion. Whether that bound is enforced by selecting only the top-k most relevant chunks, by truncating to a character limit, or by summarizing before injection, some bound must exist. Retrieval without a budget constraint is context gambling, not context engineering.
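One way to enforce such a bound is to admit chunks in ranked order until a token budget is spent. The sketch below assumes the chunks arrive already sorted by relevance; `pack_chunks_within_budget` and `approx_tokens` are illustrative names, and the whitespace word count is only a stand-in for a real tokenizer.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_chunks_within_budget(
    ranked_chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Take chunks in ranked order until the next one would exceed max_tokens."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit; never exceed the bound
        selected.append(chunk)
        used += cost
    return selected


chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(pack_chunks_within_budget(chunks, max_tokens=5))
```

Stopping at the first chunk that does not fit, rather than skipping it and trying smaller ones, keeps the selection strictly in relevance order. Whether that is the right trade-off depends on how much your chunks vary in size.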

Component 4: Structured State Summaries

Structured state summaries are the most deliberately engineered of the four components. Rather than allowing history to accumulate indefinitely, a well-designed agent periodically compresses what it has learned into a compact, structured representation of the current task state — and optionally discards or offloads the raw history used to produce it.

A state summary might look like:

### Current Task State (as of step 9)
- Goal: Migrate legacy user records to new schema
- Completed: Tables users, sessions, preferences (3/7 tables)
- In progress: Table permissions (schema mismatch on 'role_id' column)
- Blocked: Table audit_log (missing write permissions, ticket #4421 filed)
- Pending: Tables tokens, oauth_clients
- Key decisions made: Using NULL for missing legacy fields rather than defaults
- Last tool output: ALTER TABLE returned success for 'permissions' on retry

This summary might replace 8,000 tokens of raw history with 200 tokens of structured state — a 40:1 compression ratio — while preserving everything the agent needs to continue coherently. The trade-off is that fine-grained detail is lost: if the agent later needs to re-examine the exact output of a tool call from step 3, that output is no longer in the window.

💡 Mental Model: Think of structured state summaries as the agent equivalent of a standup report. A standup does not replay every commit and test run from the past sprint. It extracts the signal — what's done, what's blocked, what's next — in a form that is immediately actionable.
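Keeping the state in a typed structure and rendering it to text on demand is one way to make summaries like the one above consistent from step to step. The following is a sketch under that assumption: `TaskState` and its field names are hypothetical, chosen to mirror the example summary rather than any particular framework.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskState:
    """Hypothetical container mirroring the fields of the example summary."""
    step: int
    goal: str
    completed: List[str] = field(default_factory=list)
    in_progress: List[str] = field(default_factory=list)
    blocked: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the state as a compact text block for injection into context."""
        def line(label: str, items: List[str]) -> str:
            return f"- {label}: {'; '.join(items) if items else 'none'}"

        return "\n".join([
            f"Current Task State (as of step {self.step})",
            f"- Goal: {self.goal}",
            line("Completed", self.completed),
            line("In progress", self.in_progress),
            line("Blocked", self.blocked),
            line("Pending", self.pending),
            line("Key decisions made", self.decisions),
        ])


state = TaskState(
    step=9,
    goal="Migrate legacy user records to new schema",
    completed=["users", "sessions", "preferences"],
    in_progress=["permissions"],
    blocked=["audit_log"],
    pending=["tokens", "oauth_clients"],
    decisions=["Use NULL for missing legacy fields rather than defaults"],
)
print(state.render())
```

Because the rendered text is derived from the fields, updating the state after each step is a field assignment rather than a free-form rewrite, which makes it harder to drop a constraint by accident.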

Making the Budget Visible: A Minimal Instrumented Agent Loop

Understanding these four components conceptually is useful. Being able to see their token costs at runtime is what makes the understanding actionable.

import tiktoken
from dataclasses import dataclass, field
from typing import Any

# Uses the cl100k_base encoding as an approximation. Token counts differ between
# model families (gpt-4o, for example, uses o200k_base), so adjust the encoding
# name to match your model's tokenizer.
enc = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    """Return the token count for a string."""
    return len(enc.encode(text))


@dataclass
class ContextBudgetLog:
    step: int
    system_tokens: int
    history_tokens: int
    retrieved_tokens: int
    state_summary_tokens: int

    @property
    def total(self) -> int:
        return (
            self.system_tokens
            + self.history_tokens
            + self.retrieved_tokens
            + self.state_summary_tokens
        )

    def report(self, window_size: int = 128_000) -> str:
        pct = self.total / window_size * 100
        return (
            f"Step {self.step:>3} | "
            f"system={self.system_tokens:>5} | "
            f"history={self.history_tokens:>6} | "
            f"retrieved={self.retrieved_tokens:>5} | "
            f"state={self.state_summary_tokens:>5} | "
            f"TOTAL={self.total:>6} ({pct:.1f}% of {window_size//1000}k)"
        )


@dataclass
class AgentContext:
    system_prompt: str
    history: list[dict[str, Any]] = field(default_factory=list)
    retrieved_chunks: list[str] = field(default_factory=list)
    state_summary: str = ""

    def measure(self, step: int) -> ContextBudgetLog:
        history_text = " ".join(
            str(msg.get("content", "")) for msg in self.history
        )
        retrieved_text = " ".join(self.retrieved_chunks)
        return ContextBudgetLog(
            step=step,
            system_tokens=count_tokens(self.system_prompt),
            history_tokens=count_tokens(history_text),
            retrieved_tokens=count_tokens(retrieved_text),
            state_summary_tokens=count_tokens(self.state_summary),
        )


def run_agent_loop(
    ctx: AgentContext,
    steps: int = 5,
    window_size: int = 128_000,
    warn_threshold: float = 0.75,
) -> None:
    """
    Simulated agent loop that measures context budget at each step.
    Replace the body of each step with actual LLM calls, tool invocations,
    and retrieval logic in production.
    """
    for step in range(1, steps + 1):
        ctx.history.append({"role": "user", "content": f"Step {step}: Continue the task."})
        ctx.history.append({
            "role": "assistant",
            "content": f"Step {step}: Reasoning and tool output would appear here. " * 30
        })
        # Simulate a retrieval event on even steps
        if step % 2 == 0:
            ctx.retrieved_chunks = ["Retrieved document chunk: " + "relevant content " * 50]
        else:
            ctx.retrieved_chunks = []  # Clear after use — don't carry forward stale retrievals

        budget = ctx.measure(step)
        print(budget.report(window_size))

        if budget.total / window_size >= warn_threshold:
            print(f"  ⚠️  WARNING: context at {budget.total/window_size:.0%} of budget — "
                  f"consider summarizing history ({budget.history_tokens} tokens).")


# Example usage
system = """You are a data migration assistant. Your job is to migrate
legacy database records to a new schema. Available tools: read_table,
write_table, check_schema, log_error. Never skip validation steps.
Always confirm destructive operations before proceeding."""

ctx = AgentContext(system_prompt=system)
run_agent_loop(ctx, steps=5, window_size=128_000, warn_threshold=0.75)

Note the ctx.retrieved_chunks = [] on odd steps, which clears the chunks fetched on the preceding even step. A common error is carrying retrieved chunks forward across steps even after they are no longer needed. Retrieved knowledge should be cleared from the active context after the step it was fetched for, unless it remains directly relevant to the immediately following step.

📋 Component Reference:

Component	Growth pattern	Controllability	Primary risk
System instructions	Fixed (static per agent)	Low — set at design time	Unaccounted baseline cost
Conversation & action history	Monotonically growing	Medium — requires active pruning	Overflow; most common cause
Retrieved / injected knowledge	Spikes on retrieval events	High — discretionary injection	Over-injection, stale chunks
Structured state summary	Controlled (written deliberately)	High — replaced on demand	Loss of fine-grained history

Thinking of context as these four distinct layers — with different growth behaviors and different control levers — gives you a practical framework for diagnosis. When a long-running agent starts behaving strangely, the first question is not "what did the model forget?" but rather "which component consumed the budget, and what got displaced?"
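That diagnostic question can be answered directly from two budget measurements. The sketch below is illustrative only: `largest_growth` is a hypothetical helper that takes per-component token counts as plain dictionaries, and the sample numbers are the step 7 and step 12 rows from the growth table earlier in this section.

```python
from typing import Dict, Tuple


def largest_growth(before: Dict[str, int], after: Dict[str, int]) -> Tuple[str, int]:
    """Return (component, token_delta) for the component that grew most."""
    if not after:
        raise ValueError("after must contain at least one component")
    # A component missing from the earlier measurement is treated as starting at 0.
    deltas = {name: tokens - before.get(name, 0) for name, tokens in after.items()}
    name = max(deltas, key=lambda k: deltas[k])
    return name, deltas[name]


step_7 = {"system": 4_000, "history": 24_000}
step_12 = {"system": 4_000, "history": 61_000}
print(largest_growth(step_7, step_12))
```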

Controlling What Enters the Window: Filtering, Compression, and Offloading

Once you accept that context is a finite, runtime-allocated budget, the next practical question becomes: what do you actually do about it? The answer is a layered set of decisions made at different points in an agent's lifecycle — before content enters the window, while it sits in the window, and after it has served its immediate purpose.

Relevance Filtering Before Injection

The cheapest token is the one that never enters the window. Relevance filtering is the practice of selecting only the most pertinent content from a larger retrieval result before injecting it into context — rather than inserting entire documents and hoping the model finds what it needs.

The two most common filtering mechanisms are embedding similarity and keyword overlap. Embedding similarity computes a vector representation of the current query or agent state and ranks retrieved chunks by cosine similarity to it, selecting only the top-k. Keyword overlap (including BM25-style scoring) is faster and more interpretable: it rewards chunks that share specific terms with the query.

from typing import List
import numpy as np

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two embedding vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    denom = np.linalg.norm(a_arr) * np.linalg.norm(b_arr)
    if denom == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return float(np.dot(a_arr, b_arr) / denom)

def filter_chunks_by_relevance(
    query_embedding: List[float],
    chunks: List[dict],  # each dict has 'text' and 'embedding' keys
    top_k: int = 3,
    min_score: float = 0.70,
) -> List[str]:
    """
    Return the top_k most relevant chunk texts above min_score.
    Chunks that don't clear the threshold are discarded entirely.
    """
    scored = [
        (chunk["text"], cosine_similarity(query_embedding, chunk["embedding"]))
        for chunk in chunks
    ]
    filtered = [
        text for text, score in sorted(scored, key=lambda x: x[1], reverse=True)
        if score >= min_score
    ][:top_k]
    return filtered

The min_score floor is important: without it, a top-k selection might include chunks that are genuinely irrelevant simply because they ranked highest among poor results.

⚠️ Common Mistake: Setting top_k without a minimum score floor. If retrieval quality is low, the "best" chunks may still be irrelevant — and you'll inject noise without realizing it.
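The keyword-overlap alternative mentioned above can be sketched with the standard library alone. This is deliberately the simplest possible variant, plain term overlap rather than BM25, and the names `tokenize`, `keyword_overlap`, and `filter_chunks_by_keywords` are hypothetical. It applies the same top-k plus minimum-score pattern as the embedding version.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its unique alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of query terms that also appear in the chunk (0.0 to 1.0)."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(chunk)) / len(query_terms)


def filter_chunks_by_keywords(
    query: str,
    chunks: List[str],
    top_k: int = 3,
    min_score: float = 0.5,
) -> List[str]:
    """Return up to top_k chunks whose overlap with the query is at least min_score."""
    scored = [(chunk, keyword_overlap(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, score in scored if score >= min_score][:top_k]


candidates = [
    "Cooking with context",
    "Recent papers on context compression",
    "Weather report",
]
print(filter_chunks_by_keywords("context compression papers", candidates))
```

Unlike BM25, this scoring ignores term frequency and document length, so treat it as a baseline to compare against rather than a drop-in replacement.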

Summarization as Compression

Filtering prevents bloat from incoming content. Summarization as compression addresses the bloat that builds up inside the window over time — specifically, long tool outputs and conversation history that has served its purpose but still occupies tokens.

The pattern: once a segment of context exceeds a token threshold, trigger a secondary LLM call to replace that segment with a shorter summary. The summary becomes the new canonical record; the verbose original is discarded from the window (though it should be written to external storage first — more on that below).

🎯 Key Principle: Summarization is lossy by design. You are explicitly trading completeness for headroom. The design question is not whether to lose information, but which information is safe to lose and when the trade is worth making.

from dataclasses import dataclass
from typing import List, Callable

@dataclass
class Message:
    role: str  # 'user', 'assistant', or 'tool'
    content: str
    token_count: int

def count_tokens(text: str) -> int:
    """Approximate token count (replace with a real tokenizer in production)."""
    return len(text.split())

def summarize_segment(
    messages: List[Message],
    llm_call: Callable[[str], str],  # injected LLM caller for testability
) -> Message:
    """Replace a list of messages with a single summary message."""
    combined = "\n".join(f"[{m.role}]: {m.content}" for m in messages)
    prompt = (
        "Summarize the following agent conversation segment concisely. "
        "Preserve all tool results, decisions made, and any constraints mentioned. "
        "Discard greetings, repeated confirmations, and redundant reasoning.\n\n"
        f"{combined}"
    )
    summary_text = llm_call(prompt)
    content = f"[SUMMARY OF PRIOR STEPS]: {summary_text}"
    return Message(
        role="system",
        content=content,
        token_count=count_tokens(content),  # count the prefix too, since it is sent
    )

def maybe_compress_history(
    history: List[Message],
    token_budget: int,
    compression_threshold: float,
    llm_call: Callable[[str], str],
) -> List[Message]:
    """
    If history exceeds compression_threshold * token_budget,
    compress the oldest half into a single summary.
    """
    total_tokens = sum(m.token_count for m in history)
    if total_tokens < compression_threshold * token_budget:
        return history

    midpoint = len(history) // 2
    if midpoint == 0:
        return history  # a single message has no older half to compress

    to_compress = history[:midpoint]   # oldest messages
    to_keep = history[midpoint:]       # most recent messages

    summary_msg = summarize_segment(to_compress, llm_call)
    return [summary_msg] + to_keep

This function compresses proactively — before the window is full, not after. Compressing the oldest half preserves recent context (which is usually more relevant to the current step) while reclaiming tokens from earlier turns.

⚠️ Common Mistake: Compressing the most recent messages instead of the oldest ones because they happen to be longer. This discards the freshest state and keeps stale reasoning.

External Memory Offloading

Summarization compresses what stays in the window. External memory offloading removes content from the window entirely — writing it to a key-value store or vector database and retrieving only what the current step actually needs.

┌─────────────────────────────────────────────────────┐
│              Agent Step N                           │
│                                                     │
│  ┌─────────────┐    retrieve    ┌───────────────┐  │
│  │  Context    │◄──(relevant)───│ Vector Store  │  │
│  │  Window     │                │ Key-Value DB  │  │
│  │             │───(write)─────►│               │  │
│  │ [system]    │  new results   │ step_1_output │  │
│  │ [summary]   │                │ step_2_output │  │
│  │ [retrieved] │                │ step_3_output │  │
│  └─────────────┘                └───────────────┘  │
│                                                     │
│  Token budget stays bounded ─ storage grows freely  │
└─────────────────────────────────────────────────────┘

A practical rule: any tool output longer than a few hundred tokens should be written to external storage and replaced in the window with a brief pointer — e.g., "[Step 2 output stored at key: step_2_api_response — retrieve if needed]". This approach also creates a natural audit trail for debugging long-running tasks.
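The pointer rule can be sketched in a few lines. The example below uses a plain dictionary as a stand-in for the key-value store, a whitespace word count as a stand-in for a tokenizer, and a hypothetical `offload_if_large` helper with an assumed threshold of 200; substitute your own store, tokenizer, and limit.

```python
from typing import Dict


def offload_if_large(
    key: str,
    output: str,
    store: Dict[str, str],
    max_inline_tokens: int = 200,
) -> str:
    """Return the output unchanged if small; otherwise store it and return a pointer."""
    if len(output.split()) <= max_inline_tokens:
        return output
    store[key] = output  # write the full result before replacing it in context
    return f"[Output stored at key: {key}; retrieve if needed]"


store: Dict[str, str] = {}
short_result = offload_if_large("step_1_status", "ALTER TABLE returned success", store)
long_result = offload_if_large("step_2_api_response", "row " * 500, store)
print(short_result)
print(long_result)
```

Only the pointer string enters the message history, so the window cost of a large tool output becomes roughly constant regardless of how large the output was.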

💡 Mental Model: Think of the context window as a desk and external memory as a filing cabinet. You keep only what you're actively working with on the desk; everything else goes in the cabinet with a label so you can find it.

Truncation Strategies and Their Trade-offs

Sometimes there isn't time or budget for elegant compression. Truncation — dropping content to fit the window — is the blunter instrument, and it has genuine failure modes worth understanding before you reach for it.

Strategy	What gets dropped	Primary failure mode
Sliding window	Oldest messages first	Loses initial instructions and early constraints
Selective pruning	Lowest-relevance turns	Requires scoring; may drop load-bearing context
Hard cutoff	Everything beyond a token limit	May cut mid-sentence or mid-tool-output

Sliding window truncation is simplest: keep the system prompt plus the N most recent messages, drop everything older. Its failure mode is predictable — tasks that require remembering something established early will silently fail once that early context scrolls out. Always treat the system prompt as exempt from sliding window truncation.

Selective pruning scores each turn for relevance and drops the lowest scorers. Its failure mode is that brevity can masquerade as irrelevance: a turn that says "OK, confirmed" looks low-relevance but might be the agent's record of a constraint acknowledgment.

Hard cutoff is easy to reason about but dangerous because it can drop content mid-structure — truncating a JSON object halfway through produces malformed context that causes unpredictable model behavior.

🎯 Key Principle: No truncation strategy is safe by default. Each requires explicit monitoring to detect when it has dropped load-bearing context.
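As an illustration of the first strategy, here is a minimal sliding window that exempts system messages, as recommended above. It assumes messages are dictionaries with a `role` key in the common chat format; `sliding_window` is a hypothetical helper name.

```python
from typing import Dict, List


def sliding_window(messages: List[Dict[str, str]], keep_last: int) -> List[Dict[str, str]]:
    """Keep every system message plus the keep_last most recent other messages."""
    system_msgs = [m for m in messages if m.get("role") == "system"]
    other_msgs = [m for m in messages if m.get("role") != "system"]
    # Guard keep_last == 0 explicitly: other_msgs[-0:] would return the whole list.
    recent = other_msgs[-keep_last:] if keep_last > 0 else []
    return system_msgs + recent


history = [
    {"role": "system", "content": "Never skip validation steps."},
    {"role": "user", "content": "Migrate the users table."},
    {"role": "assistant", "content": "Done."},
    {"role": "user", "content": "Now the sessions table."},
    {"role": "assistant", "content": "Working on it."},
]
print(sliding_window(history, keep_last=2))
```

This sketch counts messages, not tokens, and it does not keep a tool call together with its result. Both are refinements a production version would likely need.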

Idempotent Context Reconstruction

Filtering, compression, and offloading all make the same implicit promise: that the context the agent sees at each step is a faithful-enough representation of the world state it needs. Idempotent context reconstruction is the design discipline that makes that promise explicit and testable.

The idea: design your agent's external state so that the full context for any step can be rebuilt from scratch by replaying what's in storage. This means:

Every tool output is written to external storage before it's processed, not after.
Summaries are stored alongside the original segments they compressed.
The agent's current step index and key decisions are written to a durable key-value store.
The context-building function is a pure function of the stored state — given the same state, it produces the same context every time.

┌─────────────────────────────────────────────────────────────┐
│           Idempotent Context Reconstruction                 │
│                                                             │
│  External Storage                   Context Window          │
│  ┌─────────────────────┐            ┌──────────────────┐   │
│  │ step: 4             │            │ [system prompt]  │   │
│  │ decision_1: "use A" │──rebuild──►│ [summary 1-3]    │   │
│  │ step_1_output: ...  │            │ [step_4 context] │   │
│  │ step_2_output: ...  │            │ [retrieved: X,Y] │   │
│  │ step_3_output: ...  │            └──────────────────┘   │
│  │ summary_1_3: ...    │                                    │
│  └─────────────────────┘                                    │
│                                                             │
│  Same storage → same context, every time                    │
└─────────────────────────────────────────────────────────────┘

The practical payoff is pause-and-resume: if an agent is interrupted mid-task, it can be restarted at any step by reconstructing context from storage rather than re-running from the beginning. Idempotent reconstruction also makes filtering and compression reversible in principle — because the originals are in storage, a later step can always retrieve a full tool output if it turns out to need it.
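The four properties above can be sketched with a plain dict standing in for the durable store. The key names (`system_prompt`, `step`, `summary`, `summarized_through`, `step_N_output`) are assumptions made for this example, not a prescribed schema.

```python
def record_tool_output(store: dict, step: int, output: str) -> None:
    """Persist the raw output before anything else touches it."""
    store[f"step_{step}_output"] = output
    store["step"] = step


def build_context(store: dict) -> list[dict]:
    """Pure function of stored state: same store contents, same context."""
    messages = [{"role": "system", "content": store["system_prompt"]}]
    summarized_through = store.get("summarized_through", 0)
    if summarized_through:
        messages.append({
            "role": "assistant",
            "content": f"[Summary of steps 1-{summarized_through}] {store['summary']}",
        })
    # Steps not yet covered by a summary are included verbatim
    for n in range(summarized_through + 1, store["step"] + 1):
        messages.append({"role": "tool", "content": store[f"step_{n}_output"]})
    return messages


store = {"system_prompt": "Use option A.", "step": 0}
for n in (1, 2, 3):
    record_tool_output(store, n, f"raw output of step {n}")

# The summary is stored alongside the originals, which are not deleted
store["summary"] = "Steps 1-3 gathered inputs."
store["summarized_through"] = 3
record_tool_output(store, 4, "raw output of step 4")

first = build_context(store)
second = build_context(store)  # e.g. after a restart
```

Because `build_context` reads only from `store`, a restarted process holding the same store rebuilds an identical context, and the raw output of step 1 is still retrievable if a later step needs it.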

🧠 Mnemonic: STORE before you SCORE — write the raw result to external storage before you score, filter, or summarize it. This ensures you never permanently lose what you've seen.

Putting It Together: A Layered Decision Flow

These techniques aren't alternatives — they're layers applied at different moments:

RETRIEVAL                 INJECTION                ACCUMULATION
─────────                 ─────────                ────────────
Fetch N chunks            Filter to top-k          Track token count
     │                         │                        │
     │                    Score relevance          Exceeds threshold?
     │                    Drop below min_score          │
     │                         │                   ┌────┴─────┐
     │                    Write originals          Yes        No
     │                    to external store         │          │
     │                         │                 Compress    Continue
     └────────────────────────►│                 oldest N        │
                                │                messages        │
                           Inject filtered                       │
                           chunks only                          ►│
                                                          Next step

Note that external storage is written to at injection time, not just at compression time. Every piece of content that enters the window should have a durable record outside it before the agent proceeds.
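As a rough illustration of the injection column, the sketch below writes every fetched chunk to a store and then injects only the top-k chunks above a minimum score. Storing all fetched chunks (not only the injected ones) is a design choice made here so that dropped chunks stay retrievable; `filter_and_store` and the tuple layout are hypothetical.

```python
def filter_and_store(
    chunks: list[tuple[str, str, float]],  # (chunk_id, text, relevance score)
    min_score: float,
    top_k: int,
    store: dict,
) -> list[str]:
    """Persist originals, then return only the chunks worth injecting."""
    # Durable record first, so nothing that was fetched is lost
    for chunk_id, text, _score in chunks:
        store[chunk_id] = text

    passing = [c for c in chunks if c[2] >= min_score]
    passing.sort(key=lambda c: c[2], reverse=True)
    return [text for _chunk_id, text, _score in passing[:top_k]]


external_store: dict = {}
fetched = [
    ("c1", "alpha", 0.9),
    ("c2", "beta", 0.2),
    ("c3", "gamma", 0.7),
    ("c4", "delta", 0.6),
]
injected = filter_and_store(fetched, min_score=0.5, top_k=2, store=external_store)
print(injected)
```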

📋 Technique Reference:

Technique	When applied	What it controls	Failure mode
Relevance filtering	Before injection	Incoming chunk size	Drops valid chunks if threshold too high
Summarization	When budget exceeded	History length	Loses detail; irreversible if originals discarded
External offloading	After tool output	In-window accumulation	Adds retrieval latency; misses if query is poor
Truncation	Emergency fallback	Hard overflow	Drops load-bearing context silently
Idempotent reconstruction	At step start	Rebuilding from storage	Stale storage produces stale context

Common Context Engineering Mistakes and How to Detect Them

Context engineering failures are rarely loud. An agent won't throw an exception when its context window fills up, won't log a warning when a critical constraint quietly disappears from memory, and won't surface a conflict report when two parallel agents start acting on divergent world-states. The failures manifest as degraded behavior: subtly wrong outputs, repeated work, violated constraints, or contradictory actions. Because the symptoms look like reasoning errors rather than engineering errors, they often get misattributed — teams assume the model is unreliable when the actual problem is that the model never had reliable information to work with.

This section catalogs five recurring mistakes, each with a description of what causes it, what it looks like in practice, and how to detect it.

Mistake 1: Unbounded History Accumulation

Unbounded history accumulation is the most common context mistake in agentic systems. The pattern is straightforward: every message gets appended to a growing list, and that list is passed wholesale to the model on every turn. In a multi-step agent loop that runs dozens of iterations, this is a slow memory leak.

The immediate consequence is context window exhaustion. The subtler consequence is that the earliest — often most important — instructions are the first to get pushed out. The agent ends up reasoning from recent, often low-value tool outputs while losing access to the task framing established at step one.

How to detect it: Log the token count of the full message list at each step. In a healthy system, this count should plateau or oscillate; in an unbounded accumulation system, it climbs monotonically until it hits the model's limit.

import tiktoken

def count_message_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Estimate token count for a list of chat messages."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # approximate per-message structural overhead
        for value in msg.values():
            total += len(enc.encode(str(value)))
    return total

def run_agent_loop(initial_messages: list[dict], max_steps: int = 20):
    messages = initial_messages.copy()

    for step in range(max_steps):
        token_count = count_message_tokens(messages)
        print(f"Step {step}: context = {token_count} tokens")

        if token_count > 80_000:  # adjust to your model's context size
            print("WARNING: Context budget at risk; pruning required")
            break

        # ... call the model, append response, append tool results

If the logged token counts climb steadily step over step with no flattening, you have unbounded accumulation. Plotting counts per step during development — not just at the end — identifies exactly where to introduce a pruning trigger.
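If plotting is not convenient, the same check can be automated on the logged counts. This is a small sketch; `is_unbounded` and the `min_steps` default are illustrative choices, not established thresholds.

```python
def is_unbounded(token_counts: list[int], min_steps: int = 5) -> bool:
    """True if each of the last `min_steps` steps grew over the one before it."""
    if len(token_counts) < min_steps + 1:
        return False  # not enough data to judge
    recent = token_counts[-(min_steps + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


climbing = [1000, 2200, 3100, 4500, 6000, 7800]
oscillating = [1000, 2200, 3100, 1500, 2400, 3300]  # pruned after the third step

print(is_unbounded(climbing))
print(is_unbounded(oscillating))
```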

Mistake 2: Injecting Full Documents Instead of Excerpts

Full-document injection happens when a retrieval step fetches a large document and inserts it wholesale into the context. A 20-page document might consume 15,000–25,000 tokens, which in a smaller context window can be most of the budget left after instructions and history are accounted for. The model then has little room for the reasoning steps and tool outputs that follow.

The symptom is an agent that seems to understand retrieval-heavy tasks early in a run but produces shallow or truncated responses as the context fills.

❌ Wrong thinking: "The model needs all the context, so I should provide the whole document."

✅ Correct thinking: "The model needs the relevant portion. My job is to do the filtering so the model can do the reasoning."

def safe_inject_document(
    document_text: str,
    remaining_budget: int,
    max_injection_fraction: float = 0.25,
    tokenizer=None
) -> str:
    """
    Truncate a document to at most max_injection_fraction of the remaining budget.
    Logs a warning if truncation occurred.
    Note: raw token truncation may cut mid-sentence — use chunk-level
    retrieval as the primary solution, not this safety net alone.
    """
    if tokenizer is None:
        raise ValueError("Provide a tokenizer to count tokens accurately.")

    max_tokens = int(remaining_budget * max_injection_fraction)
    doc_tokens = tokenizer.encode(document_text)

    if len(doc_tokens) > max_tokens:
        print(
            f"⚠️ Document truncated: {len(doc_tokens)} tokens → {max_tokens} tokens. "
            f"Consider using chunk-level retrieval instead of full-document injection."
        )
        return tokenizer.decode(doc_tokens[:max_tokens])

    return document_text

This function is a safety net, not a substitute for proper retrieval design. The real fix is upstream: use embedding similarity or keyword overlap to retrieve chunks rather than full documents.
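A minimal sketch of the keyword-overlap variant, using only the standard library. The fixed-size word chunking and the function names are assumptions for illustration; an embedding-based retriever would replace `keyword_overlap` with a similarity score.

```python
import re


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of distinct query words that also appear in the chunk."""
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def top_chunks(query: str, document: str, k: int = 2, max_words: int = 50) -> list[str]:
    """Return the k chunks with the highest keyword overlap."""
    chunks = split_into_chunks(document, max_words)
    ranked = sorted(chunks, key=lambda ch: keyword_overlap(query, ch), reverse=True)
    return ranked[:k]


document = (
    "Refunds are processed within five days. "
    "Shipping is free over fifty dollars. "
    "Returns require a receipt."
)
best = top_chunks("how are refunds processed", document, k=1, max_words=6)
print(best)
```

Only the best-matching chunk enters the context; the rest of the document stays outside the window.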

Mistake 3: Instruction Drift

Instruction drift is the most operationally dangerous mistake on this list because it causes agents to silently violate their own rules mid-task. Critical behavioral constraints — "never expose PII," "always confirm before deleting," "output must be valid JSON" — are written into the system prompt at the start of the run. But as the run progresses and the context fills, summarization passes or sliding-window truncation can drop the system prompt if it is treated as just another block of text in the message history.

Step 1:  [SYSTEM: constraints A, B, C] [USER: task] [ASSISTANT: response]
         ↑ all constraints present and active

Step 12: [SUMMARY: "The agent has been working on..."] [USER: ...] [ASSISTANT: ...]
         ↑ constraints A, B, C silently absent — agent has drifted

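How to detect it: Run a constraint-audit probe at intervals during a long run. Every few steps (or after every model call when the constraints are safety-critical), make a separate lightweight call presenting only the constraints and the agent's most recent output, asking: "Does this output comply with all of the following rules?" A reported violation is a signal of possible drift and a prompt to check whether the constraints are still present in the context.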

How to prevent it: Keep the system prompt structurally separate from history, and never allow it to be part of a summarization pass. Most API-level message formats support a distinct system role precisely for this reason. For constraints so critical they must survive any compression scenario, re-inject them as a compact "constraint reminder" block at the start of each call.
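Both ideas can be sketched as two small message builders. The function names and the exact reminder wording are hypothetical; the point is only that constraints are passed in separately from history, so no summarization pass over `history` can remove them.

```python
def build_call_messages(
    system_prompt: str, constraints: list[str], history: list[dict]
) -> list[dict]:
    """Rebuild the system message on every call, with a constraint reminder."""
    reminder = "Constraint reminder:\n" + "\n".join(f"- {c}" for c in constraints)
    system_message = {"role": "system", "content": f"{system_prompt}\n\n{reminder}"}
    return [system_message, *history]


def build_audit_messages(constraints: list[str], latest_output: str) -> list[dict]:
    """Build the lightweight probe: constraints plus the latest output only."""
    rules = "\n".join(f"{i}. {c}" for i, c in enumerate(constraints, start=1))
    prompt = (
        "Does this output comply with all of the following rules?\n"
        f"{rules}\n\nOutput to check:\n{latest_output}"
    )
    return [{"role": "user", "content": prompt}]


constraints = ["Never make specific pricing claims without citing a source."]
# History has already been collapsed by a summarization pass
summarized_history = [
    {"role": "assistant", "content": "[Summary] The agent has been researching competitors."}
]

call = build_call_messages("You write customer-facing copy.", constraints, summarized_history)
audit = build_audit_messages(constraints, "Our plan costs $9 per month.")
```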

💡 Real-World Example: An agent tasked with generating customer-facing copy has a constraint: "Never make specific pricing claims without citing a source." After 15 steps of research and drafting, a summarization pass collapses the early history. The constraint disappears. The agent generates copy with specific, uncited price figures. No error is raised. The output looks polished. The constraint was simply gone.

Mistake 4: Inconsistent State Across Parallel Agents

When multiple agents operate in parallel, inconsistent shared state becomes a structural hazard. Two agents both read a shared state snapshot at time T. Agent A writes an update at T+1. Agent B, still operating on the T snapshot, takes an action that now contradicts A's update. Neither agent has an error; both acted correctly given their information. The system as a whole is in conflict.

       ┌─────────────────────────────────────────────────┐
       │              SHARED STATE at T                  │
       │   { "budget_remaining": 5000, "items": [] }     │
       └──────────────┬──────────────────────────────────┘
                      │ both agents read at T
           ┌──────────┴──────────┐
           ▼                    ▼
      Agent A (context)    Agent B (context)
      budget=5000          budget=5000
           │                    │
      spends 4000          spends 3000
           │                    │
           ▼                    ▼
    writes budget=1000    writes budget=2000
           │                    │
           └──────────┬─────────┘
                      ▼
              CONFLICT: actual spend = 7000
              but recorded budget > 0

What makes this acute in agentic contexts is that the inconsistency lives inside opaque context windows, invisible to standard monitoring.

How to detect it: After any parallel agent step that involves writes to shared state, run a consistency check: re-read the shared state from the authoritative store and compare it to each agent's most recent state snapshot. A divergence between what an agent believes the state to be and what the store actually contains is a drift indicator.

The root cause is the absence of an explicit state-sync mechanism. Agents need to either treat shared state as write-once-read-many within a step boundary and re-sync at step boundaries, or use an external coordination layer — a lock, a versioned store, an event log — that enforces ordering. The context window cannot be the source of truth for shared mutable state across agents.
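One form a versioned store can take is optimistic concurrency: every write must name the version it was based on, and a stale write is rejected. The sketch below is an in-memory illustration using a lock; `VersionedStore` and `VersionConflict` are hypothetical names, and a production system would more likely rely on a database or coordination service that offers the same compare-and-set guarantee.

```python
import threading


class VersionConflict(Exception):
    """Raised when a write is based on a stale snapshot."""


class VersionedStore:
    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._value = dict(initial)
        self._version = 0

    def read(self) -> tuple[dict, int]:
        """Return a copy of the state and the version it corresponds to."""
        with self._lock:
            return dict(self._value), self._version

    def write(self, new_value: dict, expected_version: int) -> int:
        """Apply the write only if nobody else has written since the read."""
        with self._lock:
            if expected_version != self._version:
                raise VersionConflict(
                    f"expected version {expected_version}, store is at {self._version}"
                )
            self._value = dict(new_value)
            self._version += 1
            return self._version


store = VersionedStore({"budget_remaining": 5000})
state_a, version_a = store.read()  # both agents read at time T
state_b, version_b = store.read()

store.write({"budget_remaining": state_a["budget_remaining"] - 4000}, version_a)

try:
    store.write({"budget_remaining": state_b["budget_remaining"] - 3000}, version_b)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True  # Agent B must re-read before acting

current_state, current_version = store.read()
```

Applied to the budget scenario in the diagram, Agent B's write is rejected instead of silently overwriting Agent A's, and on re-reading B sees that only 1000 remains.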

⚠️ Common Mistake: Designing a multi-agent system where each agent maintains its own "running summary" of shared progress. These summaries diverge immediately and there is no mechanism to reconcile them. Use a single external state store that all agents read from and write to atomically.

Mistake 5: Silent Token-Limit Truncation

Silent truncation is the mistake that makes all the other mistakes harder to diagnose. When a request exceeds the model's context limit, some API configurations, local inference servers, and client-side wrappers will truncate the input without raising an exception, while others reject the request outright. In the silent case, the call succeeds. The response looks plausible. But the model was reasoning over an incomplete picture, and nothing in the response indicates that. Each step's output then becomes the next step's input, compounding errors silently.

class ContextOverflowError(Exception):
    """Raised when a model call would exceed the safe context budget."""
    pass


def call_model_with_overflow_detection(
    messages: list[dict],
    model_context_limit: int,
    model_client,  # your model client instance
    tokenizer,
) -> dict:
    """
    Calls the model only if the token count is safely within the context limit.
    Raises an explicit error rather than allowing silent truncation.
    """
    token_count = sum(
        len(tokenizer.encode(str(m.get("content", "")))) for m in messages
    )

    # Reserve headroom for the model's response
    safe_limit = int(model_context_limit * 0.85)

    if token_count > safe_limit:
        raise ContextOverflowError(
            f"Context size {token_count} tokens exceeds safe limit {safe_limit}. "
            f"Apply pruning before this call. "
            f"(Model limit: {model_context_limit}, headroom reserved: 15%)"
        )

    return model_client.complete(messages=messages)

This wrapper converts a silent failure into an explicit one. The ContextOverflowError surfaces in logs, can trigger alerts, and forces the calling code to handle the situation deliberately rather than continuing with a degraded context. The 15% headroom is a practical heuristic to account for the model's response tokens; adjust it based on expected output length.
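When the expected output length is known, for example because the call sets an explicit output-token cap, the headroom can be derived from it instead of a fixed percentage. A small sketch; `safe_input_limit` and the `margin_tokens` default are assumptions for illustration.

```python
def safe_input_limit(
    model_context_limit: int,
    max_output_tokens: int,
    margin_tokens: int = 256,
) -> int:
    """Input budget = context limit - reserved output - margin for counting error."""
    reserved = max_output_tokens + margin_tokens
    if reserved >= model_context_limit:
        raise ValueError("Reserved tokens leave no room for input.")
    return model_context_limit - reserved


limit = safe_input_limit(model_context_limit=128_000, max_output_tokens=4_096)
print(limit)
```

The margin exists because client-side token counts are approximations of what the API actually counts.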

Putting the Diagnostics Together

Each mistake has its own detection signal, but in practice they tend to co-occur. A system with unbounded history accumulation will eventually trigger silent truncation; full-document injection accelerates the accumulation; instruction drift is the consequence of truncation hitting the system prompt. The mistakes form a failure cascade, not isolated incidents.

📋 Context Mistake Diagnostics:

Mistake	Symptom	Detection Method
Unbounded accumulation	Token count climbs per step	Log tokens per call over a full run
Full-document injection	No budget for tool outputs	Log retrieval result sizes vs. remaining budget
Instruction drift	Agent violates its own rules	Constraint-audit probe at intervals
Parallel state conflict	Agents produce contradictory actions	Compare agent state snapshots to authoritative store
Silent truncation	Subtle errors with no signal	Explicit overflow detection before each call

The underlying pattern across all five diagnostics is the same: make the invisible visible. Context state is not naturally observable — you have to instrument it deliberately.

Key Takeaways and What Comes Next

Everything covered in this lesson converges on a single organizing idea: context is a runtime-allocated resource with a hard ceiling, and every design decision in an agentic system either respects that ceiling or collides with it.

The Four Control Levers: A Consolidated Reference

This lesson introduced four practical mechanisms for keeping context within budget, each operating at a different point in the data lifecycle:

  External sources
  (docs, DB, memory)
         │
         ▼
  ┌─────────────┐
  │  FILTERING  │  ← Select only relevant chunks before injection
  └──────┬──────┘
         │
         ▼
  ┌──────────────┐
  │ COMPRESSION  │  ← Summarize or condense what is already in-window
  └──────┬───────┘
         │
         ▼
  ┌──────────────────┐
  │  CONTEXT WINDOW  │  ← Active token budget (instructions + history
  │                  │    + retrieved data + state)
  └──────┬───────────┘
         │
         ▼
  ┌────────────────┐
  │   OFFLOADING   │  ← Write intermediate results to external storage
  └──────┬─────────┘
         │
         ▼
  ┌──────────────────────┐
  │   RECONSTRUCTION     │  ← Rebuild context from storage on demand
  └──────────────────────┘

These four levers are not mutually exclusive; production agent systems typically apply all of them in combination. A simplified example showing three of them (filtering, compression, and offloading) operating together:

from typing import Callable

def agent_step(
    system_prompt: str,
    history: list[dict],
    new_tool_output: str,
    retrieve_fn: Callable[[str, int], list[str]],  # (query, max_chunks) -> chunks
    summarize_fn: Callable[[str], str],            # text -> summary
    offload_fn: Callable[[str, str], None],        # (key, value) -> None
    query: str,
    max_tokens: int = 8192,
    compression_threshold: float = 0.75,
) -> list[dict]:
    """
    One step of a context-aware agent loop demonstrating filtering,
    compression, and offloading working together.
    Simplified for clarity — production use would also handle
    parallel-agent state sync and idempotent reconstruction.
    """
    # Step 1: Filter — retrieve only the top 3 relevant chunks
    retrieved = retrieve_fn(query, max_chunks=3)

    # Step 2: Audit the current budget
    # (uses the string-based count_tokens helper defined in the
    # monitoring section later in this lesson)
    history_text = "\n".join(m["content"] for m in history)
    retrieved_text = "\n".join(retrieved)
    total_used = (
        count_tokens(system_prompt)
        + count_tokens(history_text)
        + count_tokens(retrieved_text)
        + count_tokens(new_tool_output)
    )

    # Step 3: If approaching the threshold, offload first, then compress
    if total_used / max_tokens > compression_threshold:
        # Store before you summarize: the raw history is persisted even if
        # summarization fails. A fixed key overwrites earlier snapshots;
        # use a per-step key if every snapshot must stay recoverable.
        offload_fn(key="history_snapshot", value=history_text)

        # Step 4: Replace the in-window history with its summary
        compressed = summarize_fn(history_text)
        history = [{"role": "system", "content": f"[History summary]: {compressed}"}]
        print("[context] History compressed and offloaded.")

    history.append({"role": "tool", "content": new_tool_output})
    return history

The offload step is what makes reconstruction possible — the original history is not discarded, just moved out of the window.

Why Monitoring Is Non-Negotiable

Context overflow does not look like a crash. It looks like gradual degradation: an agent that starts ignoring earlier constraints, repeats work it has already done, or asserts facts about state that have since changed. This asymmetry — hard failure signals are easy to detect; soft degradation signals are not — is why explicit token-count monitoring is a baseline operational requirement, not an optional debugging aid.

import tiktoken
from dataclasses import dataclass
from typing import Any

@dataclass
class ContextBudget:
    system: int
    history: int
    retrieved: int
    state: int
    max_tokens: int

    @property
    def total_used(self) -> int:
        return self.system + self.history + self.retrieved + self.state

    def log(self) -> None:
        pct = (self.total_used / self.max_tokens) * 100
        print(f"[context] system={self.system} history={self.history} "
              f"retrieved={self.retrieved} state={self.state} "
              f"total={self.total_used}/{self.max_tokens} ({pct:.1f}%)")
        if pct > 80:
            print("[context] ⚠️  WARNING: context budget above 80% — "
                  "consider compressing history or offloading state")


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens using tiktoken for OpenAI-compatible models.
    For other model families, substitute the appropriate tokenizer.
    Treat the result as a close approximation; special tokens added
    by the API are not included."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


def build_and_audit_context(
    system_prompt: str,
    history: list[dict[str, Any]],
    retrieved_chunks: list[str],
    state_summary: str,
    max_tokens: int = 8192,
) -> ContextBudget:
    history_text = "\n".join(m["content"] for m in history)
    retrieved_text = "\n".join(retrieved_chunks)

    budget = ContextBudget(
        system=count_tokens(system_prompt),
        history=count_tokens(history_text),
        retrieved=count_tokens(retrieved_text),
        state=count_tokens(state_summary),
        max_tokens=max_tokens,
    )
    budget.log()
    return budget

Add this audit call to your agent loop's entry point before you touch any compression or filtering logic. Running it first tells you what the unmanaged budget looks like — the baseline you need before you can evaluate whether your management strategies are actually helping. The 80% threshold is a reasonable starting default; calibrate it based on how much headroom your agent's reasoning and tool outputs typically consume.

What You Now Understand That You Did Not Before

When you read an agent implementation now, you will instinctively look for three things that were not on your radar before:

Does this agent track how many tokens each component is consuming, or does it grow history unboundedly?
When retrieved data enters the window, is it size-bounded before insertion, or does it insert whole documents?
When the agent runs for many steps, is there a compression or offloading strategy, or does the window simply fill until something breaks silently?

The absence of these three practices is a reliable signal that the system will degrade under realistic workloads — not if, but when the context fills.

Where the Sub-Topics Take You Next

🔧 Five Context Layers formalizes the structural overview introduced in What Context Actually Contains into a precise layering model, giving you a standard vocabulary for describing and diagnosing context composition across different agent architectures.

🧠 Attention Budgets and Failure Modes goes deeper on the degradation mechanics touched on in Common Context Engineering Mistakes — specifically, how attention degrades as context fills and which failure patterns emerge at different fill levels.

📄 AGENTS.md / CLAUDE.md shows how to encode context policy as version-controlled project files — treating context management decisions as code artifacts that live in the repository alongside the agent implementation, rather than as ad hoc runtime choices.

📋 Lesson Summary:

Concept	Core Idea	Practical Action
Context window	Fixed token budget shared by all components	Track per-component token counts at every call
History growth	Fastest-growing component; primary overflow source	Apply compression or offloading before 75–80% fill
Filtering	Reduce what enters the window at retrieval time	Use similarity scoring; never insert whole documents
Compression	Shrink content already in-window	Summarize history segments; preserve constraints explicitly
Offloading	Move state to external storage	Write before compressing so reconstruction is possible
Reconstruction	Rebuild context from storage on demand	Design state so any step can be replayed from scratch
Silent failure	Overflow degrades output, not execution	Monitor token counts; never rely on API silent truncation
Context as code	Policy decisions belong in version-controlled files	Use AGENTS.md / CLAUDE.md to codify context rules

🧠 Mnemonic: FCOR — Filter before injection, Compress what is there, Offload to storage, Reconstruct on demand. These four operations cover the primary management lifecycle for most agent context budgets; complex multi-agent systems with shared state will require additional coordination mechanisms beyond this foundation.

The goal of everything in this lesson is not to make context management invisible. It is to make it legible — to move context from an implicit, unmonitored resource that surprises you at the worst moment into an explicit, observable budget that behaves predictably under load.
