Jun 13, 2026 · 11 min · Dev Guides

Managing Context in Long-Running Claude Agents: Tool Search, Context Editing & Compaction

MR By Marcus Reed · Senior API Engineer

A Claude agent that runs for dozens of steps faces one inevitable enemy: its own context window. Every tool call adds a result, every turn adds history, and eventually the window fills with material that’s no longer relevant. Three techniques — tool search, context editing, and compaction — keep long agents running reliably and cheaply. Here’s how to use them, and how they combine with prompt caching.

Why long agents overflow

The window doesn’t fill with your prompt — it fills with accumulation:

Tool definitions, especially when you expose many tools “just in case.”
Tool results, which can be large (file contents, API responses, search output).
Conversation history that grows every turn.

Past a certain point you either truncate (losing information) or pay to carry dead weight every single step. The fix is to manage what’s in the window deliberately.

1. Tool search: load tools on demand

If your agent has 30 tools but uses 3 per task, defining all 30 on every request wastes input tokens and clutters the model’s choices. Tool search lets the agent discover and load tools when it needs them instead of carrying the full catalog upfront.

Keep a small core of always-loaded tools.
Expose the rest through a search/lookup mechanism the model can query.
The window only ever holds the tools currently in play.

This alone can dramatically shrink the fixed overhead of a tool-heavy agent.

2. Context editing: remove stale tool results

Once a tool result has been used, it often doesn’t need to stay verbatim in the window. Context editing removes or condenses stale tool_result blocks while preserving the reasoning that depended on them.

After the agent has acted on a large result, drop the raw payload.
Keep a short note of what it contained if later steps reference it.
Prune oldest-first as the window approaches its limit.

The art is removing bytes without removing meaning — keep the conclusions, drop the raw data.

3. Compaction: summarize and continue

When the window is genuinely near full, compaction summarizes the conversation so far into a compact form and continues from there. Instead of hitting a hard wall, the agent carries forward a distilled memory.

Trigger compaction at a threshold (e.g., when the window is ~80% full).
Summarize goals, decisions made, and open tasks — not every keystroke.
Resume with the summary plus only the live working set.

Compaction is what lets an agent run effectively “forever” instead of dying when it hits the context ceiling.

Combining them with prompt caching

These techniques interact with caching, and the order matters:

Keep your stable prefix (system prompt, core tools, instructions) at the front so it stays cached.
Apply context editing and compaction to the volatile middle/end of the window.
Avoid editing inside the cached prefix — any change there invalidates the cache and re-charges full input price.

Used together, you get a long agent that holds only what’s relevant, reads its stable prefix from cache cheaply, and never falls off the end of its window.

Putting it together

A robust long-agent loop looks like:

Load a small core toolset; expose the rest via tool search.
After each step, context-edit away the raw tool result you no longer need.
Monitor window usage; when it crosses your threshold, compact.
Keep the cacheable prefix immutable so cache hits stay high.

Cost angle

All three techniques reduce input tokens, which compounds with the rate you pay per token. Running the agent through a discounted gateway like AI Prime Tech — same Claude models, up to 80% off official pricing, one key across Opus 4.8, Sonnet 4.6 and Haiku 4.5 — means a well-managed long agent costs a small fraction of the naive version that carries its full history every step. Lean context plus a lean rate is how you run agents at scale without the bill scaling with them.

Marcus Reed · Senior API Engineer

Marcus has spent 9 years building LLM-backed products and integrating the Claude, GPT and Gemini APIs into production systems. He writes about API cost optimization, agent architecture, and practical model selection.

Get cheaper Claude API access

One API key for Claude Opus 4.8, Sonnet 4.6, Haiku 4.5, Fable 5, plus GPT & Gemini — up to 80% off official pricing, pay-as-you-go.

Get Your API Key →

AI Prime Tech is an independent third-party API gateway. Claude™ and Anthropic® are trademarks of Anthropic, PBC. No affiliation or endorsement is implied.