Same agent. Same model.
A fraction of the bill.

Long-running tool-using agents accumulate verbose outputs in their transcripts — shell logs, file reads, test results, browser traces, intermediate plans. By default these are appended directly to the conversation, producing context pressure that drives cost and latency up without proportionally improving decisions. Espada attacks this at the orchestration layer, between the agent and the model, with eight composable, model-agnostic reducers. Most are agent-transparent: a normal tool call yields a normal-looking result; the runtime decides how it persists and how it is resent on later turns.

  • 81–93%

    Lower cumulative tokens

    Across long, tool-heavy agent sessions.

  • 80–99%

    Per-artifact reduction

    On verification logs, workflow envelopes, historical results.

  • $10 → ~$1

    Typical task cost

    Same task, same model, same agent — Espada in the middle.

The eight reducers

  1. Skill routing

    Inject only the top-ranked skills for the active prompt. Ambiguous prompt? Keep the full set — capability never silently disappears.

  2. Tool-result shaping

    Rewrite workflow_run JSON envelopes into a dense text shape. Keys appear once per envelope, not once per operation.

  3. Verification compression

    Detect test, lint, and build logs. Strip routine progress on pass; preserve failure signal verbatim — assertions, errors, expected vs received.

  4. Transcript virtualization

    Old, large tool results are swapped for compact stubs that point back to the original. The freshest one or two turns stay full-fidelity.

  5. Lossless raw refs

    When a result is shaped, the raw text is persisted synchronously to an append-only sidecar. Nothing is dropped — only deferred.

  6. Targeted retrieval

    context_ref_get resolves a saved reference in one of three modes: a default read, a line range, or a substring query. Every read is bounded, so there are no surprise blow-ups.

  7. Run-state store

    A session-scoped key-value tool. The agent stops pasting plans, intermediate JSON, and partial results back into the transcript.

  8. Argument compaction

    Even the arguments to a run_state set call can be huge. The runtime compacts the on-wire value so the persisted transcript stays small.
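Skill routing can be illustrated with a small sketch. The keyword-overlap score, the Skill type, and the route_skills function below are hypothetical stand-ins (the page does not say how skills are ranked); what the sketch does show is the stated fallback: when the ranking cannot clearly separate the top skills from the rest, the full set is kept.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    keywords: frozenset[str]  # hypothetical: a real router would likely use a learned ranker


def score(skill: Skill, prompt: str) -> int:
    # Relevance = number of the skill's keywords that appear as words in the prompt.
    words = set(re.findall(r"[a-z0-9_]+", prompt.lower()))
    return len(skill.keywords & words)


def route_skills(skills: list[Skill], prompt: str, top_k: int = 2, min_margin: int = 1) -> list[Skill]:
    ranked = sorted(((score(s, prompt), s) for s in skills), key=lambda p: p[0], reverse=True)
    # No skill matches at all: the prompt is ambiguous, so keep every skill.
    if not ranked or ranked[0][0] == 0:
        return list(skills)
    # The cut-off between the top_k and the rest is too close to call: keep every skill.
    if len(ranked) > top_k and ranked[top_k - 1][0] - ranked[top_k][0] < min_margin:
        return list(skills)
    return [s for _, s in ranked[:top_k]]
```

With skills for git, tests, and a browser, the prompt "Fix the failing pytest test, then commit." routes to the tests and git skills only, while "What should I do next?" matches nothing and keeps all three.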
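The page does not show the workflow_run envelope format, so the tool-result shaping sketch below assumes a hypothetical one: top-level workflow and status fields plus an operations list of flat objects. It illustrates "keys once per envelope, not once per operation" by hoisting the operation keys into a single header row and emitting one pipe-separated row per operation.

```python
import json


def shape_envelope(raw: str) -> str:
    # Assumed envelope shape (hypothetical): {"workflow", "status", "operations": [{...}, ...]}
    env = json.loads(raw)
    ops = env.get("operations", [])
    lines = [f"workflow={env.get('workflow')} status={env.get('status')} ops={len(ops)}"]
    if ops:
        # Collect keys in first-seen order so each key is written exactly once, in the header.
        keys: list[str] = []
        for op in ops:
            for k in op:
                if k not in keys:
                    keys.append(k)
        lines.append("|".join(keys))
        # One row per operation; a missing or null field becomes an empty cell.
        for op in ops:
            lines.append("|".join("" if op.get(k) is None else str(op[k]) for k in keys))
    return "\n".join(lines)
```

A production shaper would also need to escape values that contain "|" or newlines; this sketch does not.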
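Verification compression depends on reading a log's structure rather than its keywords: a summary that says "0 failed" is a pass even though it contains the word "failed". The sketch below assumes a runner that prints an "N failed" count in its last line and marks failures with words such as FAIL, Error, assert, expected, or received; both patterns are assumptions, not Espada's detector. Logs in an unrecognized format pass through unchanged, so no signal is lost.

```python
import re
from typing import Optional

# Summary counts such as "0 failed" or "3 failed" (assumed runner format).
FAILED_COUNT = re.compile(r"\b(\d+)\s+failed\b", re.IGNORECASE)
# Lines that carry failure signal and must survive verbatim (assumed markers).
FAILURE_LINE = re.compile(r"FAIL|Error|assert|[Ee]xpected|[Rr]eceived")


def failed_count(log: str) -> Optional[int]:
    counts = [int(n) for n in FAILED_COUNT.findall(log)]
    return max(counts) if counts else None


def compress_verification(log: str) -> str:
    lines = log.splitlines()
    n_failed = failed_count(log)
    if n_failed is None:
        return log  # unrecognized format: pass through untouched
    summary = lines[-1] if lines else ""  # assumes the summary is the last line
    if n_failed == 0:
        return summary  # pass: drop routine progress, keep only the summary
    # Fail: keep failure-bearing lines verbatim, in order, plus the summary.
    kept = [ln for ln in lines if FAILURE_LINE.search(ln)]
    if summary not in kept:
        kept.append(summary)
    return "\n".join(kept)
```

On a failing run, the progress lines disappear while the failing test name, the assertion, and the expected-versus-received message are kept exactly as the runner printed them.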
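Transcript virtualization can be sketched as a pure function over the message list. The message shape, the ref_id field, the stub text, and the thresholds are assumptions for illustration; the sketch also approximates "the freshest one or two turns" by the most recent tool results. It returns a new list and never mutates the stored transcript.

```python
def virtualize(messages: list[dict], keep_recent: int = 2, max_chars: int = 500) -> list[dict]:
    """Return a copy of the transcript with old, large tool results replaced by stubs.

    Assumes (hypothetically) that every tool message carries a "ref_id" pointing at
    its raw text in a sidecar store, and approximates "turns" by tool results.
    """
    tool_positions = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    recent = set(tool_positions[-keep_recent:]) if keep_recent > 0 else set()
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "tool" and i not in recent and len(m["content"]) > max_chars:
            stub = f"[stub ref={m['ref_id']} chars={len(m['content'])}; read with context_ref_get]"
            out.append({**m, "content": stub})  # new dict: the original message is untouched
        else:
            out.append(m)
    return out
```

Small results are left alone even when old, since a stub would save little; only results that are both stale and large are swapped out.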
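Lossless raw refs and targeted retrieval fit together: the raw text is written durably before the shaped result is returned, and later reads pull back only the slice that is needed. In the sketch below, the JSONL sidecar format, the RawSidecar class, and the get parameters are hypothetical; the page names context_ref_get but does not document its parameters. The three read modes and the line cap mirror the default, line-range, and substring-query behavior described above.

```python
import json
import os
import uuid
from typing import Optional


class RawSidecar:
    """Append-only JSONL sidecar of raw tool outputs (hypothetical format)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def save(self, raw: str) -> str:
        ref_id = uuid.uuid4().hex[:12]
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"ref": ref_id, "raw": raw}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # synchronous: durable before the shaped result is handed back
        return ref_id

    def _load(self, ref_id: str) -> str:
        # Linear scan keeps the sketch simple; a real store would keep an index.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["ref"] == ref_id:
                    return record["raw"]
        raise KeyError(ref_id)

    def get(self, ref_id: str, start: Optional[int] = None, end: Optional[int] = None,
            query: Optional[str] = None, max_lines: int = 50) -> str:
        lines = self._load(ref_id).splitlines()
        if query is not None:
            selected = [ln for ln in lines if query in ln]  # substring query
        elif start is not None or end is not None:
            lo = (start or 1) - 1  # 1-based, inclusive line range
            hi = end if end is not None else len(lines)
            selected = lines[lo:hi]
        else:
            selected = lines  # default read
        # Every mode is capped, so a retrieval cannot re-inflate the context.
        return "\n".join(selected[:max_lines])
```

Because the file is only ever appended to, a stub written earlier in the session keeps resolving to the same raw text.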
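The run-state store and argument compaction work as a pair: the agent parks large values under a key instead of pasting them into the conversation, and the copy of the set call that is persisted to the transcript has its oversized value replaced by a placeholder. The RunState class, the compact_call_args function, the 200-character threshold, and the placeholder text are all hypothetical; the full value is still stored and readable by key.

```python
import json
from typing import Any


class RunState:
    """Session-scoped key-value store standing in for the run_state tool (hypothetical API)."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def set(self, key: str, value: Any) -> str:
        self._data[key] = value
        # The agent gets a short acknowledgement instead of its own payload echoed back.
        return f"stored {key}"

    def get(self, key: str) -> Any:
        return self._data.get(key)


def compact_call_args(args: dict, max_chars: int = 200) -> dict:
    """Return the copy of a run_state set call that gets written to the transcript."""
    encoded = json.dumps(args.get("value"))
    if len(encoded) <= max_chars:
        return args  # small values are persisted as-is
    placeholder = f"<compacted {len(encoded)} chars; run_state get key={args['key']}>"
    return {**args, "value": placeholder}
```

The runtime executes the call with the full arguments and persists only the compacted copy, so the live value and the transcript record diverge on purpose.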

How it composes

Espada's reducers sit at the runtime/orchestration layer. They compose with everything else you're already doing to keep context cost down — they don't replace it.

  • Provider prompt caching

    Anthropic and OpenAI prompt caching reduces the unit cost of unchanged prefixes by up to ~90%. Espada attacks volume instead. The two compose: shorter shaped tail content preserves more of the cacheable-prefix budget, and historical virtualization runs at indices already outside the typical cache window.

  • Contextual retrieval

    Anthropic Contextual Retrieval operates over an external knowledge base. Espada's reducers operate over the agent's own past tool outputs within a single session — a different object. A deployment can use both at the same time.

  • Prompt-token compression

    LLMLingua and Selective-Context are provider- and task-agnostic by design, so they can't exploit the structure of a workflow_run envelope or tell a "0 failed" summary apart from a real failure. Espada's structural shapers can. Token-layer compression can still be applied on top of shaped output.

All eight reducers can be disabled independently for ablation. Per-fixture studies, the full loss model, and the reproducibility protocol are in the research paper. The protocol covers the corpus, the baselines (LLMLingua-2, Selective-Context, naive truncation, single-LLM summary), provider tokenizers, and task-quality benchmarks. Read the full design in the paper.
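A leave-one-out ablation over independently switchable reducers can be expressed as a frozen config with one flag per reducer. The ReducerConfig field names and the leave_one_out helper below are hypothetical, not Espada's configuration format; the sketch only shows how eight independent switches yield eight single-reducer-off variants.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ReducerConfig:
    # Hypothetical flag names, one per reducer listed above.
    skill_routing: bool = True
    tool_result_shaping: bool = True
    verification_compression: bool = True
    transcript_virtualization: bool = True
    lossless_raw_refs: bool = True
    targeted_retrieval: bool = True
    run_state_store: bool = True
    argument_compaction: bool = True


def leave_one_out(base: ReducerConfig = ReducerConfig()) -> dict[str, ReducerConfig]:
    # One variant per reducer, identical to the base except that reducer is disabled.
    return {f.name: replace(base, **{f.name: False}) for f in fields(base)}
```

Some variants may not be meaningful on their own: virtualized stubs, for example, point into the raw-ref sidecar, so a harness might flag virtualization-without-raw-refs runs rather than treat them like the others.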

Ready when you are.

One binary. One install. One hour to your first signed action.