Claude Code, OpenCode, and π (pi): anatomy of a trivial request

Me included in Software Development Artificial Intelligence

May 18, 2026 2882 words 14 minutes

Contents

Intro

Anyone following the evolution of coding agents closely has probably heard about pi, the minimalist harness that became popular in part because it is one of the components behind OpenClaw.

One of the arguments in favor of pi’s minimalism is the contrast with the supposed “heaviness” of Claude Code, meaning an excessive use of tokens to carry out even trivial tasks. In that debate, pi and other harnesses built around the same philosophy are carving out some space.

Personally, I’m a happy OpenCode user, but I’m also curious and always looking for the best possible setup, so I wanted to understand this better. Since I don’t trust everything I read, I decided to verify for myself the concrete differences in the payload sent to the model for the simplest task possible: what actually gets sent to the model on the first turn as a baseline?.

I wasn’t trying to run a benchmark, let alone analyze the productivity impact of one agent versus another. I just wanted to look at the payload generated by the three most popular agents on a trivial task, and understand where the tokens come from, how they are spent, and whether “pi is 10x cheaper” is really true in the way people casually repeat online.

So I set up the simplest possible scenario: a fake project, three coding agents pointed at the same model on Bedrock through LiteLLM as a gateway, and a single prompt:

just say “hello”, nothing else

I then captured the full request, the response, and the metadata at the highest verbosity level for each harness. The numbers alone are already interesting. The shape of the payloads is even more interesting, and worth discussing.

The conclusion I came away with is this: raw input token count is not a meaningful benchmark, especially when you compare harnesses in their “plain-vanilla” form, without customizations, skills, tools, or integration with each ecosystem.

What really matters is the mix of tools the harness exposes, the steering it embeds into tool descriptions, and how effectively it takes advantage of model and provider features such as prompt caching. Pi is genuinely cheap on a single interaction. It is also the only one that, because of an AWS API constraint, silently skipped prompt caching, and the only one that sends effectively zero guardrails inside its tools — two choices most teams will have to think through before using it with a team and on a real project.

The setup

The fake project is simply a folder containing this AGENTS.md:

This is an example project to make some experiment with coding agents. Key tech stack:
 - python (uv)
 - shell (zsh)
 - aws cli

All three agents were configured to talk to the same model — eu.anthropic.claude-haiku-4-5-20251001-v1:0 on Bedrock — through a LiteLLM proxy with full request and response logging. So: same model, same provider, same tokenizer, same caching backend. The only thing that changes is the harness itself.

Another important detail: for all three harnesses I removed any MCP tool, skill, prompt, or additional agent. So we’re looking at their “out-of-the-box” configuration.

The flow looks like this:


flowchart LR
    subgraph CLI["Coding agent (CLI)"]
        H[Harness:
system prompt + tools + steering]
    end
    H -->|OpenAI-style request| G[LiteLLM Gateway]
    G -->|converse API| B[AWS Bedrock]
    B --> M[Claude Haiku 4.5]
    M --> B --> G --> H
    G -.->|verbose log:
request / response / metadata| L[(JSON files)]

For each agent I saved three artifacts:

request.json — the full body sent to the model (system, tools, messages, caching hints)
response.json — what came back
metadata.json — LiteLLM’s view of usage and costs

You can download the sanitized payloads below if you want to inspect them yourself:

Claude Code — request · response · metadata
OpenCode — request · response · metadata
pi — request · response · metadata

The numbers that jump out immediately

Here is what each harness actually sent for that single “hello” turn:

Harness	Tool	Input token	Token cache-creation	First-turn cost (USD)
Claude Code	27	28.407	28.404	$0,0391
OpenCode	12	12.374	12.371	$0,0170
pi	5	2.768	0	$0,0031

So pi is roughly 10 times cheaper than Claude Code on a cold “hello”, and Claude Code is 2.3x more expensive than OpenCode. If we stopped here, the conclusion would write itself: pi wins, OpenCode is the right compromise, Claude Code is the bloated one. That’s more or less consistent with what I’ve been reading around.

The problem is that the first cold turn is not the regime in which these agents normally operate.

Why the “cheap on paper” number is misleading

Anthropic prompt caching on Bedrock applies a 1.25x surcharge to create a cache entry, and then charges 0.10x of normal input pricing to read it back. That changes the economics dramatically as soon as you move past the first turn in a session.

For Haiku 4.5 (input $1.10/M, cache write $ 1.375/M, cache read $0.11/M), the per-turn cost of sending only system prompt + tools + project context looks like this:

Harness	First turn (cache write)	Steady state (cache read)	Steady-state vs pi
Claude Code	$0,0391	$0,00312	1,01x
OpenCode	$0,0170	$0,00136	0,44x
pi	$0,0031	$0,00310	1,00x

In other words, if we look at a second turn with the same minimal request/response shape, caching kicks in for OpenCode and Claude Code, and at that point OpenCode is cheaper per turn than pi, while Claude Code costs roughly the same as pi even though it sends ten times more tokens across system prompt and tooling.

Obviously, once you scale to real usage with user requests in the hundreds or thousands of tokens, all three harnesses operate in steady state with prompt caching enabled, so this initial gap gets heavily smoothed out. Even so, I would still expect pi to remain cheaper overall in steady state (at least in its “plain-vanilla” version).

Ok, but why didn’t pi use the cache?

The answer is in the official documentation from Anthropic and AWS Bedrock, both of which explain prompt caching. In both cases, the docs clearly say there is a minimum prompt size required before prompt caching activates. For Haiku 4.5, that threshold is 4096 tokens, which is almost twice what pi sent to the LLM.

Is this really overhead?

If we set caching aside, the next question is obvious: where do Claude Code’s 28k tokens come from?

The common and most immediate assumption is that they come from the System Prompt, but that is only partly true.

In practice, while the System Prompt is fairly verbose, it is not the dominant component.

Character/token breakdown across all components

Fortunately, LiteLLM’s logs give us all the detail we need to investigate this properly.

Component	Claude Code	OpenCode	pi
System prompt	~26,5k char (~6,6k tok)	~9,5k char (~2,4k tok)	~3,5k char (~0,9k tok)
Tool catalog (27 / 12 / 5)	~75,6k char (~18,9k tok)	~38,6k char (~9,7k tok)	~4,5k char (~1,1k tok)
User-side context (system reminder, AGENTS.md, prompt)	~4,0k char (~1,0k tok)	~30 char (~7 tok)	~30 char (~7 tok)
Declared total	28.407	12.374	2.768

The System Prompt is fairly long for both Claude Code and OpenCode, but it is about one third of the size of their tool definitions. In both cases, the tool catalog is by far the biggest item in the token budget.

Looking more closely at the tools themselves, two other things stand out:

Number of tools. Claude Code sends 27 tools out of the box, including things like Agent, EnterPlanMode, EnterWorktree, CronCreate, Monitor, PushNotification, RemoteTrigger, ScheduleWakeup, TaskCreate/Update/Stop. Many of these are basic and nearly indispensable product features (sub-agents, cron jobs, scheduled wakeups, notifications). OpenCode is more conservative with 12 tools. Pi exposes only 5 tools: read, bash, edit, write, mcp.
Description length. The same tool name carries radically different weight depending on the harness. The clearest example is bash:

Harness	`bash` description length
Claude Code	10.637 char
OpenCode	9.936 char
pi	248 char

Claude Code and OpenCode both inject roughly 10 KB of policy into a single tool description. Pi’s entire bash description, by contrast, is extremely bare-bones:

Execute a bash command in the current working directory. Returns stdout
and stderr. Output is truncated to last 2000 lines or 50KB (whichever is
hit first). If truncated, full output is saved to a temp file. Optionally
provide a timeout in seconds.

We’re talking about a 40x difference, so at that point the natural question becomes whether those extra 10 KB are really just overhead, or whether in practice they are essential for a modern coding harness.

The guardrails in Claude Code and OpenCode

Reading the bash descriptions side by side, it’s hard to argue that this is token waste. This is not verbose documentation. These are real usage policies, embedded exactly where the model is most likely to use them. For example, here is a representative excerpt from Claude Code’s bash tool:

Avoid using this tool to run cat, head, tail, sed, awk, or echo unless
explicitly instructed... Use Read / Edit / Write instead.

Try to maintain your current working directory throughout the session by
using absolute paths and avoiding usage of cd... never prepend
cd <current-directory> to a git command — git already operates on the
current working tree, and the compound triggers a permission prompt.

Long leading sleep commands are blocked. To poll until a condition is met,
use Monitor with an until-loop... Do not chain shorter sleeps to work
around the block.

Git Safety Protocol:
- NEVER update the git config
- NEVER run destructive git commands (push --force, reset --hard, ...)
  unless the user explicitly requests them
- NEVER skip hooks (--no-verify) ...
- NEVER run force push to main/master ...
- Always create NEW commits rather than amending ...

When staging files, prefer adding specific files by name rather than using
"git add -A" or "git add .", which can accidentally include sensitive
files (.env, credentials) or large binaries

OpenCode’s bash description is structurally similar: directory verification rules, quoting examples, an explicit <good-example>/<bad-example> block for workdir vs cd, the same git safety protocol, the same “no --no-verify” rule, and so on.

Pi’s bash description has none of that. It says “execute a bash command” and tells you the output gets truncated. No concept of git safety, no preference for dedicated read/edit tools, no rule against git add -A, no warning about destructive operations.

You can read that in two ways:

“Pi is clean and lets you build your own policies on top.”
“Pi gives the model a 200-character invitation to do whatever it wants in your shell.”

Both are true. Which one matters depends on who is using it. For a senior engineer driving a sandboxed VM, minimalism is great. For a team where the agent might run on a laptop with production credentials, the lack of prevention around git push --force or warnings about sensitive files is a problem you will eventually rediscover the hard way.

In other words: those 10 KB of tool description in Claude Code and OpenCode are not bloat — they are a mix of Guardrails and steering. They are how the harness encodes the dozens of little “don’t do this” rules that a real engineering team has internalized after watching agents misbehave for two years.

So the comparison is not really “verbose vs lean.” It is “shipped with guardrails vs shipped without them.”

What’s outside the tool catalog

A few smaller observations are worth calling out.

System reminder. Claude Code’s user message is not just just say "hello". It carries two <system-reminder> blocks (~3 KB and ~0.9 KB) that re-inject the list of available skills and the contents of the project’s AGENTS.md / CLAUDE.md, plus today’s date. This is a deliberate context-engineering choice: instead of relying on the model to remember rules buried inside the system prompt, the harness restates the most operationally relevant pieces right next to the user turn. OpenCode bakes the project’s AGENTS.md directly into the system prompt itself (which is why its tool catalog is smaller while its system block is still non-trivial). Pi does the same inside its system block, along with a “Pi documentation” block pointing to local docs.

Skill discovery. All three harnesses inject some notion of “available skills” into the prompt, but they do it differently:

Claude Code lists user-invokable skills in a system reminder so they get rediscovered every turn.
OpenCode lists them under <available_skills> inside the system prompt, with a description of when to load each one.
Pi lists them at the end of the system prompt and asks the model to read SKILL.md files relative to the skills directory.

The patterns differ, but the design is broadly convergent and consistent with what I already knew about skills: they are cheap pointers injected into the prompt, while the actual body is loaded at runtime and on demand.

Memory and persistence. Claude Code’s system prompt includes a substantial section on a file-based memory system (~/.claude/projects/...), with explicit categories (user, feedback, project, reference) and rules on what to save and what not to save. That alone weighs about 5 KB of system prompt. OpenCode has no equivalent block in the payload I captured. Pi has nothing like it.

This is not really a “verbosity” decision. It’s a product decision: Claude Code is shipping a cross-session memory feature and paying the prompt cost required to make it work. If you don’t want the feature, that prompt is overhead. If you do want it, building it on top of pi means paying that same prompt cost yourself.

Tool-choice policy. OpenCode is the only one that explicitly sets tool_choice: auto in the request. Claude Code and pi rely on provider defaults. Minor detail, but worth noting if you’re debugging non-deterministic tool-call behavior.

A small note on the response

The response side is genuinely uninteresting, and that’s exactly the point. All three models replied with hello (Pi capitalized it: Hello) in 4 completion tokens. The only pseudo-finding here is that pi’s response was capitalized even though the prompt explicitly said just say "hello". With a sample size of one and a 4-token response, that’s noise. I’m mentioning it only because I checked that side too and found nothing :).

What I take away from this

1. Token counts in the bigger picture

The interesting comparison is not “how many input tokens does the harness send” but “how many of those tokens are cached, how many encode guardrails, and which product features come along with them.”

2. The verbosity of tool descriptions

The 10 KB that Claude Code and OpenCode place inside their bash descriptions encode git safety, rules around destructive actions, and preferences such as “use the right tool.” This is not “bloat,” and removing it does not make the agent leaner; it makes it less safe. If you want to use a coding agent in the real world, even starting from a minimalist harness like pi, you will end up rebuilding most of those rules anyway — either as an additional system prompt, or in AGENTS.md, or as middleware around the gateway.

3. Pi’s minimalism

Pi’s lightness comes from a default setup that does not survive contact with reality.

I genuinely struggle to imagine working without guardrails on high-risk tools such as bash, but also write and others. Once you start integrating those missing pieces, pi begins to look a lot more like OpenCode, maybe just with fewer default tools.

4. OpenCode remains my favorite

It does not have pi’s minimalism, but it is less verbose than Claude Code, and it has safety guidance in the tool definitions. In my view it shows a similar level of attention to detail as Claude Code, while keeping a system prompt that is roughly one third the size. Compared with Claude Code, what also stands out is the absence of potentially unnecessary product features (sub-agents, scheduling, monitor, memory), and that is exactly why its prompt is smaller. If you don’t need those features, it’s a fair tradeoff.

5. Claude Code: I wouldn’t call it bloat

Even though the System Prompt and tool descriptions are clearly more verbose, most of the extra tokens encode product features and rational design choices: a memory system, scheduled tasks, sub-agents, plan mode, worktree support. Whether those features are worth paying for depends on your needs. Calling the prompt “bloated” without looking at the whole picture feels wrong to me.

Limits of this experiment

It’s worth being explicit about what this is and what it isn’t:

This is a single turn on a fake project. It captures the harnesses’ cold-state behavior, not how each agent behaves on a real coding task with many tool calls and subsequent turns.
I did not run the agents on real coding work, so I am making no claims about output quality, latency, or success rate.
LiteLLM metadata and provider-reported token counts are taken as they are; small accounting differences between Anthropic, Bedrock, and LiteLLM are possible.
All three harnesses are moving fast. Exact tool catalogs, system prompts, and caching defaults will change across releases. Treat the specific numbers here as a May 2026 snapshot.

References

LiteLLM as a unified LLM gateway: docs.litellm.ai
Anthropic prompt caching, including write/read price multipliers: Anthropic docs — Prompt caching
Bedrock prompt caching: AWS Bedrock — Prompt caching
Claude Code: claude.com/claude-code
OpenCode: opencode.ai
pi coding agent https://pi.dev/
Raw payloads from this experiment: claude, opencode, pi