Rethinking My AI Coding Setup: Cold Starts, Pi, and Lazy-Loading

Every token in the system prompt is a token you wait on before the model responds to anything you actually typed, and it’s also a token you’re paying for on every single request whether you use it or not. I got interested in this after trying to reduce prefill latency on OpenCode, where cold start for my setup sat around 15,000 tokens before I’d typed a word. pi’s whole pitch is the opposite of that, so I expected switching to make the habit of tracking this unnecessary. It didn’t. After setting it up with packages and skills, my own cold start on pi was back up to 11,370.

The argument for less

Two posts by Mario Zechner, the creator of pi, shaped how I think about this now. The first argues that most MCP servers are overkill: a typical browser-automation server ships with 20+ tools and 13,000-18,000 tokens of schema, when a handful of bash scripts (start the browser, navigate, evaluate JS, screenshot) do a similar job for a fraction of the cost.

The second walks through pi’s own design. The system prompt plus the four default tools (read, write, edit, bash) comes in under 1,000 tokens total. No built-in todo list, no plan mode, no sub-agents, no MCP support. The argument, backed by benchmarks in the post, is that frontier models are trained enough to understand what a coding agent is without 10,000 tokens of scaffolding explaining it to them. You add what you need. You don’t pay for what you don’t use.

Under 1,000 tokens is where I started. It is not where I ended up.

What I’ve installed

The idea behind pi is that it starts minimal, but you can augment it as your workflow and preferences demand. As part of this, there is a Package Catalog where you can find new capabilities for pi. I use a few:

pi-web-access for search and content extraction.
pi-agent-browser-native for more complex browser automation.
pi-fork for spawning child agents on noisy investigation work (needs installing from linked GitHub repo).
pi-codex-goal for goal tracking that survives resume and compaction.
pi-observational-memory to keep the agent from forgetting decisions across a long session’s compactions.
@hyperbolic/pi-hypa - for deterministic, rule-based compression of shell, file, and search output before it reaches the model (no LLM summarizing involved).

None of these are wasteful on their own. Installed together and loaded on every session regardless of whether that session needs them, they’re most of the 11,370 tokens.

CLI and scripts over MCP, skills over frameworks

The pattern I’ve settled on for anything new: one action with a few parameters gets a bash script and a skill documenting it. An entire API surface with auth and dozens of operations gets MCP, connected on demand rather than loaded at startup.

Almost all the skills in my harness set disable-model-invocation: true in their front matter, which keeps them out of the context on every session. The agent can’t invoke one on its own because it never sees the description, but I can still trigger it with a prompt or /skill:skill-name.

The YouTube downloader is the first custom skill worth noting. The popular yt-dlp CLI has enough surface area (format selection, cookies for age-gated videos, output paths) that letting the agent (or, let’s be honest, a human) figure out the right flags means a bit of trial and error. So I had my agent write a standalone Python script that uses yt-dlp and abstracts my common use-cases into fixed functions. The script encodes that logic once, so the agent runs it correctly the first time instead of debugging its way there. The SKILL.md is just the manual for that script: which functions map to which scenario, what to confirm with me before running. No server to maintain, no schema to keep in sync, and none of it sits in the prompt except while the skill is loaded.

Linear is the second interesting skill, and it’s the one that complicates the CLI-over-MCP framing above, so it’s worth being precise about what’s actually happening. Issues, projects, cycles, milestones, comments, labels, diffs: writing scripts for all of that means maintaining a pile of overlapping auth and error handling, so I didn’t. I still use the real Linear MCP server, I just don’t register it with pi at session start. Instead I call it through Apify’s mcpc MCP CLI client, which opens the connection, invokes a single tool, and exits. The distinction that actually matters isn’t “MCP vs. not”; it’s whether a tool’s schema sits in the system prompt for the whole session or only gets paid for at the moment it’s used. mcpc gets me the second without writing my own client. The custom Linear skill is just a short tutorial on mapping my intent to mcpc invocations and Linear’s tool names.

Workflows got the same treatment. I adapted three process skills (brainstorming, systematic debugging, writing plans) from obra/superpowers, a set of skills to enforce a standard software development workflow. pi’s skill system is just markdown injected into the prompt, so I stripped these to their essential instructions with no dependency on each other, and turned them off by default instead of always-loaded. They get invoked manually when a task actually calls for that workflow. The superpowers repo has a dozen skills; I use three, trimmed, and the other nine simply aren’t in my prompt, same logic as the MCP servers I don’t load at startup.

The numbers

Bare pi: under 1,000 tokens. My cold start with everything installed: roughly 11,370. That gap is entirely self-inflicted, six packages and several skills, all loaded whether or not a given session touches a browser, a fork, or a goal tracker.

I didn’t want to uninstall anything, but I did want to stop paying for tools sitting idle. So I had pi build a lazy-loading extension, ~/.pi/agent/extensions/lazy-tools.ts, that makes the heaviest tools opt-in instead of default-on: fork, agent_browser, and the three pi-codex-goal tools (create_goal, get_goal, update_goal). Together they account for most of the gap between pi’s 1,000-token floor and my 11,370-token cold start. The extension does six things:

A session_start hook diffs the active tool list against that heavy list and strips matches before the first prompt is assembled, so the leaner prompt is what the model sees from turn one, not something patched in after the fact.
A lightweight enable_tool gateway, roughly 100 tokens, stays registered so the model always has a way to ask for more. Its parameter schema is a union over the three nicknames (fork, browser, goals), which map to one or several actual tool names. Enabling goals, for instance, restores all three goal tools at once. Ask for a nickname that doesn’t exist and it hands back the real list of options instead of a dead end.
Slash commands for me (/enable fork, /enable browser, /disable all) to load a tool ahead of time when I already know a session will need it, or to tear everything back down mid-session once I’m done with it.

The session-start hook is the core of it:

const HEAVY: Record<string, string[]> = {
  fork: ['fork'],
  browser: ['agent_browser'],
  goals: ['create_goal', 'get_goal', 'update_goal'],
}

pi.on('session_start', async (_event, ctx) => {
  const active = pi.getActiveTools()
  const lean = active.filter(t => !Object.values(HEAVY).flat().includes(t))
  if (lean.length < active.length) {
    pi.setActiveTools(lean)
  }
})

Enabling a heavy tool only affects the session you enabled it in. The next fresh session starts stripped again. That’s the point, not a gap to patch.

A fresh session that might need forking later but doesn’t yet now starts around 4,350 tokens instead of 11,370, a 62% cut. That’s still more than four times pi’s own floor, because Hypa, web access, and observational memory stay active by default and I’m fine paying for those every session. Enabling a heavy tool costs for the rest of that session only.

Being intentional

pi starting under 1,000 tokens doesn’t mean much if I spend the next six months undoing it one package at a time. The lazy loader is a fix for how I use pi, not for pi itself, and it’s the same discipline that OpenCode never forced on me at 15,000 tokens. Less scaffolding riding along by default, more room left for the part of the context that’s actually about the problem in front of me.