200k vs 1M Claude Code: six prompts, no headline, and one good reason to switch anyway

A friend in my WordPress mastermind switched his Claude Code default from 1M context back to 200k on a hunch that 200k routes to "smarter servers." I built a quick eval harness to actually test it. The data was inconclusive. Here's why I switched anyway — and it's not what the eval was about.

A thread popped off in my WordPress business mastermind Slack today and I ended up writing a small bash harness to A/B test something Jason Coleman said. The eval was inconclusive.

The point of running it wasn’t.

Where the post came from. Jason’s framing about 1M for long horizons and 200k for the “one absolute smartest pass” tasks — that’s the bit I went and tested.

Katie kicked things off asking 4.7 vs 4.6 for coding. Short version of my reply: 4.7 is better generally, and the workflow stuff matters more than the model bump — I’d just published the multi-stage review post the day before, that has more on the workflow side. Where Jason took the conversation is what hooked me.

He’d been quietly defaulting Claude Code back to 200k context. The thing he’d added to his settings:

"env": {
  "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1"
}

Plus a per-call override when he wants the long window for a specific session: CLAUDE_CODE_DISABLE_1M_CONTEXT=0 claude. His reasoning, paraphrased: 200k kind of feels smarter — folk wisdom that 200k routes to better servers. He was careful to caveat: he doesn’t actually know if the model is smarter, and he suspected 1M had earlier routing issues that may already be fixed.

For long running series of small changes, the 1m context is best. But stuff like “this one thing is very tricky and I need the absolute smartest version ever” works better with the 200k context limit.
— Jason Coleman, Paid Memberships Pro

My reply in the thread was literally “trying now.” That’s the eval origin story. “Feels smarter” is the kind of claim placebo eats alive — mood, task variance, time of day, cache state, what you ate for lunch. So instead of trusting the gut, I went and looked.

The test

About 150 lines of bash in my system-config (private repo, sorry). For each prompt: run both modes — CLAUDE_CODE_DISABLE_1M_CONTEXT=1 and =0 — capture both responses, then send the anonymized pair to Sonnet 4.6 as a blind judge with a structured rubric: correctness first, then completeness, then clarity, then brevity. Run order randomized per prompt to defeat any cache-warming bias. Outputs land in a gitignored timestamped directory so I can re-read them later without having to re-run the whole thing.

Six prompts, kept generic on purpose so the test wasn’t accidentally optimized for one of my real codebases: a refactor pass, a debug-this-trace task, a regex explainer, a design comparison, an edge-case enumeration, and a Python → Go translation. Not exhaustive. Not pretending to be.

claude --settings '<json-string>' lets you override just the env block per invocation. No $HOME isolation, no copying settings.json, no shell wrapper.
The implementation trick that made this a 6-minute test instead of an afternoon

What the data said

n=6. Anyone who’s run a real eval is already wincing — that’s fine, I’m wincing too. Here’s what came out:

Mode	Wins
200k	2
1M	1
Ties	3

Six prompts, blind-judged by Sonnet 4.6. Mean scores 4.33 (200k) vs 4.17 (1M) on a 5-point rubric.

The aggregate gap is well inside what a coin flip produces in 6 trials. No detectable quality difference — and that’s a real finding, not a non-finding. “There’s nothing here” beats “feels smarter” every time.

The interesting bit isn’t the average

The most interesting data point is where 200k lost.

On the edge-case-enumeration prompt, 200k confidently claimed parse_duration("1.5h") returns 300 seconds. It actually returns 18000 — the regex matches ("5", "h"), so 5 hours, not half an hour. 1M caught the right answer and surfaced a Unicode-digit edge case that 200k missed entirely. Concrete arithmetic miss, not a stylistic preference.

The two 200k wins were small in comparison. math.pi instead of a hardcoded 3.14159 in a refactor. Idiomatic strings.Cut instead of a manual strings.Index in the Python → Go translation. Polish, not depth. One careful arithmetic miss for 200k, two stylistic flourishes for it. Asymmetry I’d have wanted the other way around.

What Jason was actually right about

The “200k routes to smarter servers” framing is probably wrong. There’s no directional signal across task types in this set, and the place 200k visibly underperformed was a numerically-precise task — the exact thing you’d want the smarter version for.

But Jason’s workflow framing — long-horizon to 1M, sharp focused work to 200k — that one I think is right.

Recent personal evidence: a few weeks back I was doing MCP template optimization work and 1M with Opus 4.7 visibly outperformed 200k. Not subtle. Template structure was the bottleneck and 1M handled it cleanly while 200k kept losing the thread across files. Workload-shaped, not vibes-shaped. The eval didn’t settle the smarter question. What it accidentally settled is that wasn’t the right question.

What I do now

200k as the default. Not because it’s smarter — because the 1M overhead isn’t doing anything for me given how I already work. I /compact at clean checkpoints anyway. I’d rather hit /compact at a point I chose than let context creep me into a soft cap that triggers compaction at a point I didn’t.

For long-horizon loops where compaction would lose meaningful state, or template-heavy work like the MCP optimization — flip back. CLAUDE_CODE_DISABLE_1M_CONTEXT=0 claude. Per-session, no settings.json edit. The toggle is the right unit, not the default.

What this isn’t

n=6. Generic prompts. Single judge model. Sonnet judging Opus reduces self-preference bias but doesn’t eliminate distribution-fit bias. The two cases where the judge picked a clear winner both cited objectively verifiable things — math errors, library API choice — so those I trust. Ties I trust less.

The honest version would be n=30+ pulled from real recent work, with at least two judges, and a category breakdown. At that point you’re spending real money to settle a curiosity, and the curiosity probably doesn’t deserve it.

But “no detectable difference at n=6, with one concrete miss on the 200k side and a workload-dependent caveat that points back at the workflow framing” is a stronger thing to know than “feels smarter.” That’s the point of building the thing.

If you want to run the same test on your own prompts: the harness is about 150 lines of bash, happy to share if anyone DMs me. Swap your real recent tasks into scripts/eval/prompts/ and you’ll get a much more meaningful answer than generic prompts can give. Six minutes of running, and you find out whether the thing you’ve been “feeling” is real or routing-was-flaky-three-weeks-ago noise.

Thanks to Jason and Katie for the thread that started this. Quotes verbatim from the mastermind Slack — Jason, holler if you want the framing tweaked before this lands.