A thread popped off in my WordPress business mastermind Slack today and I ended up writing a small bash harness to A/B test something Jason Coleman said. The eval was inconclusive.
The point of running it wasn’t.
Katie kicked things off asking 4.7 vs 4.6 for coding. Short version of my reply: 4.7 is better generally, and the workflow stuff matters more than the model bump — I’d just published the multi-stage review post the day before, that has more on the workflow side. Where Jason took the conversation is what hooked me.
He’d been quietly defaulting Claude Code back to 200k context. The thing he’d added to his settings:
"env": {
"CLAUDE_CODE_DISABLE_1M_CONTEXT": "1"
}
Plus a per-call override when he wants the long window for a specific session: CLAUDE_CODE_DISABLE_1M_CONTEXT=0 claude. His reasoning, paraphrased: 200k kind of feels smarter — folk wisdom that 200k routes to better servers. He was careful to caveat: he doesn’t actually know if the model is smarter, and he suspected 1M had earlier routing issues that may already be fixed.
For long running series of small changes, the 1m context is best. But stuff like “this one thing is very tricky and I need the absolute smartest version ever” works better with the 200k context limit.
— Jason Coleman, Paid Memberships Pro
My reply in the thread was literally “trying now.” That’s the eval origin story. “Feels smarter” is the kind of claim placebo eats alive — mood, task variance, time of day, cache state, what you ate for lunch. So instead of trusting the gut, I went and looked.
The test
About 150 lines of bash in my system-config (private repo, sorry). For each prompt: run both modes — CLAUDE_CODE_DISABLE_1M_CONTEXT=1 and =0 — capture both responses, then send the anonymized pair to Sonnet 4.6 as a blind judge with a structured rubric: correctness first, then completeness, then clarity, then brevity. Run order randomized per prompt to defeat any cache-warming bias. Outputs land in a gitignored timestamped directory so I can re-read them later without having to re-run the whole thing.
Six prompts, kept generic on purpose so the test wasn’t accidentally optimized for one of my real codebases: a refactor pass, a debug-this-trace task, a regex explainer, a design comparison, an edge-case enumeration, and a Python → Go translation. Not exhaustive. Not pretending to be.
The implementation trick that made this a 6-minute test instead of an afternoon
claude --settings '<json-string>'lets you override just the env block per invocation. No$HOMEisolation, no copying settings.json, no shell wrapper.
What the data said
n=6. Anyone who’s run a real eval is already wincing — that’s fine, I’m wincing too. Here’s what came out:
| Mode | Wins |
|---|---|
| 200k | 2 |
| 1M | 1 |
| Ties | 3 |
The aggregate gap is well inside what a coin flip produces in 6 trials. No detectable quality difference — and that’s a real finding, not a non-finding. “There’s nothing here” beats “feels smarter” every time.
The interesting bit isn’t the average
The most interesting data point is where 200k lost.
On the edge-case-enumeration prompt, 200k confidently claimed parse_duration("1.5h") returns 300 seconds. It actually returns 18000 — the regex matches ("5", "h"), so 5 hours, not half an hour. 1M caught the right answer and surfaced a Unicode-digit edge case that 200k missed entirely. Concrete arithmetic miss, not a stylistic preference.
The two 200k wins were small in comparison. math.pi instead of a hardcoded 3.14159 in a refactor. Idiomatic strings.Cut instead of a manual strings.Index in the Python → Go translation. Polish, not depth. One careful arithmetic miss for 200k, two stylistic flourishes for it. Asymmetry I’d have wanted the other way around.
What Jason was actually right about
The “200k routes to smarter servers” framing is probably wrong. There’s no directional signal across task types in this set, and the place 200k visibly underperformed was a numerically-precise task — the exact thing you’d want the smarter version for.
But Jason’s workflow framing — long-horizon to 1M, sharp focused work to 200k — that one I think is right.
Recent personal evidence: a few weeks back I was doing MCP template optimization work and 1M with Opus 4.7 visibly outperformed 200k. Not subtle. Template structure was the bottleneck and 1M handled it cleanly while 200k kept losing the thread across files. Workload-shaped, not vibes-shaped. The eval didn’t settle the smarter question. What it accidentally settled is that wasn’t the right question.
What I do now
200k as the default. Not because it’s smarter — because the 1M overhead isn’t doing anything for me given how I already work. I /compact at clean checkpoints anyway. I’d rather hit /compact at a point I chose than let context creep me into a soft cap that triggers compaction at a point I didn’t.
For long-horizon loops where compaction would lose meaningful state, or template-heavy work like the MCP optimization — flip back. CLAUDE_CODE_DISABLE_1M_CONTEXT=0 claude. Per-session, no settings.json edit. The toggle is the right unit, not the default.
What this isn’t
n=6. Generic prompts. Single judge model. Sonnet judging Opus reduces self-preference bias but doesn’t eliminate distribution-fit bias. The two cases where the judge picked a clear winner both cited objectively verifiable things — math errors, library API choice — so those I trust. Ties I trust less.
The honest version would be n=30+ pulled from real recent work, with at least two judges, and a category breakdown. At that point you’re spending real money to settle a curiosity, and the curiosity probably doesn’t deserve it.
But “no detectable difference at n=6, with one concrete miss on the 200k side and a workload-dependent caveat that points back at the workflow framing” is a stronger thing to know than “feels smarter.” That’s the point of building the thing.
If you want to run the same test on your own prompts: the harness is about 150 lines of bash, happy to share if anyone DMs me. Swap your real recent tasks into scripts/eval/prompts/ and you’ll get a much more meaningful answer than generic prompts can give. Six minutes of running, and you find out whether the thing you’ve been “feeling” is real or routing-was-flaky-three-weeks-ago noise.
Thanks to Jason and Katie for the thread that started this. Quotes verbatim from the mastermind Slack — Jason, holler if you want the framing tweaked before this lands.