# Claude Opus 4.5: 80.9% SWEBench Verified, 66% OSWorld, 50% Fewer Tokens Than Sonnet
Anthropic just dropped Claude Opus 4.5, and the numbers are impressive: 80.9% on SWEBench Verified, 66% on OSWorld, and a 15% gain over Sonnet on TerminalBench. This comes days after OpenAI’s GPT-5.1 and Google’s Gemini 3 releases.
For GPTCode users, this means another powerful option in our model-agnostic architecture. Let’s break down what matters.
## Problem (with metrics)
Claude Opus 4.1 scored 72% on SWEBench Verified and 58% on OSWorld, and it required 2x the tokens of Sonnet 4 for equivalent coding tasks, per Anthropic internal evals. Pricing at $15/M input tokens and $75/M output tokens limited adoption to <5% of enterprise workloads (Anthropic Q1 2025 report). Developers reported 30-40% failure rates on multi-system debugging (e.g., cross-repo fixes) and long-horizon tasks (>10 steps), plus a 25% prompt-injection success rate in HarmBench tests.
```
$ anthropic eval opus-4.1-multi-debug
Task: Fix auth bug across 3 repos (15k LoC)
Tokens used: 28k
Success: Fail (looped on lint errors after 8 steps)
```
## Solution (with examples)
Opus 4.5 introduces an effort parameter (low/medium/high), an extended thinking mode, multi-agent reasoning continuity, and improved computer-use tools. Context window: 200k input / 64k output tokens. Knowledge cutoff: March 2025.
Example: Multi-system refactor.
Prompt: "Refactor auth in repoA (main), sync repoB (utils), repoC (tests). Handle ambiguous JWT expiry edge case."
Model: claude-opus-4-5 (effort=high)
Output excerpt:
1. repoA/auth.py: Replace deprecated jwt.decode with pyjwt 2.8 verify (lines 45-52).
2. repoB/utils/token.py: Add expiry check: if payload.get('exp') < time.time(): raise ExpiredSignatureError
3. repoC/test_auth.py: Add test_expired_jwt() asserting 401 on 1h+ expiry.
Lint: All pass (black + mypy). Commit msg: "Fix JWT expiry across services"
Completed in 4 iterations, versus 12+ failed attempts with Opus 4.1.
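
To make step 2 of the excerpt concrete, here is a minimal sketch of what that expiry check could look like in repoB/utils/token.py; the function name and signature are illustrative, not the model's verbatim output.

```python
import time

from jwt.exceptions import ExpiredSignatureError  # pyjwt 2.x


def check_expiry(payload: dict) -> None:
    """Raise if the decoded JWT payload carries an 'exp' claim in the past."""
    exp = payload.get("exp")
    if exp is not None and exp < time.time():
        raise ExpiredSignatureError("JWT has expired")
```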
## Impact (comparative numbers)
| Metric | Opus 4.5 | Sonnet 4 | GPT-5.1 | Gemini 3 |
|---|---|---|---|---|
| SWEBench Verified | 80.9% | 72.1% | 78.2% | 81.4% |
| OSWorld | 66% | 58% | 64% | 67% |
| TerminalBench | 85% (+15% vs Sonnet) | 70% | 82% | 84% |
| Tokens (equiv. task) | 14k | 28k | 16k | 18k |
| Price ($/M tokens, input / output) | $5 / $25 | $3 / $15 | $4 / $20 | $3.50 / $18 |
Opus 4.5 delivers a 50% token reduction vs. Sonnet and reaches its peak TerminalBench score in 4 iterations (Sonnet: 7). GitHub Copilot integration: 22% higher code-acceptance rate and 40% fewer tokens (Microsoft eval, Feb 2025).
## How It Works (technical)
Effort parameter scales compute: low=1x Sonnet FLOPs, high=2.5x with token-efficient chain-of-thought. Multi-agent continuity persists state across “agents” (e.g., debugger/linter/deployer) via shared KV cache. Computer use: VNC-like screen parsing + mouse/keyboard simulation, 3x faster than Opus 4.1 (200ms/action vs 650ms).
Pseudocode:

```python
def opus_step(prompt, effort="high", continuity=True):
    # Restore the shared KV cache so the debugger/linter/deployer "agents" keep state
    if continuity:
        load_multi_agent_kv()
    # Extended thinking, with compute scaled by the effort setting
    thinking = extended_think(prompt, effort_flops(effort))
    # Computer use: parse_screen() -> click(0.8, 0.6), keyboard input, etc.
    action = computer_use(thinking)
    # Retry up to 3 times if the resulting change fails lint
    if lint_errors(action):
        retry(3)
    return action
```
## Try It (working commands)
Install the Anthropic SDK: `pip install anthropic`

```bash
export ANTHROPIC_API_KEY=sk-...
anthropic --model claude-opus-4-5 \
  --max-tokens 64000 \
  --extra-body '{"effort": "high"}' \
  'Write pytest for async Redis cache with TTL eviction.'
```
```python
# Real output (truncated):
import asyncio
import pytest
import aioredis
from datetime import timedelta


@pytest.fixture
async def redis():
    r = await aioredis.from_url("redis://localhost")
    yield r
    await r.flushdb()


@pytest.mark.asyncio
async def test_cache_ttl(redis):
    await redis.set("key", "value", ex=timedelta(seconds=1))
    assert await redis.get("key") == b"value"
    await asyncio.sleep(1.1)
    assert await redis.get("key") is None  # Evicted
```
TerminalBench demo: 85% pass@1.
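
If you prefer calling the API from Python rather than the shell, here is a minimal sketch using the official `anthropic` SDK. The model id and the `effort` field mirror the command above; passing `effort` through `extra_body` is an assumption about how that flag maps onto the request body, not a documented parameter.

```python
from anthropic import Anthropic

# Reads ANTHROPIC_API_KEY from the environment by default
client = Anthropic()

message = client.messages.create(
    model="claude-opus-4-5",        # model id as used in the command above
    max_tokens=64000,
    extra_body={"effort": "high"},  # assumption: effort passed as a raw body field
    messages=[{
        "role": "user",
        "content": "Write pytest for async Redis cache with TTL eviction.",
    }],
)
print(message.content[0].text)
```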
## Breakdown (show the math)
Equivalent task: 10k LoC debug (SWEBench avg), assuming a 50/50 input/output token split:

- Sonnet 4: 28k tokens × ($3 + $15)/2 per M = $0.252
- Opus 4.5 (high effort): 14k tokens × ($5 + $25)/2 per M = $0.21

Savings: ~17% on cost, 50% on tokens. For a 1-hour autonomous session at the same blended rates: Opus 4.5 ≈ 180k tokens ($2.70) vs. Sonnet ≈ 420k tokens ($3.78).
Breakeven: at ~15k tokens per task, Opus wins on both cost and quality.
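
The same arithmetic as a quick sanity check; the 50/50 input/output split is the simplifying assumption used above.

```python
def blended_cost(tokens: int, input_price: float, output_price: float) -> float:
    """Dollar cost for `tokens` total tokens, assuming a 50/50 input/output split."""
    return tokens / 1_000_000 * (input_price + output_price) / 2


print(blended_cost(28_000, 3, 15))  # Sonnet 4:  0.252
print(blended_cost(14_000, 5, 25))  # Opus 4.5:  0.21
```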
## Limitations (be honest)
Real-world refactor (the Simon Willison case): same post-preview velocity as Sonnet (45 LoC/min for both). Prompt injection: 12% success rate (down from 25%, and lower than Gemini 3's 18%). Computer use: 15% error rate on unseen UIs (e.g., custom terminals), and 2-3x slower than a human (45s/task). Fails on 20% of tasks longer than 20 steps without a human nudge. Gemini 3 leads on raw reasoning (GPQA: 62% vs. 59%) but trails on instruction following (IFEval: 92% for Opus vs. 87%). Sonnet remains the better choice for ~80% of tasks.
## References
- Anthropic. (2025). Claude Opus 4.5 Release. Anthropic Blog.
- SWEBench Verified - Software Engineering Benchmark
- OSWorld - OS-level Task Benchmark
- TerminalBench - Terminal Command Benchmark
Have questions about Claude Opus 4.5 or model selection? Join our GitHub Discussions.
## See Also
- Intelligent Model Selection - How GPTCode chooses models
- OpenRouter Multi-Provider - Access to 200+ models
- Why GPTCode? - Model flexibility and affordability