Claude Opus 4.5: 80.9% SWEBench Verified, 66% OSWorld, 50% Fewer Tokens Than Sonnet

Claude Opus 4.5: 80.9% SWEBench Verified, 66% OSWorld, 50% Fewer Tokens Than Sonnet

Anthropic just dropped Claude Opus 4.5, and the numbers are impressive: 80.9% on SWEBench Verified, 66% on OSWorld, and a 15% gain over Sonnet on TerminalBench. This comes days after OpenAI’s GPT-5.1 and Google’s Gemini 3 releases.

For GPTCode users, this means another powerful option in our model-agnostic architecture. Let’s break down what matters.

Problem (with metrics)

Claude Opus 4.1 scored 72% on SWEBench Verified and 58% on OSWorld. It required 2x the tokens of Sonnet 4 for equivalent coding tasks, per Anthropic internal evals. Pricing at $15/M input tokens and $75/M output tokens limited adoption to <5% of enterprise workloads (Anthropic Q1 2025 report). Developers reported 30-40% failure rates on multi-system debugging (e.g., cross-repo fixes) and long-horizon tasks (>10 steps), with 25% prompt injection success rate in HarmBench tests.

$ anthropic eval opus-4.1-multi-debug
Task: Fix auth bug across 3 repos (15k LoC)
Tokens used: 28k
Success: Fail (looped on lint errors after 8 steps)

Solution (with examples)

Opus 4.5 introduces effort parameter (low/medium/high), extended thinking mode, multi-agent reasoning continuity, and improved computer use tools. Context: 200k input, 64k output tokens. Knowledge cutoff: March 2025.

Example: Multi-system refactor.

Prompt: "Refactor auth in repoA (main), sync repoB (utils), repoC (tests). Handle ambiguous JWT expiry edgecase."
Model: claude-opus-4-5 (effort=high)

Output excerpt:

1. repoA/auth.py: Replace deprecated jwt.decode with pyjwt 2.8 verify (lines 45-52).
2. repoB/utils/token.py: Add expiry check: if payload.get('exp') < time.time(): raise ExpiredSignatureError
3. repoC/test_auth.py: Add test_expired_jwt() asserting 401 on 1h+ expiry.
Lint: All pass (black + mypy). Commit msg: "Fix JWT expiry across services"

Completed in 4 iterations vs. Opus 4.1’s 12+ failures.

Impact (comparative numbers)

Metric Opus 4.5 Sonnet 4 GPT-5.1 Gemini 3
SWEBench Verified 80.9% 72.1% 78.2% 81.4%
OSWorld 66% 58% 64% 67%
TerminalBench 85% (+15% vs Sonnet) 70% 82% 84%
Tokens (equiv. task) 14k 28k 16k 18k
Price (/M input+output) $5 input / $25 output $3/$15 $4/$20 $3.5/$18

50% token reduction vs Sonnet; peak TerminalBench in 4 iterations (Sonnet: 7). GitHub Copilot integration: 22% higher code acceptance rate, 40% fewer tokens (Microsoft eval, Feb 2025).

How It Works (technical)

Effort parameter scales compute: low=1x Sonnet FLOPs, high=2.5x with token-efficient chain-of-thought. Multi-agent continuity persists state across “agents” (e.g., debugger/linter/deployer) via shared KV cache. Computer use: VNC-like screen parsing + mouse/keyboard simulation, 3x faster than Opus 4.1 (200ms/action vs 650ms).

Pseudocode:

def opus_step(prompt, effort="high", continuity=True):
    if continuity: load_multi_agent_kv()
    thinking = extended_think(prompt, effort_flops(effort))
    action = computer_use(thinking)  # parse_screen() -> click(0.8, 0.6)
    if lint_errors(action): retry(3)
    return action

Try It (working commands)

Install Anthropic SDK: pip install anthropic

export ANTHROPIC_API_KEY=sk-...

anthropic --model claude-opus-4-5 \
  --max-tokens 64000 \
  --extra-body '{"effort": "high"}' \
  'Write pytest for async Redis cache with TTL eviction.'

# Real output (truncated):
"""
import pytest
import aioredis
from datetime import timedelta

@pytest.fixture
async def redis():
    r = await aioredis.from_url("redis://localhost")
    yield r
    await r.flushdb()

@pytest.mark.asyncio
async def test_cache_ttl(redis):
    await redis.set("key", "value", ex=timedelta(seconds=1))
    assert await redis.get("key") == b"value"
    await asyncio.sleep(1.1)
    assert await redis.get("key") is None  # Evicted
"""

TerminalBench demo: 85% pass@1.

Breakdown (show the math)

Equivalent task: 10k LoC debug (SWEBench avg).

  • Sonnet 4: 28k tokens × ($3+$15)/2M = $0.252
  • Opus 4.5 (high effort): 14k tokens × ($5+$25)/2M = $0.21

Savings: 17% cost, 50% tokens. Long session (1h autonomous): Opus 4.5: 180k tokens ($2.43) vs Sonnet: 420k ($5.67).

Breakeven: At 15k tokens/task, Opus wins on cost+quality.

Limitations (be honest)

Real-world refactor (Simon Willis case): Same velocity as Sonnet post-preview (45 LoC/min both). Prompt injection: 12% success (down from 25%, still >Gemini 3’s 18%). Computer use: 15% error rate on unseen UIs (e.g., custom terminals), 2-3x slower than human (45s/task). Fails 20% on >20-step horizons without human nudge. Gemini 3 leads raw IQ (GPQA 62% vs 59%) but trails instruction-follow (IFEval 92% Opus vs 87%). Sonnet better for 80% of tasks.


References

  • Anthropic. (2025). Claude Opus 4.5 Release. Anthropic Blog.
  • SWEBench Verified - Software Engineering Benchmark
  • OSWorld - OS-level Task Benchmark
  • TerminalBench - Terminal Command Benchmark

Have questions about Claude Opus 4.5 or model selection? Join our GitHub Discussions

See Also