Best LLMs for Coding — 2026 Rankings
Coding LLM Leaderboard
Which AI model writes the best code? We rank every major LLM — open and closed source — across software engineering, code generation, competitive programming, and agentic coding benchmarks.
Roshan Desai · Last updated: 2026-02-24
Tier | Model | Params
S | Claude Opus 4.6 | N/A
S | GPT-5.2 | N/A
S | Kimi K2.5 | 1T
S | MiniMax M2.5 | 230B
A | Claude Sonnet 4.6 | N/A
A | Gemini 3 Pro | N/A
A | Qwen 3.5 | 397B
A | Step-3.5-Flash | 196B
A | GLM-5 | 744B
A | MiMo-V2-Flash | 309B
A | Mistral Large | 675B
B | DeepSeek V3.2 | 685B
B | DeepSeek R1 | 671B
C | GPT-oss 120B | 117B
C | Qwen2.5-Coder-32B | 32B
C | Nemotron Ultra 253B | 253B
D | DeepSeek V3 | 671B
D | Llama 4 Maverick | 400B
D | Grok 3 | N/A
Cost vs. Coding Performance
Which models give you the best coding performance for the price? The top-left of the chart is the sweet spot: high performance at low cost. Models without pricing data are excluded. (A quick way to compute this ranking yourself is sketched below.)
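The sketch below scores each model as SWE-bench Verified points per blended dollar, using a handful of rows from the table further down. The 20/80 input/output token blend is our illustrative assumption, not necessarily the weighting the chart uses.

```python
# Rank models by coding performance per dollar. Scores and prices are
# from the benchmark table below; the 20/80 input/output blend is an
# illustrative assumption, not this page's actual chart weighting.

models = {
    # name: (SWE-bench Verified, input $/1M tok, output $/1M tok)
    "Claude Opus 4.6": (80.8, 15.00, 75.00),
    "GPT-5.2":         (80.0,  2.00,  8.00),
    "Gemini 3 Pro":    (78.0,  1.25, 10.00),
    "DeepSeek V3.2":   (67.8,  0.28,  0.42),
    "MiniMax M2.5":    (80.2,  0.30,  1.20),
    "Step-3.5-Flash":  (74.4,  0.10,  0.30),
}

def blended_price(inp: float, out: float) -> float:
    """Assumed blend: 20% input tokens, 80% output tokens."""
    return 0.2 * inp + 0.8 * out

for name, (score, inp, out) in sorted(
    models.items(),
    key=lambda kv: kv[1][0] / blended_price(kv[1][1], kv[1][2]),
    reverse=True,
):
    value = score / blended_price(inp, out)
    print(f"{name:<16} SWE-bench {score:4.1f}  value {value:6.1f} pts/$")
```

Under these assumptions the cheap open-weight models (Step-3.5-Flash, DeepSeek V3.2) dominate on value, while Claude Opus 4.6 buys its top-tier scores at roughly fifty times their blended price.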
Best Coding LLMs by Benchmark
How does each model perform on real-world software engineering, code generation, competitive programming, and terminal-based coding tasks?
Best at Software Engineering
Real-world software engineering tasks (SWE-bench Verified)
Best for Code Generation
Python code generation from docstrings (HumanEval; an example task follows below)
Best in Competitive Coding
Competitive programming problems (LiveCodeBench)
Best at Terminal Coding
Agentic terminal coding tasks (Terminal-Bench 2.0)
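To make "code generation from docstrings" concrete, here is the shape of a HumanEval-style task, paraphrased from the suite's first problem. The model sees only the signature and docstring; the grader runs unit tests against the generated body and reports pass@1.

```python
# HumanEval-style task, paraphrased from the suite's first problem.
# The model is shown the signature + docstring and must fill in the
# body; the benchmark then runs unit tests against the completion.

def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other
    than the given threshold."""
    for i, a in enumerate(numbers):   # reference solution a passing
        for b in numbers[i + 1:]:     # model would be expected to emit
            if abs(a - b) < threshold:
                return True
    return False

# Grader-style checks (pass@1 means the first sample must pass all):
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```

SWE-bench Verified and Terminal-Bench 2.0 sit at the opposite end of the difficulty spectrum: instead of completing one function, the model must navigate a real repository or shell session across many steps.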
Coding Benchmark Scores & Pricing
Complete coding benchmark results and pricing for every model (prices are per million tokens; scores are percentages). A worked cost example follows the table.
Model | Provider | Params | Context | Input $/1M | Output $/1M | MMLU-Pro | GPQA Diamond | SWE-bench Verified | HumanEval | LiveCodeBench | Terminal-Bench 2.0
Claude Opus 4.6 | Anthropic | N/A | 200K | $15.00 | $75.00 | 82.0 | 91.3 | 80.8 | 95.0 | 76.0 | 65.4
Claude Sonnet 4.6 | Anthropic | N/A | 200K | $3.00 | $15.00 | 79.1 | 89.9 | 79.6 | 92.1 | 72.4 | 59.1
DeepSeek R1 | DeepSeek | 671B | 128K | $0.28 | $0.42 | 84.0 | 71.5 | 49.2 | 90.2 | 65.9 | N/A
DeepSeek V3 | DeepSeek | 671B | 128K | $0.28 | $1.10 | 81.2 | 68.4 | 38.8 | N/A | 49.2 | N/A
DeepSeek V3.2 | DeepSeek | 685B | 130K | $0.28 | $0.42 | 85.0 | 79.9 | 67.8 | N/A | 74.1 | 39.6
Gemini 3 Pro | Google | N/A | 1M | $1.25 | $10.00 | 85.0 | 91.9 | 78.0 | 93.0 | 81.3 | 56.2
GLM-5 | Zhipu AI | 744B | 200K | N/A | N/A | 70.4 | 86.0 | 77.8 | 90.0 | 52.0 | 56.2
GPT-5.2 | OpenAI | N/A | 128K | $2.00 | $8.00 | N/A | 93.2 | 80.0 | 95.0 | 80.0 | 64.7
GPT-oss 120B | OpenAI | 117B | 128K | N/A | N/A | 90.0 | 80.9 | 62.4 | 88.3 | 60.0 | 18.7
Grok 3 | xAI | N/A | 131K | $3.00 | $15.00 | N/A | 84.6 | 49.0 | 94.5 | 79.4 | 52.0
Kimi K2.5 | Moonshot | 1T | 262K | N/A | N/A | 87.1 | 87.6 | 76.8 | 99.0 | 85.0 | 50.8
Llama 4 Maverick | Meta | 400B | 1M | N/A | N/A | 80.5 | 69.8 | N/A | 62.0 | 43.4 | N/A
MiMo-V2-Flash | Xiaomi | 309B | 262K | N/A | N/A | 84.9 | 83.7 | 73.4 | 84.8 | 80.6 | 38.5
MiniMax M2.5 | MiniMax | 230B | 205K | $0.30 | $1.20 | 76.5 | 85.2 | 80.2 | 89.6 | 65.0 | 42.2
Mistral Large | Mistral | 675B | 256K | N/A | N/A | N/A | 43.9 | N/A | 92.0 | 82.8 | N/A
Nemotron Ultra 253B | Nvidia | 253B | 128K | N/A | N/A | N/A | 76.0 | N/A | N/A | 66.3 | N/A
Qwen 3.5 | Qwen | 397B | 262K | N/A | N/A | 87.8 | 88.4 | 76.4 | N/A | 83.6 | 52.5
Qwen2.5-Coder-32B | Qwen | 32B | 131K | N/A | N/A | N/A | N/A | N/A | 92.7 | 43.2 | N/A
Step-3.5-Flash | Stepfun | 196B | 256K | $0.10 | $0.30 | N/A | N/A | 74.4 | 81.1 | 86.4 | 51.0
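Per-million-token prices are hard to intuit, so here is the arithmetic for a single agentic coding request. The prices come from the table above; the 30K-input / 4K-output request size is an assumption chosen for illustration.

```python
# Cost of one hypothetical coding request. Prices are from the table
# above; the token counts (30K in / 4K out) are illustrative only.

PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "Claude Opus 4.6": (15.00, 75.00),
    "GPT-5.2":         (2.00, 8.00),
    "DeepSeek V3.2":   (0.28, 0.42),
    "Step-3.5-Flash":  (0.10, 0.30),
}

IN_TOKENS, OUT_TOKENS = 30_000, 4_000  # assumed request size

for model, (p_in, p_out) in PRICES.items():
    cost = IN_TOKENS / 1e6 * p_in + OUT_TOKENS / 1e6 * p_out
    print(f"{model:<16} ${cost:.4f} per request")
```

At that request size Claude Opus 4.6 works out to about $0.75 per request versus roughly $0.01 for DeepSeek V3.2, which is exactly the spread the cost-performance chart visualizes.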
Compare Coding LLMs Head-to-Head
As an example pairing, here is how Claude Opus 4.6 and GPT-5.2 compare across coding and reasoning benchmarks.

Benchmark | Claude Opus 4.6 | GPT-5.2
GPQA Diamond | 91.3 | 93.2
SWE-bench Verified | 80.8 | 80.0
HumanEval | 95.0 | 95.0
LiveCodeBench | 76.0 | 80.0
Terminal-Bench 2.0 | 65.4 | 64.7
Benchmarks won | 2 | 2
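The "Benchmarks won" tally counts strict wins only, so the HumanEval tie (95.0 apiece) goes to neither model. A few lines reproduce the 2-vs-2 result from the scores above:

```python
# Reproduce the head-to-head tally: count strict wins per model,
# ignoring ties (HumanEval is 95.0 vs 95.0 here).

scores = {  # benchmark: (Claude Opus 4.6, GPT-5.2)
    "GPQA Diamond":       (91.3, 93.2),
    "SWE-bench Verified": (80.8, 80.0),
    "HumanEval":          (95.0, 95.0),
    "LiveCodeBench":      (76.0, 80.0),
    "Terminal-Bench 2.0": (65.4, 64.7),
}

wins_a = sum(a > b for a, b in scores.values())
wins_b = sum(b > a for a, b in scores.values())
print(f"Claude Opus 4.6 {wins_a} vs GPT-5.2 {wins_b}")  # -> 2 vs 2
```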
Try These Models in Onyx
Onyx is the open-source AI platform that lets you connect any of these LLMs to your team's docs, apps, and people.