Best LLMs for Coding — 2026 Rankings

Coding LLM Leaderboard

Which AI model writes the best code? We rank every major LLM — open and closed source — across software engineering, code generation, competitive programming, and agentic coding benchmarks.

Roshan Desai · Last updated: 2026-03-24

LLM Leaderboard (All Models)

→

Open Source LLM Leaderboard

→

Self-Hosted LLM Leaderboard

→

Claude Opus 4.6

N/A

GPT-5.4

N/A

Kimi K2.5

MiniMax M2.5

230B

Claude Sonnet 4.6

N/A

Gemini 3.1 Pro

N/A

Qwen 3.5

397B

Step-3.5-Flash

196B

GLM-5

744B

MiMo-V2-Flash

309B

Mistral Large

675B

DeepSeek V3.2

685B

DeepSeek R1

671B

GPT-oss 120B

117B

Nemotron Ultra 253B

253B

DeepSeek V3

671B

Llama 4 Maverick

400B

Grok 3

N/A

EnterpriseRAG-Bench · 500K+ docs · 9 enterprise sources · MIT

See how your RAG solution stacks up on real company data

See how we did it

Cost vs. Coding Performance

Which models give you the best coding performance for the price? Top-left is the sweet spot — high performance, low cost.

Top-left = best value (high performance, low cost). Models without pricing data are excluded.

Best Coding LLMs by Benchmark

How does each model perform on real-world software engineering, code generation, competitive programming, and terminal-based coding tasks? Hover any bar for details.

Best at Software Engineering

Real-world software engineering tasks (SWE-bench Verified)

Best for Code Generation

Python code generation from docstrings (HumanEval)

Best in Competitive Coding

Real-world coding problems (LiveCodeBench)

Best at Terminal Coding

Agentic terminal coding tasks (Terminal-Bench 2.0)

Coding Benchmark Scores & Pricing

Complete coding benchmark results and pricing for every model. Click any column header to sort.

Filter:


Claude Opus 4.6 Anthropic	N/A	200K	$15.00	$75.00	82.0	91.3	80.8	95.0	76.0	65.4
Claude Sonnet 4.6 Anthropic	N/A	200K	$3.00	$15.00	79.1	89.9	79.6	92.1	72.4	59.1
DeepSeek R1 DeepSeek	671B	128K	$0.28	$0.42	84.0	71.5	49.2	90.2	65.9	N/A
DeepSeek V3 DeepSeek	671B	128K	$0.28	$1.10	81.2	68.4	38.8	N/A	49.2	N/A
DeepSeek V3.2 DeepSeek	685B	130K	$0.28	$0.42	85.0	79.9	67.8	N/A	74.1	39.6
Gemini 3.1 Pro Google	N/A	1M	$2.00	$12.00	85.0	91.9	78.0	93.0	81.3	56.2
GLM-5 Zhipu AI	744B	200K	N/A	N/A	70.4	86.0	77.8	90.0	52.0	56.2
GPT-5.4 OpenAI	N/A	1M	$2.50	$15.00	N/A	92.8	N/A	N/A	N/A	75.1
GPT-oss 120B OpenAI	117B	128K	N/A	N/A	90.0	80.9	62.4	88.3	60.0	18.7
Grok 3 xAI	N/A	131K	$3.00	$15.00	N/A	84.6	49.0	94.5	79.4	52.0
Kimi K2.5 Moonshot	1T	262K	N/A	N/A	87.1	87.6	76.8	99.0	85.0	50.8
Llama 4 Maverick Meta	400B	1M	N/A	N/A	80.5	69.8	N/A	62.0	43.4	N/A
MiMo-V2-Flash Xiaomi	309B	262K	N/A	N/A	84.9	83.7	73.4	84.8	80.6	38.5
MiniMax M2.5 MiniMax	230B	205K	$0.30	$1.20	76.5	85.2	80.2	89.6	65.0	42.2
Mistral Large Mistral	675B	256K	N/A	N/A	N/A	43.9	N/A	92.0	82.8	N/A
Nemotron Ultra 253B Nvidia	253B	128K	N/A	N/A	N/A	76.0	N/A	N/A	66.3	N/A
Qwen 3.5 Qwen	397B	262K	N/A	N/A	87.8	88.4	76.4	N/A	83.6	52.5
Step-3.5-Flash Stepfun	196B	262K	$0.10	$0.30	85.8	N/A	74.4	81.1	86.4	51.0

Compare Coding LLMs Head-to-Head

Select two models to see how they compare across all coding and reasoning benchmarks.

Model A

Model B

Claude Opus 4.6

GPT-5.4

GPQA Diamond

91.3

92.8

Terminal-Bench 2.0

65.4

75.1

Benchmarks won

Try These Models in Onyx

Onyx is the open-source AI platform that lets you connect any of these LLMs to your team's docs, apps, and people.

Book a Demo View on GitHub