AI Model Leaderboard

Powered by Artificial Analysis Intelligence Index

A composite benchmark aggregating seven challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

1. GPT-5.1 Chat (high)
Provider: OpenAI | AI Index: 67 | Context: 400K | Cost/1M: $3.44 | Available on LeemerChat
GPT‑5 family: pick based on cost/latency; quality scales with tier.

2. GPT-5.1 Chat (medium)
Provider: OpenAI | AI Index: 66 | Context: 400K | Cost/1M: $3.44 | Available on LeemerChat
GPT‑5 family: pick based on cost/latency; quality scales with tier.

3. Grok 4
Provider: xAI | AI Index: 65 | Context: 256K | Cost/1M: $6.00 | Available on LeemerChat
xAI’s most capable and sharpest model: confident, analytical, and fast. It reasons like other flagship systems while keeping Grok’s lively style, making it a strong choice for tough math, code, and synthesis tasks.

Showing 116 of 116 models

Rank | Model | Provider | AI Index | Context | Cost/1M | Availability
4 | GPT-5 mini (high) | OpenAI | 62 | 400K | $0.69 | Available
5 | GPT-5.1 Chat (low) | OpenAI | 62 | 400K | $3.44 | Available
6 | GPT-5 mini (medium) | OpenAI | 61 | 400K | $0.69 | Available
7 | Grok 4 Fast | xAI | 60 | 2M | $0.28 | Available
8 | Gemini 2.5 Pro | Google | 60 | 1M | $3.44 | Available
9 | Claude 4.1 Opus | Anthropic | 59 | 200K | $30.00 | -
10 | gpt-oss-120B (high) | OpenAI | 58 | 131K | $0.26 | Available
11 | Qwen3 235B 2507 | Alibaba | 57 | 256K | $2.63 | Available
12 | Grok 3 mini Reasoning (high) | xAI | 57 | 1M | $0.35 | -
13 | Claude 4 Sonnet | Anthropic | 57 | 1M | $6.00 | Available
14 | Qwen3 Next 80B A3B | Alibaba | 54 | 262K | $1.88 | Available
15 | Claude 4 Opus | Anthropic | 54 | 200K | $30.00 | -
16 | DeepSeek V3.1 | DeepSeek | 54 | 128K | $0.96 | Available
17 | Magistral Medium 1.2 | Mistral | 52 | 128K | $2.75 | -
18 | DeepSeek R1 0528 | DeepSeek | 52 | 128K | $0.96 | -
19 | Gemini 2.5 Flash | Google | 51 | 1M | $0.85 | Available
20 | Kimi K2 0905 | Moonshot AI | 50 | 256K | $1.34 | Available
21 | GLM-4.5 | ZhipuAI | 49 | 128K | $0.97 | Available
22 | GLM-4.5-Air | ZhipuAI | 49 | 128K | $0.42 | -
23 | Grok Code Fast 1 | xAI | 49 | 256K | $0.53 | -
24 | Qwen3 Max (Preview) | Alibaba | 49 | 262K | $2.40 | -
25 | GPT-5 nano (high) | OpenAI | 49 | 400K | $0.14 | -
26 | Kimi K2 | Moonshot AI | 48 | 128K | $1.07 | -
27 | GPT-5 nano (medium) | OpenAI | 48 | 400K | $0.14 | -
28 | Qwen3 30B 2507 | Alibaba | 46 | 262K | $0.75 | -
29 | MiniMax M1 80k | MiniMax | 46 | 1M | $0.82 | -
30 | Qwen3 235B 2507 | Alibaba | 45 | 256K | $1.23 | Available
31 | Llama Nemotron Super 49B v1.5 | NVIDIA | 45 | 128K | $0.17 | -
32 | Qwen3 Next 80B A3B | Alibaba | 45 | 262K | $0.88 | Available
33 | gpt-oss-20B (high) | OpenAI | 45 | 131K | $0.09 | -
34 | DeepSeek V3.1 | DeepSeek | 45 | 128K | $0.48 | -
35 | Claude 4.1 Opus | Anthropic | 45 | 200K | $30.00 | -
36 | Claude 4 Sonnet | Anthropic | 44 | 1M | $6.00 | -
37 | GPT-5.1 Chat (minimal) | OpenAI | 43 | 400K | $3.44 | -
38 | GPT-4.1 | OpenAI | 43 | 1M | $3.50 | -
39 | Qwen3 4B 2507 | Alibaba | 43 | 262K | $0.00 | -
40 | Magistral Small 1.2 | Mistral | 43 | 128K | $0.75 | -
41 | EXAONE 4.0 32B | LG AI Research | 43 | 131K | $0.70 | -
42 | GPT-4.1 mini | OpenAI | 42 | 1M | $0.70 | -
43 | Claude 4 Opus | Anthropic | 42 | 200K | $30.00 | -
45 | MiniMax M1 40k | MiniMax | 42 | 1M | $0.82 | -
46 | GPT-5 mini (minimal) | OpenAI | 42 | 400K | $0.69 | -
47 | Hermes 4 405B | Nous Research | 42 | 128K | $1.50 | Available
48 | Grok 3 Reasoning Beta | xAI | 41 | 1M | $0.00 | -
49 | Gemini 2.5 Flash | Google | 40 | 1M | $0.85 | Available
50 | Gemini 2.5 Flash-Lite | Google | 40 | 1M | $0.17 | -
51 | Hermes 4 - Llama-3.1 70B | Nous Research | 39 | 128K | $0.20 | Available
52 | Grok 4 Fast | xAI | 39 | 2M | $0.28 | Available
53 | Llama Nemotron Ultra | NVIDIA | 38 | 128K | $0.90 | -
54 | NVIDIA Nemotron Nano 9B V2 | NVIDIA | 38 | 131K | $0.07 | -
55 | QwQ-32B | Alibaba | 38 | 131K | $0.48 | -
56 | Solar Pro 2 | Upstage | 38 | 66K | $0.50 | -
57 | GLM-4.5V | ZhipuAI | 37 | 64K | $0.85 | Available
58 | Qwen3 30B 2507 | Alibaba | 37 | 262K | $0.35 | -
59 | NVIDIA Nemotron Nano 9B V2 | NVIDIA | 37 | 131K | $0.07 | -
60 | Grok 3 | xAI | 36 | 1M | $6.00 | -
61 | Llama 4 Maverick | Meta | 36 | 1M | $0.39 | Available
62 | Llama 3.3 Nemotron Super 49B | NVIDIA | 35 | 128K | $0.00 | -
63 | Mistral Medium 3.1 | Mistral | 35 | 128K | $0.80 | -
64 | DeepSeek R1 0528 Qwen3 8B | DeepSeek | 35 | 33K | $0.07 | Available
65 | Mistral Medium 3 | Mistral | 35 | 128K | $0.80 | -
66 | Magistral Medium 1 | Mistral | 34 | 40K | $2.75 | -
67 | EXAONE 4.0 32B | LG AI Research | 33 | 131K | $0.70 | -
68 | Qwen3 Coder 30B | Alibaba | 33 | 262K | $0.90 | Available
69 | Hermes 4 405B | Nous Research | 33 | 128K | $1.50 | -
70 | Reka Flash 3 | Reka AI | 33 | 128K | $0.35 | -
71 | Magistral Small 1 | Mistral | 32 | 40K | $0.75 | -
72 | Nova Premier | Amazon | 31 | 1M | $5.00 | -
73 | Solar Pro 2 | Upstage | 30 | 66K | $0.50 | -
74 | Gemini 2.5 Flash-Lite | Google | 30 | 1M | $0.17 | -
75 | Mistral Small 3.2 | Mistral | 29 | 128K | $0.15 | -
76 | GPT-5 nano (minimal) | OpenAI | 29 | 400K | $0.14 | -
77 | Llama 4 Scout | Meta | 28 | 10M | $0.26 | -
78 | Command A | Cohere | 28 | 256K | $4.38 | -
79 | Llama 3.3 70B | Meta | 28 | 128K | $0.60 | -
80 | Devstral Medium | Mistral | 28 | 256K | $0.80 | -
81 | GPT-4.1 nano | OpenAI | 27 | 1M | $0.17 | -
82 | Exaone 4.0 1.2B | LG AI Research | 27 | 64K | $0.00 | -
83 | Llama Nemotron Super 49B v1.5 | NVIDIA | 27 | 128K | $0.17 | -
84 | Llama 3.1 Nemotron Nano 4B v1.1 | NVIDIA | 26 | 128K | $0.00 | -
85 | GLM-4.5V | ZhipuAI | 26 | 64K | $0.90 | -
86 | Llama 3.3 Nemotron Super 49B v1 | NVIDIA | 26 | 128K | $0.00 | -
87 | MiniMax-Text-01 | MiniMax | 26 | 4M | $0.42 | -
88 | Llama 3.1 405B | Meta | 26 | 128K | $3.50 | -
89 | GPT-4o (ChatGPT) | OpenAI | 25 | 128K | $7.50 | Available
90 | Phi-4 | Microsoft | 25 | 16K | $0.22 | -
91 | Hermes 4 70B | Nous Research | 24 | 128K | $0.20 | Available
92 | Llama 3.1 Nemotron 70B | NVIDIA | 24 | 128K | $0.60 | -
93 | Gemma 3 27B | Google | 22 | 128K | $0.00 | -
94 | GPT-4o mini | OpenAI | 21 | 128K | $0.26 | Available
95 | Jamba 1.7 Large | AI21 Labs | 21 | 256K | $3.50 | -
96 | Gemma 3 12B | Google | 21 | 128K | $0.24 | -
97 | Exaone 4.0 1.2B | LG AI Research | 20 | 64K | $0.00 | -
98 | R1 1776 | Perplexity | 19 | 128K | $3.50 | -
99 | Llama 3.2 90B (Vision) | Meta | 19 | 128K | $0.72 | -
100 | Devstral Small | Mistral | 18 | 256K | $0.15 | -
101 | Gemma 3n E4B | Google | 16 | 32K | $0.03 | -
102 | DeepHermes 3 - Mistral 24B | Nous Research | 16 | 32K | $0.00 | -
103 | Granite 3.3 8B | IBM | 15 | 128K | $0.09 | -
104 | Gemma 3 4B | Google | 15 | 128K | $0.05 | -
105 | Llama 3.2 11B (Vision) | Meta | 15 | 128K | $0.16 | -
106 | Codestral (Jan) | Mistral | 13 | 256K | $0.45 | -
107 | Phi-4 Multimodal | Microsoft | 12 | 128K | $0.00 | -
108 | LFM2 1.2B | Liquid AI | 8 | 33K | $0.00 | -
109 | Gemma 3n E2B | Google | 8 | 32K | $0.00 | -
110 | Ministral 8B | Mistral | 8 | 128K | $0.10 | -
111 | Gemma 3 1B | Google | 6 | 32K | $0.00 | -
112 | Aya Expanse 32B | Cohere | 6 | 128K | $0.75 | -
113 | Ministral 3B | Mistral | 5 | 128K | $0.04 | -
114 | Jamba 1.7 Mini | AI21 Labs | 4 | 258K | $0.25 | -
115 | DeepHermes 3 - Llama-3.1 8B | Nous Research | 2 | 128K | $0.00 | -
116 | Aya Expanse 8B | Cohere | 2 | 8K | $0.75 | -
117 | Grok 3 mini Reasoning (low) | xAI | 0 | 1M | $0.35 | -

How Does Artificial Analysis Really Test AI Models—And Why Should We Trust Them?

In the rapidly evolving world of AI, countless benchmarks claim to measure model intelligence. But when you see those neat leaderboards ranking GPT-4, Claude, and Gemini, how do you know the results are actually reliable?

The Trust Problem in AI Benchmarking

Most AI benchmarks suffer from serious flaws: cherry-picked examples, inconsistent testing conditions, or evaluation methods that favor certain model architectures. Artificial Analysis tackles this head-on with what they call their Intelligence Index—a comprehensive suite that puts models through identical, rigorous testing across multiple dimensions.

Four Pillars of Reliable Testing

1. Standardized Everything

Every model is tested under the same conditions: the same prompts, the same maximum token limits, and a fixed temperature per model class (0 for standard models, 0.6 for reasoning models). No exceptions, no special treatment.
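
As a rough illustration of what fixed settings per model class could look like, here is a minimal Python sketch. The `EvalSettings` type, the `settings_for` helper, and the default token limit are hypothetical; only the two temperature values come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSettings:
    """Sampling settings shared by every model in the same class."""
    temperature: float
    max_output_tokens: int


def settings_for(is_reasoning_model: bool, max_output_tokens: int = 16384) -> EvalSettings:
    # Temperatures are the ones quoted in the article. The default token
    # limit is a placeholder assumption, not a published value.
    temperature = 0.6 if is_reasoning_model else 0.0
    return EvalSettings(temperature=temperature, max_output_tokens=max_output_tokens)


standard = settings_for(is_reasoning_model=False)
reasoning = settings_for(is_reasoning_model=True)
print(standard.temperature, reasoning.temperature)
```

Freezing the dataclass keeps a run from mutating its own settings halfway through, which is one simple way to make "no special treatment" hard to violate by accident.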

2. Unbiased Evaluation

They use sophisticated answer extraction methods that don't penalize models for different response styles. Multiple regex patterns catch various answer formats, and when automated scoring isn't enough, they deploy LLM-based equality checkers.
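
To make the idea of multi-pattern answer extraction concrete, here is a small Python sketch for multiple-choice questions with options A through J. The patterns and the `extract_choice` function are illustrative assumptions; the patterns Artificial Analysis actually publishes may differ.

```python
import re
from typing import Optional

# Hypothetical patterns, tried in priority order.
ANSWER_PATTERNS = [
    # "The answer is (C)", "Answer: B"
    re.compile(r"(?i:answer(?:\s+is)?)\s*:?\s*\(?([A-J])\)?(?![A-Za-z])"),
    # LaTeX-style "\boxed{D}"
    re.compile(r"\\boxed\{\s*([A-J])\s*\}"),
    # A line that contains nothing but the letter, e.g. "(E)"
    re.compile(r"^\s*\(?([A-J])\)?\s*$", re.MULTILINE),
]


def extract_choice(response: str) -> Optional[str]:
    """Return the last answer letter found by the first pattern that matches."""
    for pattern in ANSWER_PATTERNS:
        matches = pattern.findall(response)
        if matches:
            # Models often restate the answer; prefer the final statement.
            return matches[-1]
    return None


print(extract_choice("Let me think... The answer is (C)."))
print(extract_choice("Final result: \\boxed{D}"))
print(extract_choice("I am not sure"))
```

Responses for which every pattern fails return `None`, which is the natural point to hand the response to a fallback such as an LLM-based equality checker.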

3. Zero-Shot Testing

No hand-holding with examples or demonstrations. Models must follow clear instructions from scratch—the way they'd actually be used in practice.

4. Complete Transparency

Unlike many benchmarking organizations, Artificial Analysis publishes their exact prompts, evaluation criteria, and even their regex patterns for answer extraction.

What Gets Tested (And Why It Matters)

The Academic Gauntlet

  • MMLU-Pro: 12,032 questions spanning physics to philosophy, with 10 answer choices instead of the usual 4
  • Humanity's Last Exam (HLE): 2,684 frontier-level questions designed to challenge even the best models
  • GPQA Diamond: 198 graduate-level scientific problems that stump non-experts

Mathematical Precision

AIME 2025: Competition-level math problems where every answer is an exact integer from 0 to 999. They use both symbolic computation (SymPy) and LLM verification to catch equivalent answers.
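
The sketch below shows the kind of equivalence check this implies, using only Python's standard library. It is a simplified stand-in, not the SymPy and LLM pipeline described above: it treats strings such as "042", "42.0", and "84/2" as the same integer and rejects anything outside the AIME range of 0 to 999. The function names are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def normalize_aime_answer(raw: str) -> Optional[int]:
    """Return the answer as an int in [0, 999], or None if it is not one."""
    text = raw.strip().replace(",", "")
    try:
        value = Fraction(text)  # parses "42", "042", "42.0", "84/2"
    except (ValueError, ZeroDivisionError):
        return None
    if value.denominator != 1 or not 0 <= value <= 999:
        return None
    return int(value)


def answers_match(predicted: str, reference: str) -> bool:
    """True only when both strings normalize to the same valid integer."""
    p = normalize_aime_answer(predicted)
    r = normalize_aime_answer(reference)
    return p is not None and p == r


print(answers_match("042", "42"))
print(answers_match("84/2", "42"))
print(answers_match("41", "42"))
```

A symbolic engine goes further than this, for example by simplifying expressions like `sqrt(1764)`, which is why a plain string or fraction comparison alone would undercount correct answers.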

Real-World Coding

  • SciCode: 338 scientific computing tasks that require domain expertise
  • LiveCodeBench: 315 competitive programming problems with pass@1 scoring (no partial credit)
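
For readers unfamiliar with pass@1, the following Python sketch shows one common way to compute it: an attempt counts only if it passes every test case, and the score is the average pass rate over problems. The data layout and function names are assumptions for illustration.

```python
from typing import Dict, List


def attempt_passed(test_results: List[bool]) -> bool:
    """No partial credit: an attempt passes only if every test case passes."""
    return bool(test_results) and all(test_results)


def pass_at_1(attempts_by_problem: Dict[str, List[bool]]) -> float:
    """Mean over problems of the fraction of single attempts that passed."""
    if not attempts_by_problem:
        raise ValueError("no results to score")
    rates = [sum(a) / len(a) for a in attempts_by_problem.values()]
    return sum(rates) / len(rates)


results = {
    "p1": [attempt_passed([True, True]), attempt_passed([True, True])],
    "p2": [attempt_passed([True, False]), attempt_passed([True, True])],
    "p3": [attempt_passed([False, False]), attempt_passed([False, True])],
}
print(pass_at_1(results))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```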

Beyond Simple Q&A

  • Long Context Reasoning: 100 questions requiring analysis of ~100,000-token documents
  • Instruction Following: 294 precise tasks testing whether models can follow complex directions
  • Agentic Workflows: Terminal-based tasks and conversational scenarios that simulate real AI assistant work

The Reliability Factor

Here's what sets Artificial Analysis apart: multiple attempts with statistical rigor. Each evaluation runs multiple times (up to 10 repeats for challenging math problems), with automatic retries for API failures. They estimate a 95% confidence interval of less than ±1% for their overall Intelligence Index.
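
A confidence interval over repeated runs can be estimated in a few lines of Python. This sketch uses a normal approximation (1.96 standard errors), which is an assumption; the article does not say which estimator Artificial Analysis uses, and the scores below are made up.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci95(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two repeats to estimate spread")
    mean = statistics.fmean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return mean, 1.96 * standard_error


# Hypothetical index scores from four repeated runs of the same model.
mean, half_width = mean_with_ci95([60.0, 61.0, 59.0, 60.0])
print(f"{mean:.1f} ± {half_width:.2f}")
```

With only a handful of repeats, a Student's t multiplier would give a wider and more honest interval than 1.96; the normal approximation is kept here for brevity.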

When models get "stuck" in agentic tasks, they deploy sophisticated loop-detection algorithms. When automated scoring might miss nuanced answers, they use carefully validated LLM judges (like Llama 3.3 70B for math equivalence, tested to >99% accuracy against human judgment).
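
The article does not describe how loop detection works, so the following Python sketch is purely hypothetical. It flags an agent as stuck when its most recent actions consist of the same short sequence repeated several times in a row.

```python
from typing import Sequence


def is_stuck(actions: Sequence[str], window: int = 3, repeats: int = 3) -> bool:
    """True if the last `window` actions occur `repeats` times back to back."""
    needed = window * repeats
    if window < 1 or repeats < 2 or len(actions) < needed:
        return False
    tail = list(actions[-needed:])
    pattern = tail[:window]
    # Compare each consecutive chunk of the tail against the first chunk.
    return all(tail[i:i + window] == pattern for i in range(0, needed, window))


history = ["ls", "cat a.txt", "ls", "cat a.txt", "ls", "cat a.txt"]
print(is_stuck(history, window=2, repeats=3))
print(is_stuck(history[:-1] + ["cat b.txt"], window=2, repeats=3))
```

A harness could call such a check after every agent step and end the episode early once it returns `True`, rather than letting a looping model burn through its step budget.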

Why This Approach Works

Comprehensive Coverage: Rather than focusing on one capability, they test reasoning, knowledge, math, coding, and instruction-following—then weight them appropriately in a unified score.
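
A unified score of this kind is a weighted average. The Python sketch below shows the arithmetic with equal weights and made-up category scores; the actual weights and categories used for the Intelligence Index are not given in this article, so every number here is an assumption.

```python
from typing import Dict


def composite_index(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-category scores, normalized by total weight."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical per-category results on a 0-100 scale, equally weighted.
scores = {"reasoning": 70.0, "math": 50.0, "coding": 60.0}
weights = {"reasoning": 1.0, "math": 1.0, "coding": 1.0}
print(composite_index(scores, weights))  # (70 + 50 + 60) / 3 = 60.0
```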

Failure Handling: Up to 30 automatic retries on API failures, with manual review of persistent issues. Results aren't published if technical problems compromise reliability.
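
A retry policy like this is straightforward to sketch in Python. The limit of 30 retries comes from the text; the exponential backoff, the delay values, and the `call_with_retries` name are assumptions added for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    max_retries: int = 30,          # retry limit quoted in the article
    base_delay: float = 1.0,        # assumed backoff parameters
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on failure with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            # A real harness would catch only transient API errors here.
            if attempt == max_retries:
                raise  # persistent failure: surface it for manual review
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")


calls = {"n": 0}

def flaky_request() -> str:
    # Simulated API call that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retries(flaky_request, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter lets tests skip the real delays, as the usage lines above do.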

Real-World Conditions: Testing runs on Ubuntu 22.04 with Python 3.12, a setup typical of the environments where model-generated code is actually run.

The Bottom Line

Artificial Analysis doesn't just test AI models—they've built a systematic approach to measuring machine intelligence that prioritizes reliability over flashy headlines. Their methodology addresses the key weaknesses that plague other benchmarks: inconsistent testing, biased evaluation, and lack of transparency.

When you see their Intelligence Index rankings, you're looking at results from one of the most rigorous testing processes in the AI evaluation space. That's why researchers, developers, and companies increasingly rely on their benchmarks to make informed decisions about model capabilities.