Independent analysis of AI
Understand the AI landscape to choose the best model and provider for your use case
Highlights
Intelligence
Artificial Analysis Intelligence Index; Higher is better
Speed
Output Tokens per Second; Higher is better
Price
USD per 1M Tokens; Lower is better
How do OpenAI, Google, Meta & DeepSeek's new models compare?
Recent Models Compared
How do the latest OpenAI models compare?
GPT-4.1 Models Compared
Where can you get an API for DeepSeek R1?
DeepSeek R1 Providers
Who has the best Video Generation model?
Video Arena
Which model is fastest with 100k token prompts?
Long Context Latency
Artificial Analysis Intelligence Index
Artificial Analysis Intelligence Index incorporates 7 evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500
Artificial Analysis Intelligence Index: Combination metric covering multiple dimensions of intelligence - the simplest way to compare how smart models are. Version 2 was released in Feb '25 and includes: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, MATH-500. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.
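As an illustration of how a composite index of this kind can be computed, here is a minimal sketch that assumes an unweighted average of the seven evaluation scores; the actual Intelligence Index weighting and normalisation are defined in the methodology linked above, so treat this as a sketch only.

```python
# Hypothetical sketch: combine per-evaluation scores (0-100) into a single
# index via an unweighted mean. The equal weighting is an assumption for
# illustration, not the published Artificial Analysis methodology.
EVALS = [
    "MMLU-Pro", "GPQA Diamond", "Humanity's Last Exam",
    "LiveCodeBench", "SciCode", "AIME", "MATH-500",
]

def intelligence_index(scores: dict[str, float]) -> float:
    """Average the seven evaluation scores, each expressed on a 0-100 scale."""
    return sum(scores[e] for e in EVALS) / len(EVALS)

# Made-up example scores:
example = dict(zip(EVALS, [80, 70, 20, 65, 40, 85, 95]))
print(round(intelligence_index(example), 1))  # 65.0
```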
Artificial Analysis Intelligence Index by Model Type
Reasoning Model
Non-Reasoning Model
Artificial Analysis Intelligence Index by Open Weights vs Proprietary
Proprietary
Open Weights
Open Weights: Indicates whether the model weights are available. Models are labelled as 'Commercial Use Restricted' if the weights are available but commercial use is limited (typically requires obtaining a paid license).
Artificial Analysis Coding Index
Represents the average of coding benchmarks in the Artificial Analysis Intelligence Index (LiveCodeBench & SciCode)
Artificial Analysis Coding Index: Represents the average of coding evaluations in the Artificial Analysis Intelligence Index. Currently includes: LiveCodeBench, SciCode. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.
Artificial Analysis Math Index
Represents the average of math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & MATH-500)
Artificial Analysis Math Index: Represents the average of math evaluations in the Artificial Analysis Intelligence Index. Currently includes: AIME, MATH-500. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.
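In the same spirit, the Coding and Math sub-indices can be read as simple means over their constituent evaluations; a minimal sketch under the same equal-weighting assumption (scores below are made up):

```python
# Hypothetical sketch: sub-indices as unweighted means of their evaluations.
# Equal weighting is assumed for illustration; scores below are made up.
CODING_EVALS = ["LiveCodeBench", "SciCode"]
MATH_EVALS = ["AIME 2024", "MATH-500"]

def sub_index(scores: dict[str, float], evals: list[str]) -> float:
    return sum(scores[e] for e in evals) / len(evals)

scores = {"LiveCodeBench": 65, "SciCode": 40, "AIME 2024": 85, "MATH-500": 95}
print(sub_index(scores, CODING_EVALS))  # 52.5 -> Coding Index
print(sub_index(scores, MATH_EVALS))    # 90.0 -> Math Index
```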
Frontier Language Model Intelligence, Over Time
OpenAI
Meta
Google
Anthropic
Mistral
DeepSeek
xAI
Alibaba
Intelligence Evaluations
Intelligence evaluations measured independently by Artificial Analysis; Higher is better
Results claimed by AI Lab (not yet independently verified)
MMLU-Pro (Reasoning & Knowledge)
GPQA Diamond (Scientific Reasoning)
Humanity's Last Exam (Reasoning & Knowledge)
LiveCodeBench (Coding)
SciCode (Coding)
HumanEval (Coding)
MATH-500 (Quantitative Reasoning)
AIME 2024 (Competition Math)
Intelligence vs. Price
Artificial Analysis Intelligence Index; Price: USD per 1M Tokens
Chart legend (most attractive quadrant highlighted): GPT-4.1, o3, o4-mini (high), Llama 4 Maverick, Llama 4 Scout, Gemini 2.5 Flash (Reasoning), Gemini 2.5 Pro (Jun '25), Claude 4 Opus Thinking, Claude 4 Sonnet Thinking, Claude 4 Sonnet, Mistral Medium 3, DeepSeek R1 0528 (May '25), DeepSeek V3 0324 (Mar '25), Grok 3 mini Reasoning (high), Nova Premier, Llama Nemotron Ultra Reasoning, Qwen3 235B (Reasoning), GPT-4o (Nov '24), DeepSeek R1 (Jan '25)
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 input-to-output ratio).
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
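To make the 3:1 blend concrete, here is a small worked sketch; the prices used are placeholders, not measured figures:

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blend input/output prices at a 3:1 input-to-output token ratio (USD per 1M tokens)."""
    return (3 * input_price + output_price) / 4

# Placeholder prices: $2.00 per 1M input tokens, $8.00 per 1M output tokens
print(blended_price(2.00, 8.00))  # 3.5 (USD per 1M blended tokens)
```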
Intelligence vs. Output Speed
Artificial Analysis Intelligence Index; Output Speed: Output Tokens per Second
Chart legend (most attractive quadrant highlighted): GPT-4.1, o3, o4-mini (high), Llama 4 Maverick, Llama 4 Scout, Gemini 2.5 Flash (Reasoning), Gemini 2.5 Pro (Jun '25), Claude 4 Opus Thinking, Claude 4 Sonnet Thinking, Claude 4 Sonnet, Mistral Medium 3, DeepSeek R1 0528 (May '25), DeepSeek V3 0324 (Mar '25), Grok 3 mini Reasoning (high), Nova Premier, Llama Nemotron Ultra Reasoning, Qwen3 235B (Reasoning), GPT-4o (Nov '24)
Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API, for models which support streaming).
Output Speed
Output Tokens per Second; Higher is better
Latency: Time To First Answer Token
Seconds to First Answer Token Received; Accounts for Reasoning Model 'Thinking' time
Input processing
Thinking (reasoning models, when applicable)
Time To First Answer Token: Time, in seconds, from sending the API request to receiving the first answer token. For reasoning models, this includes the model's 'thinking' time before providing an answer. For models which do not support streaming, this represents the time to receive the completion.
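For illustration, a minimal client-side sketch of measuring time-to-first-token and output speed from a streaming response; `chunks` stands in for whatever token stream your client returns and is not a specific provider API:

```python
import time
from typing import Iterable, Optional, Tuple

def measure_stream(chunks: Iterable[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (time_to_first_token_s, output_tokens_per_s) for a token stream.

    `chunks` is assumed to yield one token (or small chunk) at a time from a
    streaming API; output speed is measured only after the first chunk arrives.
    """
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    n_chunks = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1
    end = time.perf_counter()
    if first_token_at is None:
        return None, None  # the stream produced no tokens
    ttft = first_token_at - start
    generating = end - first_token_at
    tokens_per_s = (n_chunks - 1) / generating if generating > 0 else None
    return ttft, tokens_per_s

# Usage (hypothetical stream): ttft, tps = measure_stream(stream_completion(prompt))
```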
End-to-End Response Time
Seconds to Output 500 Tokens, including reasoning model 'thinking' time; Lower is better
Input processing time
'Thinking' time (reasoning models)
Outputting time
End-to-End Response Time: Seconds to receive a 500 token response. Key components:
- Input time: Time to receive the first response token
- Thinking time (only for reasoning models): Time reasoning models spend outputting tokens to reason prior to providing an answer. The token count is based on the average number of reasoning tokens across a diverse set of 60 prompts (see methodology details).
- Answer time: Time to generate 500 output tokens, based on output speed
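Putting those components together, a small sketch of how the end-to-end figure can be assembled; the latency, reasoning-token count, and speed below are placeholders:

```python
def end_to_end_seconds(input_latency_s: float, reasoning_tokens: int,
                       output_tokens_per_s: float, answer_tokens: int = 500) -> float:
    """Estimate end-to-end response time: input latency + 'thinking' + answer generation.

    Assumes reasoning tokens are generated at the same speed as answer tokens.
    """
    thinking_s = reasoning_tokens / output_tokens_per_s
    answer_s = answer_tokens / output_tokens_per_s
    return input_latency_s + thinking_s + answer_s

# Placeholder figures: 0.5 s to first token, 2,000 reasoning tokens, 100 tokens/s
print(round(end_to_end_seconds(0.5, 2000, 100), 1))  # 25.5
```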
Pricing: Input and Output Prices
Price: USD per 1M Tokens
Input price
Output price
Input Price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
API Provider Highlights: Llama 4 Maverick
Output Speed vs. Price: Llama 4 Maverick
Output Speed: Output Tokens per Second, Price: USD per 1M Tokens; 1,000 Input Tokens
Chart legend (most attractive quadrant highlighted): Lambda (FP8), Parasail (FP8), Amazon, Google Vertex, CentML (FP8), Azure (FP8), Fireworks (Base), Deepinfra (FP8), Deepinfra (Turbo, FP8), Novita (FP8), GMI (FP8), Groq, SambaNova, Together.ai, kluster.ai (FP8)
Median: Figures represent median (P50) measurement over the past 72 hours to reflect sustained changes in performance.
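A minimal sketch of the 72-hour P50 aggregation, assuming timestamped (time, value) samples; this mirrors the description above rather than the exact pipeline used:

```python
from datetime import datetime, timedelta
from statistics import median

def p50_last_72h(samples: list[tuple[datetime, float]], now: datetime) -> float:
    """Median (P50) of measurements taken within the past 72 hours."""
    cutoff = now - timedelta(hours=72)
    recent = [value for ts, value in samples if ts >= cutoff]
    return median(recent)  # raises StatisticsError if there are no recent samples
```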
Notes on endpoint context windows for Llama 4 Maverick: Parasail (FP8): 1M; Amazon: 128k; Google Vertex: 524k; CentML (FP8): 1M; Azure (FP8): 128k; Fireworks (Base): 1M; Deepinfra (FP8): 131k; Deepinfra (Turbo, FP8): 8k; Novita (FP8): 1M; GMI (FP8): 1M; Groq: 128k; SambaNova: 131k; Together.ai: 524k.
Pricing (Input and Output Prices): Llama 4 Maverick
Price: USD per 1M Tokens; Lower is better; 1,000 Input Tokens
Input price
Output price
Input Price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
Output Price: Price per token generated by the model (received from the API), represented as USD per million Tokens.
Output Speed: Llama 4 Maverick
Output Speed: Output Tokens per Second; 1,000 Input Tokens
Output Speed, Over Time: Llama 4 Maverick
Output Tokens per Second; Higher is better; 1,000 Input Tokens
Over time measurement: Median measurement per day, based on 8 measurements taken each day at different times. Labels represent the start of each week's measurements.
See more information on any of our supported models