Blog · February 15, 2026
# The Cheapest AI Models That Are Actually Good
Most teams pick from a shortlist of three or four well-known models — GPT-4.1, Claude 4.5 Sonnet, maybe Gemini — without checking what else is available at the same quality tier.
We ranked the 410 models in our pricing database by output price within quality tiers. The results: the cheapest flagship-tier models cost $0.30–$0.42 per million output tokens. The popular defaults charge $10–$80 for comparable benchmark scores.
## How we ranked them
For each model, we looked at:
- $/1M input and $/1M output — list prices per million tokens from each vendor. This is the industry-standard pricing unit.
- Intelligence Index — a composite benchmark score from MMLU-Pro, GPQA, and AIME.
Models are sorted by output price ascending — cheapest first — because output tokens typically dominate cost in production. Input prices are shown alongside for workloads with large prompts or long context windows.
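In code, the ranking is just a filter and a sort. Here's a minimal sketch in Python, using a few rows from the tables below (the real database has 410):

```python
# Minimal ranking sketch: keep models above a quality floor, then sort by
# output price ascending. Prices are list $/1M tokens; II = Intelligence Index.
models = [
    # (name, vendor, intelligence_index, usd_per_1m_in, usd_per_1m_out)
    ("gpt-oss-120B (high)", "OpenAI", 33.3, 0.039, 0.19),
    ("MiMo-V2-Flash", "Xiaomi", 41.4, 0.10, 0.30),
    ("GPT-4.1", "OpenAI", 25.6, 2.00, 8.00),
    ("o3-pro", "OpenAI", 40.7, 20.00, 80.00),
]

QUALITY_FLOOR = 25  # Intelligence Index threshold used for the top-10 list

ranked = sorted(
    (m for m in models if m[2] >= QUALITY_FLOOR),
    key=lambda m: m[4],  # $/1M output, cheapest first
)
for name, vendor, ii, p_in, p_out in ranked:
    print(f"{name:<22} {vendor:<8} II {ii:>4}  ${p_in}/1M in  ${p_out}/1M out")
```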
## The top 10 cheapest good models
These models score at least 25 on the Intelligence Index (solid production quality) and have the lowest output prices. Sorted by $/1M output, cheapest first.
| # | Model | Vendor | Intelligence Index | $/1M in | $/1M out |
|---|---|---|---|---|---|
| 1 | gpt-oss-120B (high) | OpenAI | 33.3 | $0.039 | $0.19 |
| 2 | MiMo-V2-Flash | Xiaomi | 41.4 | $0.10 | $0.30 |
| 3 | GLM-4.7-Flash (Reasoning) | Z AI | 30.1 | $0.065 | $0.40 |
| 4 | GPT-5 nano (high) | OpenAI | 26.7 | $0.05 | $0.40 |
| 5 | DeepSeek V3.2 (Reasoning) | DeepSeek | 41.6 | $0.28 | $0.42 |
| 6 | Grok 4.1 Fast (Reasoning) | xAI | 38.5 | $0.20 | $0.50 |
| 7 | Grok 4 Fast (Reasoning) | xAI | 34.9 | $0.20 | $0.50 |
| 8 | Mercury 2 | Inception | 32.8 | $0.25 | $0.75 |
| 9 | MiniMax-M2.5 | MiniMax | 42.0 | $0.30 | $1.20 |
| 10 | KAT-Coder-Pro V1 | KwaiKAT | 36.1 | $0.30 | $1.20 |
The #1 spot goes to OpenAI's gpt-oss-120B: a mid-tier Intelligence Index of 33.3 at just $0.039/1M input and $0.19/1M output, by far the cheapest model with solid quality. For flagship-level intelligence, Xiaomi's MiMo-V2-Flash crosses the 40+ threshold (II 41.4) at $0.10/$0.30, and DeepSeek V3.2 (II 41.6) follows at $0.28/$0.42.
Eight of the ten models in this list aren't from OpenAI or Anthropic. The cheapest good AI right now is coming from Xiaomi, Z AI, xAI, DeepSeek, Inception, MiniMax, and KwaiKAT.
## Flagship intelligence doesn't have to cost flagship prices
Here are models scoring 40+ on the Intelligence Index — the flagship tier — sorted by output price. The cheapest costs $0.30/1M output. The most expensive costs $80.00. Same benchmark tier.
| Model | Vendor | Intelligence Index | $/1M in | $/1M out |
|---|---|---|---|---|
| MiMo-V2-Flash | Xiaomi | 41.4 | $0.10 | $0.30 |
| DeepSeek V3.2 (Reasoning) | DeepSeek | 41.6 | $0.28 | $0.42 |
| GPT-5 mini (high) | OpenAI | 41.0 | $0.25 | $2.00 |
| Gemini 3 Flash Preview (Reasoning) | Google | 46.4 | $0.50 | $3.00 |
| Kimi K2.5 (Reasoning) | Kimi | 46.7 | $0.60 | $3.00 |
| GLM-5 (Reasoning) | Z AI | 49.6 | $1.00 | $3.20 |
| GPT-5 (high) | OpenAI | 44.6 | $1.25 | $10.00 |
| Claude Sonnet 4.6 (Adaptive Reasoning) | Anthropic | 51.3 | $3.00 | $15.00 |
| Claude Opus 4.6 (Adaptive Reasoning) | Anthropic | 53.0 | $5.00 | $25.00 |
| o3-pro | OpenAI | 40.7 | $20.00 | $80.00 |
The first six rows come in under $4/1M output; the last four are the common defaults at $10+/1M output. All prices are list prices per million tokens from each vendor.
Six models clear the 40+ Intelligence Index bar for under $4/1M output. The popular defaults — from OpenAI, Anthropic, and others — deliver similar scores between $10 and $80/1M output.
The output price gap between the cheapest (MiMo-V2-Flash at $0.30/1M) and most expensive (o3-pro at $80.00/1M) flagship model is 267x. The Intelligence Index difference is 0.7 points — MiMo actually scores higher than o3-pro.
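To make that gap concrete, here's what it looks like at an assumed volume of 1M output tokens per day (a hypothetical workload; substitute your own):

```python
# What a 267x output-price gap means in dollars, at an assumed volume.
MIMO_OUT = 0.30     # $/1M output tokens, MiMo-V2-Flash list price
O3_PRO_OUT = 80.00  # $/1M output tokens, o3-pro list price

output_millions_per_day = 1.0  # assumed workload
days = 30

mimo = MIMO_OUT * output_millions_per_day * days      # $9/month
o3_pro = O3_PRO_OUT * output_millions_per_day * days  # $2,400/month
print(f"MiMo-V2-Flash: ${mimo:,.2f}/mo   o3-pro: ${o3_pro:,.2f}/mo")
print(f"price ratio: {o3_pro / mimo:.0f}x")           # ~267x
```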
## Where the defaults land
These are the models most teams use without thinking twice. Here's how their list prices compare to the value leaders.
| | Model | II | $/1M in | $/1M out |
|---|---|---|---|---|
| Default | o3-pro | 40.7 | $20.00 | $80.00 |
| Alternative | DeepSeek V3.2 (Reasoning) | 41.6 | $0.28 | $0.42 |
| Default | Claude Opus 4.6 (Adaptive Reasoning) | 53.0 | $5.00 | $25.00 |
| Alternative | GLM-5 (Reasoning) | 49.6 | $1.00 | $3.20 |
| Default | Claude 4.5 Haiku (Non-reasoning) | 31.0 | $1.00 | $5.00 |
| Alternative | Grok 4 Fast (Reasoning) | 34.9 | $0.20 | $0.50 |
| Default | GPT-4.1 | 25.6 | $2.00 | $8.00 |
| Alternative | gpt-oss-120B (high) | 33.3 | $0.039 | $0.19 |
All prices are list prices per million tokens from each vendor. The MarginDash cost simulator lets you compare these models using your actual usage data.
In every pair, the alternative has dramatically lower list prices — and in most cases matches or beats the default on benchmarks. The o3-pro to DeepSeek V3.2 pair is the most dramatic: DeepSeek scores higher on Intelligence Index (41.6 vs 40.7) while charging $0.28/$0.42 vs $20.00/$80.00. GLM-5 scores 49.6 vs Claude Opus 4.6's 53.0 — slightly lower, but at $1.00/$3.20 vs $5.00/$25.00.
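If you want to run the same comparison on your own traffic, the arithmetic is simple. A sketch with a hypothetical monthly volume (the token counts below are assumptions; plug in yours):

```python
# Monthly list-price cost from input/output token volume.
def monthly_cost(in_tokens: float, out_tokens: float,
                 usd_per_1m_in: float, usd_per_1m_out: float) -> float:
    return in_tokens / 1e6 * usd_per_1m_in + out_tokens / 1e6 * usd_per_1m_out

# Assumed volume: 200M input + 50M output tokens per month.
usage = dict(in_tokens=200e6, out_tokens=50e6)

o3_pro = monthly_cost(**usage, usd_per_1m_in=20.00, usd_per_1m_out=80.00)
deepseek = monthly_cost(**usage, usd_per_1m_in=0.28, usd_per_1m_out=0.42)
print(f"o3-pro: ${o3_pro:,.0f}/mo   DeepSeek V3.2: ${deepseek:,.0f}/mo")
# o3-pro: $8,000/mo   DeepSeek V3.2: $77/mo
```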
Many enterprise teams stay on these models for compliance, security, or API stability reasons — not because they've compared the alternatives. That's the legacy tax.
## Before you swap everything
Benchmarks aren't the full picture. There are real reasons teams choose higher-priced models:
- API reliability and uptime. OpenAI and Anthropic have years of production API infrastructure. Newer providers may have less mature SLAs.
- Latency and reasoning overhead. Many of the value leaders in this list are reasoning models. They may have fast time-to-first-token, but total response time can be significantly longer because the model "thinks" before answering. For latency-sensitive applications like real-time chat, test actual end-to-end response times, not just benchmarks (see the timing sketch after this list).
- Task-specific performance. Intelligence Index measures general reasoning. Your customer support chatbot might perform differently than the benchmarks predict. Always test on your own data.
- Ecosystem and tooling. SDK support, function calling, structured outputs, and documentation vary by vendor.
- Data residency and compliance. Some vendors may not meet your regulatory requirements.
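On the latency point: a quick way to measure both numbers is to stream the response and record when the first token arrives. A sketch using the OpenAI Python SDK against an OpenAI-compatible endpoint (the base URL and model name are placeholders; many vendors in these tables expose a compatible API, but check your provider's docs):

```python
import time
from openai import OpenAI

# Placeholder endpoint and model; point these at the provider you're testing.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="candidate-model",
    messages=[{"role": "user", "content": "A real prompt from your workload"}],
    stream=True,
)
for chunk in stream:
    # Reasoning models may "think" for a while before the first visible token.
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()

total = time.perf_counter() - start
if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f}s")
print(f"total response time: {total:.2f}s")
```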
The point isn't that you should switch to MiMo or GLM-5 tomorrow. It's that you should know what you're paying for — and whether the premium is justified by your actual requirements.
## How we got these numbers
All pricing and Intelligence Index scores come from the MarginDash model database: 410 models across 43 vendors, synced daily from vendor pricing pages. Prices are list prices per million tokens — $/1M input and $/1M output — the industry-standard pricing unit published by each vendor.
All prices reflect standard real-time inference. Batch pricing, cached-input discounts, and volume agreements will shift the numbers — in some cases significantly.
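As a rough illustration of how much a cached-input discount alone can move the effective price (the hit rate and discount below are hypothetical; actual terms vary by vendor):

```python
# Effective input price under a cached-input discount (all figures assumed).
list_in = 1.25          # $/1M input tokens, example list price
cache_hit_rate = 0.50   # fraction of input tokens served from cache
cached_discount = 0.90  # discount applied to cached input tokens

effective_in = (list_in * (1 - cache_hit_rate)
                + list_in * cache_hit_rate * (1 - cached_discount))
print(f"effective input price: ${effective_in:.3f}/1M")  # $0.688/1M vs $1.25 list
```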
You can explore all 410 models, filter by vendor, and run your own cost comparisons inside MarginDash — sign up free to access the model database and cost simulator.