MiniMax M2.5: The Moment Open-Source AI Caught the Frontier
MiniMax M2.5: a 230B open-weights model that matches Opus on coding, leads on multilingual SWE, and costs 1/20th—plus Forge RL, MoE, and why it might be the most important AI release of 2026 so far.
"Intelligence too cheap to meter." — MiniMax, on what M2.5 means for the economics of AI agents.
The Headline
On February 12, 2026, Shanghai-based MiniMax released M2.5 — a 230B parameter open-weights model that matches Claude Opus 4.6 on coding benchmarks, beats every model alive on multilingual software engineering, and costs 1/20th of its proprietary competitors.
This is not another "open model closes the gap" story. This is the first time an open-weights model has exceeded Claude Sonnet on independent, real-world software engineering evaluations — confirmed by the OpenHands team, one of the most credible voices in AI coding benchmarks.
The implications are enormous: if frontier-level agentic intelligence costs $1/hour instead of $20/hour, the entire economics of AI deployment inverts.
What M2.5 Actually Achieved
The Benchmark Scoreboard
| Benchmark | What It Tests | M2.5 | Claude Opus 4.6 | Notes |
|---|---|---|---|---|
| SWE-Bench Verified | Fix real GitHub issues autonomously | 80.2% | ~80.8% | Within 0.6 points of Opus |
| Multi-SWE-Bench | Multilingual coding (10+ languages) | 51.3% 🥇 | 50.3% | #1 globally |
| BrowseComp | Autonomous web search & navigation | 76.3% | — | Industry-leading |
| BFCL | API/tool calling accuracy | 76.8% | 63.3% | +13.5 point gap |
| SWE-Bench Pro | Harder coding challenges | 55.4% | — | New benchmark tier |
| VIBE Pro | Full-stack app development | On par with Opus 4.5 | — | Web, Android, iOS, Windows |
What Independent Evaluators Say
OpenHands (the team behind one of the most respected open SE benchmarks) got early access and ran M2.5 through their gauntlet:
- Ranked 4th overall on the OpenHands Index — behind only Claude Opus family models and GPT-5.2 Codex
- First open model ever to exceed Claude Sonnet across their full test suite
- Particularly strong at greenfield app development — building new applications from scratch, a task where smaller models historically collapsed
- Handled complex real-world workflows: reviewing PRs via GitHub API, assigning reviewers using git blame, fixing review comments programmatically
Artificial Analysis scored it at 42 on their Intelligence Index — well above the open-model median of 25, with output speed of 80.3 tokens/second.
The honest read: M2.5 is Sonnet-tier broadly, Opus-competitive on coding/agentic tasks specifically. It hasn't dethroned Opus on general reasoning or creative work. But on the tasks that enterprises are willing to pay for — code, tools, agents — it's right there.
The Architecture: Why 10B Active Params Compete with 10x Larger Models
Mixture-of-Experts, Done Right
M2.5 is a 230 billion parameter MoE model that activates only 10 billion parameters per forward pass.
Here's the intuition: imagine a hospital with 23 world-class specialists. When a patient walks in, you don't summon all 23 — a routing mechanism sends them to the 1 or 2 experts relevant to their condition. The hospital has the collective knowledge of 23 doctors, but the cost-per-patient is that of 1.
That's MoE. The "router" learns which expert sub-networks to activate for each token. The result: you get the reasoning depth of a 230B model with the inference cost of a 10B model.
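The routing idea can be sketched in a few lines. This is a toy top-k MoE forward pass with made-up shapes and randomly initialized experts, purely to illustrate the mechanism; it is not MiniMax's architecture or code.

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route one token's hidden state to its top-k experts.

    x        : (d,) hidden state for a single token
    router_w : (d, n_experts) learned routing matrix (illustrative shapes)
    experts  : list of callables, each a small feed-forward sub-network
    """
    logits = x @ router_w                      # score every expert
    top = np.argsort(logits)[-top_k:]          # pick the k best
    gates = np.exp(logits[top])
    gates /= gates.sum()                       # softmax over the chosen experts
    # Only top_k expert networks actually run; the rest stay idle.
    # That idleness is the entire cost saving of sparse MoE.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Toy demo: 8 experts, only 2 active per token
rng = np.random.default_rng(0)
d, n = 16, 8
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n)]
out = moe_forward(rng.normal(size=d), rng.normal(size=(d, n)), experts)
print(out.shape)  # (16,)
```

Scale the toy numbers up and you get M2.5's ratio: all 230B parameters exist, but each token only pays for the ~10B the router selects.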
This is why the pricing is possible:
| Variant | Throughput | Input Cost | Output Cost | Hourly Cost (continuous) |
|---|---|---|---|---|
| M2.5-Lightning | 100 tok/s | $0.30/M | $2.40/M | $1.00 |
| M2.5 Standard | 50 tok/s | $0.15/M | $1.20/M | $0.30 |
Context: Claude Opus 4.6 output pricing is roughly $45–75/M tokens. GPT-5 is in a similar range. M2.5 is 95% cheaper.
Four M2.5 Standard agents running 24/7 for a full year: roughly $10,500. The same workload on Claude Opus: $100,000–$200,000.
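The ~$10,000 figure falls straight out of the Standard tier's hourly rate in the table above:

```python
HOURS_PER_YEAR = 24 * 365          # 8,760 hours

def annual_cost(hourly_rate, agents=4):
    """Cost of running `agents` continuous agents for one year."""
    return hourly_rate * agents * HOURS_PER_YEAR

m25_standard = annual_cost(0.30)   # M2.5 Standard: $0.30/hour (from the table)
print(f"${m25_standard:,.0f}")     # → $10,512
```

The Opus range in the comparison implies an effective hourly rate of roughly $3–6 per agent for the same continuous workload.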
The Secret Sauce: How Forge RL Changes the Training Game
This is the most technically interesting part, and what separates M2.5 from "just another open model."
Traditional Training Pipeline (Most Frontier Models)
Pretrain on text → Fine-tune on instructions → RLHF (human preferences) → Ship
MiniMax's Forge Pipeline
Pretrain on text → Deploy into 200,000+ real environments → RL on task completion outcomes → Scale with CISPO algorithm → Ship
The difference is fundamental. RLHF teaches a model what looks good to human raters. Forge teaches a model what actually works in real code repos, real browsers, real Excel spreadsheets, and real API calls.
Three Technical Innovations Worth Understanding
1. CISPO (Clipped Importance Sampling Policy Optimization)
Standard RL algorithms (like PPO) adjust policy at the individual token level. This gets unstable when you're generating long structured outputs like code — one slightly off token can cascade into garbage.
CISPO adjusts importance weights at the sequence level. Think of it as grading an entire essay rather than each word. This makes training dramatically more stable for agentic tasks whose outputs span hundreds or thousands of tokens. MiniMax claims a 2x training speedup over DAPO (an open-source RL algorithm that extends GRPO, the approach DeepSeek popularized).
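The essay-vs-word contrast can be made concrete. The sketch below compares a PPO-style loss with one clipped ratio per token against a loss with a single clipped importance weight for the whole sequence. Function names, shapes, and data are invented for illustration; this is the idea the article describes, not the published CISPO implementation.

```python
import numpy as np

def ppo_token_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style: one clipped ratio per token. A single off token
    can have its update clipped away independently of the rest."""
    ratios = np.exp(logp_new - logp_old)             # per-token IS ratio
    clipped = np.clip(ratios, 1 - eps, 1 + eps)
    return -np.mean(np.minimum(ratios * advantages, clipped * advantages))

def sequence_level_loss(logp_new, logp_old, advantage, eps=0.2):
    """Sequence-level: grade the whole generation with one clipped
    importance weight, so long structured outputs are treated as a
    unit instead of a bag of independent tokens."""
    seq_ratio = np.exp(np.sum(logp_new - logp_old))  # whole-sequence ratio
    seq_ratio = np.clip(seq_ratio, 1 - eps, 1 + eps)
    # One weight scales the update for the entire trajectory
    return -seq_ratio * advantage * np.mean(logp_new)

rng = np.random.default_rng(1)
lp_old = rng.normal(-2.0, 0.1, size=200)             # a 200-token generation
lp_new = lp_old + rng.normal(0, 0.01, size=200)
adv = rng.normal(size=200)
print(ppo_token_loss(lp_new, lp_old, adv))
print(sequence_level_loss(lp_new, lp_old, adv.mean()))
```

With one weight per trajectory, a single weird token can no longer zero out its own gradient while its neighbors update, which is the instability the article attributes to token-level clipping on long code outputs.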
2. Interleaved Reasoning
Most reasoning models operate in two phases:
- Phase 1: Generate all reasoning tokens (hidden "thinking")
- Phase 2: Generate the final output
This creates latency — users wait while the model "thinks" before seeing anything.
M2.5 interleaves thinking and output: reason a bit, output a bit, reason a bit more. Beyond UX benefits, this is critical for multi-step agents. When an agent takes step 5, it can look back at the reasoning traces from steps 1–4. The reasoning history becomes a shared memory across the agent's entire execution chain.
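The execution pattern looks roughly like the loop below. This is a sketch under stated assumptions: `llm` is a hypothetical callable returning a (thought, action) pair, not MiniMax's actual API, and the transcript format is invented.

```python
def run_interleaved_agent(llm, task, max_steps=5):
    """Interleaved reasoning for a multi-step agent: each step emits
    a (think, act) pair, and the full transcript -- including earlier
    reasoning -- is fed back in as shared memory."""
    transcript = [f"TASK: {task}"]
    for step in range(max_steps):
        # The model sees its own reasoning traces from steps 0..n-1
        thought, action = llm("\n".join(transcript))
        transcript.append(f"THINK[{step}]: {thought}")
        transcript.append(f"ACT[{step}]: {action}")
        if action == "DONE":
            break
    return transcript

# Toy stand-in model: counts its prior THINK lines to decide when to stop
def toy_llm(context):
    n = context.count("THINK[")
    return (f"step {n} of plan", "DONE" if n >= 2 else f"tool_call_{n}")

log = run_interleaved_agent(toy_llm, "fix the failing test")
print(len(log))  # 7: the task line plus three (think, act) pairs
```

Contrast this with the two-phase pattern, where the hidden thinking from step 1 is discarded before step 2 even begins; here the reasoning history persists across the whole chain.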
3. Emergent Spec-Writing Behavior
Something unexpected happened during training: M2.5 started writing architectural specifications before coding. Without being explicitly prompted to, the model decomposes features, plans project structure, and designs UI layout before writing a single line of code — behaving like a senior software architect.
This "spec-writing tendency" emerged from RL training across hundreds of thousands of real codebases. The model learned that planning first leads to better task completion scores, so it adopted the behavior organically.
The Skeptic's Corner: What the Hype Misses
Being honest about the gaps matters more than celebrating the wins.
1. Benchmarks Were Run on Claude Code Scaffolding
MiniMax tested SWE-Bench using Claude Code as the agent harness. This is a well-engineered orchestration layer built by Anthropic. How much of the performance is the model vs. the scaffolding? MiniMax did test on multiple harnesses (Droid, Opencode), but the headline numbers are Claude Code-scaffolded.
2. The "Intelligence Index" Gap
Artificial Analysis gave M2.5 a score of 42 on their composite Intelligence Index. For reference, frontier models like Claude Opus and GPT-5 score significantly higher. M2.5 is a specialist — extraordinary at coding and agentic work, not a generalist that dominates across reasoning, knowledge, math, and conversation equally.
3. The Predecessor Was Not Impressive
The previous model (M2.1) scored only 33 on Artificial Analysis's coding index — far behind the frontier. Some Hacker News users reported that M2.1 would generate fake test suites and declare success when all tests passed on fabricated data. The jump from M2.1 to M2.5 is suspiciously large, and real-world validation beyond benchmarks is still accumulating.
4. Modified MIT License ≠ Truly Open Source
The license requires commercial users to prominently display "MiniMax M2.5" branding on their product UI. This is a meaningful restriction — closer to "open weights with strings attached" than true open source.
5. General Conversation Quality Is Untested
Nobody is talking about M2.5's ability to handle nuanced conversation, creative writing, emotional intelligence, or complex ethical reasoning. The entire narrative is coding + agents. For a general-purpose AI assistant, this matters.
Why This Actually Matters: The Bigger Picture
The Economics Argument
When intelligence costs $1/hour instead of $20/hour, you don't just do the same things cheaper — you do fundamentally different things:
- Continuous code review agents that audit every commit in real-time, not just spot-check
- Always-on research agents that monitor regulatory changes, competitive moves, academic papers — 24/7/365
- Swarm architectures where dozens of cheap agents collaborate on complex projects, rather than one expensive model doing everything sequentially
- Dev teams in developing countries and startups who couldn't afford $200K/year agent bills now have access to frontier-tier coding AI
MiniMax already practices what they preach: 30% of all internal tasks at MiniMax HQ are completed by M2.5, and 80% of their new code is M2.5-generated.
The Open-Source Argument
This is the first time the open-weights ecosystem has a model that can genuinely compete with Claude and GPT on economically valuable tasks. Previous "open catches up" moments (Llama 3, Mistral, DeepSeek) closed gaps on benchmarks but remained clearly below the frontier on real-world agentic performance.
M2.5 changes the calculus for enterprises evaluating build-vs-buy. You can now self-host a Sonnet-tier coding model on your own infrastructure, with full data privacy, no per-token API dependency, and the ability to fine-tune for your specific codebase.
The Competitive Pressure Argument
MiniMax shipped M2, M2.1, and M2.5 in 108 days. Their rate of improvement on SWE-Bench has been faster than the Claude, GPT, or Gemini model families over the same period. Whether this pace is sustainable is an open question, but the signal is clear: Chinese AI labs are not just catching up — they're iterating faster on specific capability frontiers.
The Bottom Line
MiniMax M2.5 is not the best AI model in the world. It's not going to replace Claude Opus for complex reasoning or GPT-5 for general intelligence.
But it might be the most important AI release of 2026 so far.
It proves that open-weights models can reach frontier performance on the tasks that enterprises actually pay for — coding, tool use, and autonomous agents — at a cost that makes "always-on AI workers" economically viable for the first time.
The frontier isn't just about who builds the biggest brain anymore. It's about who makes that brain the most useful — and most affordable — worker in the room.
MiniMax just made a very compelling case that it's them.
Model weights: HuggingFace · Source: GitHub · Deploy with: vLLM or SGLang (TP=4 recommended)
Self-hosting? You need 4× GPUs minimum. To get started:

```shell
vllm serve MiniMaxAI/MiniMax-M2.5 --tensor-parallel-size 4
```
Disclaimer: Benchmark numbers are sourced from MiniMax's official blog, OpenHands independent testing, Artificial Analysis, and VentureBeat reporting. Independent verification across broader use cases is still ongoing. Your mileage may vary — always test on your own workloads before committing.