## Identity
## Benchmarks
| Benchmark | M2.7 | K2.5 | Verdict |
|---|---|---|---|
| SWE-Pro (hard real-world SW engineering) | 56.2% | — | M2.7 |
| SWE-Bench Verified (GitHub issue resolution) | ~55%* | 76.8% | K2.5 |
| LiveCodeBench (competitive programming) | — | 85.0% | K2.5 |
| Terminal-Bench 2 (complex engineering systems) | 57.0% | — | M2.7 |
| VIBE-Pro (end-to-end project delivery) | 55.6% | — | M2.7 |
| HLE with tools (Humanity's Last Exam) | — | 50.2% | K2.5 |
| BrowseComp (web navigation & search) | — | 78.4% | K2.5 |
| GDPval-AA (office productivity ELO) | 1495 | — | M2.7 |
| Skill Adherence (complex skill following, 40+ skills) | 97% | — | M2.7 |
| MMMU Pro (multimodal academic understanding) | N/A (text-only) | 78.5% | K2.5 |
| MathVision (visual math reasoning) | N/A (text-only) | 84.2% | K2.5 |
| AA Intelligence Index (Artificial Analysis composite) | 50 | 47 | M2.7 |
* M2.7 is text-only and not benchmarked on many vision/agentic tasks that K2.5 excels at. "—" means no published score. Scores from official reports and Artificial Analysis.
## Pricing
## Agentic Architecture
M2.7 focuses on self-evolving harness engineering — the model refines its own scaffolding, memory, skills, and tool-selection loops across 100+ autonomous improvement cycles. It's a single powerful agent that gets better at using its own environment.
K2.5 focuses on swarm coordination — spawning up to 100 parallel sub-agents, each with independent tool access, to decompose and conquer complex tasks. It's a coordinator that scales horizontally.
M2.7's 97% adherence across 40+ complex skills (each over 2,000 tokens) is the headline number: the model was trained to build and optimize its own agent harness. For harness engineering, the layer you care most about, M2.7 is purpose-built.
K2.5 handles 200–300 sequential tool calls without drift, which is impressive for long-horizon tasks, but doesn't report comparable skill-following metrics.
Agent Swarm's 100 parallel sub-agents, running up to 1,500 coordinated steps, deliver a 4.5× speedup on parallelizable tasks. For research, batch processing, and multi-source analysis, this is a game-changer.
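That 4.5× figure can be sanity-checked against Amdahl's law. This is an assumption on our part, not a breakdown either vendor publishes, but it shows what the number implies about the workloads:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Overall speedup when a fraction p of the work runs on n parallel workers."""
    return 1.0 / ((1.0 - p) + p / n)

# With n = 100 sub-agents, solving 1/((1-p) + p/100) = 4.5 for p gives the
# parallelizable fraction the reported speedup would imply:
p = (1 - 1 / 4.5) / (1 - 1 / 100)
print(round(p, 3))  # roughly 0.79, i.e. ~79% of the workload parallelizable
```

In other words, a 4.5× end-to-end speedup from 100 workers is consistent with roughly four-fifths of the task being fan-out-able and the rest stuck in the serial coordinator.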
M2.7 operates as a single-threaded agent loop. It is fast per token (~100 tok/s in the highspeed variant), but fundamentally serial.
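The architectural difference boils down to loop shape. A minimal sketch (hypothetical `run_tool` stands in for an LLM-plus-tools sub-agent; neither vendor's actual API is shown):

```python
import asyncio

async def run_tool(task: str) -> str:
    # Placeholder for a real tool call / sub-agent invocation.
    await asyncio.sleep(0.01)
    return f"result:{task}"

async def serial_agent(tasks: list[str]) -> list[str]:
    # M2.7-style loop: one tool call at a time, each result feeding the next step.
    return [await run_tool(t) for t in tasks]

async def swarm_agent(tasks: list[str]) -> list[str]:
    # K2.5-style fan-out: independent, stateless sub-agents run concurrently.
    return await asyncio.gather(*(run_tool(t) for t in tasks))

results = asyncio.run(swarm_agent(["a", "b", "c"]))
```

The serial loop can condition every call on the previous result; the swarm can't, which is why it shines on decomposable work (research, crawling, batch analysis) and not on tightly sequential engineering tasks.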
M2.7 is the first model to demonstrably participate in its own training loop — updating memory, building skills, running RL experiments, and refining its own harness. 30–50% of its development workflow was self-directed.
K2.5 doesn't claim self-improvement capabilities. Its swarm agents are disposable and stateless.
K2.5 has native vision (MoonViT, 400M params) trained on 15T mixed visual+text tokens. It processes images, video, PDFs, and does vision-grounded coding (Figma → React).
M2.7 is text-only. No image input, no video understanding. If your agents need to see, M2.7 can't help.
K2.5 is fully open-weight under a Modified MIT license. Deploy on your own infra with vLLM/SGLang, with full data sovereignty. Commercial use is free below the 100M MAU / $20M MRR thresholds.
M2.7 is proprietary API-only. You're locked into MiniMax's infrastructure. No self-hosting, no weight access.
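For the open-weight path, self-hosting is a one-command affair with vLLM's OpenAI-compatible server. A sketch only: the model ID and parallelism settings below are assumptions, so check Moonshot's model card for the actual repo name and hardware requirements (a 1T-param model needs a multi-GPU node at minimum):

```shell
pip install vllm

# Serve the checkpoint behind an OpenAI-compatible HTTP API.
# "moonshotai/Kimi-K2.5" is a placeholder model ID -- verify on Hugging Face.
vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 8 \
  --max-model-len 262144   # 256K context window
```

Any OpenAI-compatible client can then point at `http://localhost:8000/v1`, which is what makes the "full data sovereignty" claim practical rather than theoretical.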
## Strengths & Weaknesses
### MiniMax M2.7

#### Strengths
- Exceptional cost efficiency: 50× cheaper than Opus on input tokens, near-frontier on SWE-Pro
- 97% skill adherence makes it the most reliable harness backend available
- Self-evolving architecture — model improves its own scaffolding
- Best office productivity (ELO 1495) — Excel, PPT, Word editing
- Highspeed variant (~100 tok/s) for latency-sensitive agent loops
- Compatible with Claude Code, Cursor, Kilo Code, Roo Code as a backend
#### Weaknesses
- Text-only — zero vision capability, can't process images/video
- Proprietary & closed — no self-hosting, API lock-in
- 205K context is smaller than K2.5's 256K
- Very verbose output (~87M tokens on AA eval) — burns tokens
- Chinese censorship on politically sensitive topics
- Standard mode is slower at 48 tok/s
### Kimi K2.5

#### Strengths
- Native multimodal — image, video, PDF input via MoonViT
- Agent Swarm: 100 parallel sub-agents, 4.5× speedup on batch tasks
- Best open-source coding model (76.8% SWE-Bench, 85% LiveCodeBench)
- Open weights (Modified MIT) — full self-hosting with vLLM/SGLang
- Dominant on web navigation (BrowseComp 78.4%, beat GPT-5.2)
- 4 operational modes: Instant, Thinking, Agent, Agent Swarm
- Vision-to-code: Figma mockup → React/Vue components
#### Weaknesses
- 2× more expensive than M2.7 on blended token cost
- Slow median response time (29.2s vs ~4.6s for competitors)
- 1T params means self-hosting requires serious GPU infra
- English prose quality rated ~8.5/10 vs 9/10 for GPT
- Chinese censorship on political content
- Moonshot accused by Anthropic of training data scraping (Feb 2026)
- Weaker ecosystem/community presence in Western markets
## Verdict for AI Agent Builders

### The Bottom Line

#### Pick M2.7 When
You need a cheap, reliable harness backend that follows complex skill instructions faithfully. Ideal for coding agents, office automation pipelines, agent orchestration where the model is a cog in your harness — not the orchestrator itself. At $0.30/$1.20 per 1M tokens with near-Opus coding quality, it's the best bang-for-buck reasoning engine for serial agentic workflows. The self-evolving harness pattern is genuinely novel and aligns with the "harness is the moat" philosophy.
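To make the bang-for-buck claim concrete, here is the arithmetic at M2.7's listed rates. The traffic mix is a hypothetical example, not a published workload:

```python
# M2.7 list pricing from this comparison: $0.30 input / $1.20 output per 1M tokens.
INPUT_PER_M = 0.30
OUTPUT_PER_M = 1.20

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the per-million-token rates above."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# A hypothetical agent run consuming 800k input + 200k output tokens:
print(cost_usd(800_000, 200_000))  # 0.24 + 0.24 = $0.48
```

Under half a dollar for a million tokens of agent traffic is the whole argument: at that price, the verbosity weakness noted above costs cents, not dollars.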
#### Pick K2.5 When
You need multimodal perception + parallel execution. If your agents need to see (screenshots, documents, video), K2.5 is the only choice here. Agent Swarm unlocks massively parallel research, web crawling, and batch processing that a single-agent loop simply can't match. Open weights mean full data sovereignty for enterprise deployments. The vision-to-code pipeline (Figma → frontend) is production-ready. Best fit for autonomous research agents and multi-source analysis.