Feeding an entire codebase, a 900-page legal contract, or three hours of meeting transcripts into a language model used to come with a painful catch: the longer your prompt, the more your bill exploded — and the slower everything got. On June 1, 2026, Chinese AI lab MiniMax released a model that attacks that exact problem. MiniMax M3 offers a 1 million token context window while cutting per-token compute at full context to roughly 1/20th of what its predecessor needed. That is not a marginal optimization; it changes the economics of long-context AI for developers.

If you build agents, retrieval pipelines, or anything that stuffs large documents into prompts, the design decisions behind MiniMax M3 are worth understanding — because the technique it uses, sparse attention over real key-value blocks, is likely to show up everywhere over the next year.

What Is MiniMax M3?

MiniMax M3 is a large language model released by MiniMax in June 2026 that combines a 1 million token context window, strong coding and agentic performance, and native multimodal input (text, images, and video). Its headline innovation, MiniMax Sparse Attention, reduces per-token compute at 1M-token context to about 1/20th of the previous generation’s cost.

That definition covers the what; the interesting part is the how. MiniMax positioned M3 as the first open-weight model to combine three things that previously only lived in proprietary flagships: top-tier coding ability, a million-token context, and multimodality trained in from the start rather than bolted on later. According to The Decoder’s coverage of the MiniMax M3 launch, the company committed to publishing the model weights and a technical report on Hugging Face and GitHub shortly after the API launch.

M3 follows MiniMax’s M2 line, which earned a reputation as a cost-efficient workhorse for agentic coding. M3 keeps that positioning but raises the ceiling dramatically on how much context you can throw at it — and how cheaply.

Why a 1 Million Token Context Window Matters

A context window is the amount of text (measured in tokens, roughly three-quarters of a word each in English) that a model can consider at once. One million tokens translates to roughly 750,000 words: about eight novels, a mid-sized codebase, or thousands of pages of documentation in a single prompt.
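As a rough sanity check before sending a document, the three-quarters-of-a-word rule can be turned into a quick estimate. The sketch below is a heuristic only: the function names are illustrative, and a real tokenizer will give different counts, especially for code or non-English text.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count (heuristic only)."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def fits_in_window(text: str, window_tokens: int = 1_000_000) -> bool:
    """Return True if the estimated token count fits in the given window."""
    return estimate_tokens(text) <= window_tokens


sample = "word " * 750_000          # about 750,000 words
print(estimate_tokens(sample))      # 1000000
print(fits_in_window(sample))       # True
```

For anything close to the limit, count tokens with the provider’s own tokenizer rather than relying on this estimate.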

Long context changes what you can build without extra infrastructure:

  • Whole-repository coding agents. Instead of retrieving a handful of “relevant” files and hoping the agent has enough context, you can load most of a project and let the model see actual cross-file dependencies.
  • Retrieval-light document analysis. For contracts, research papers, or compliance documents, you can often skip chunking and vector databases entirely and just include the full source material.
  • Long-horizon agent sessions. Agents that run for hours accumulate enormous histories of tool calls and observations. A bigger window means fewer lossy summarization steps mid-task.

Here is the catch that made million-token windows impractical until recently: standard transformer attention compares every token with every other token. Double the context and the attention work roughly quadruples. At 1M tokens, that quadratic blowup makes both latency and cost brutal — which is why most “long context” models were technically capable but economically unusable at full length. MiniMax M3 exists to fix the economics, not just the capability.
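To make the quadratic growth concrete, here is a small sketch that counts query-key comparisons in dense self-attention. It is a simplification: it ignores causal masking (which roughly halves the count) and everything else in the model, and the function name is illustrative.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key score computations in full (dense) self-attention."""
    return num_tokens * num_tokens


for n in (250_000, 500_000, 1_000_000):
    ratio = attention_pairs(n) / attention_pairs(250_000)
    print(f"{n:>9,} tokens -> {attention_pairs(n):.2e} pairs ({ratio:.0f}x the 250K cost)")
```

Each doubling of the context multiplies the pair count by four, so going from 250K to 1M tokens means sixteen times the attention work.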

MiniMax Sparse Attention: How M3 Cuts Compute to 1/20th

The core of M3 is a mechanism called MiniMax Sparse Attention (MSA). Instead of computing attention scores between every pair of tokens, MSA selects the blocks of context that actually matter for the current token and attends only to those.

Block selection over real key-values

MSA works at the level of KV blocks — contiguous chunks of the key-value cache that the model builds as it reads your prompt. For each new token, the model scores which blocks are likely to be relevant and runs full attention only against the selected ones. Two design choices stand out:

  • No compression. MSA selects among real, uncompressed key-values rather than summarizing old context into lossy compressed states. Whatever the model attends to, it sees at full fidelity. This contrasts with approaches that shrink the KV cache and accept some information loss.
  • Layered on Grouped-Query Attention. MSA sits on top of a standard GQA backbone, so it composes with the serving optimizations inference providers already use. If you want a refresher on how baseline attention works, the Wikipedia overview of attention in machine learning is a solid grounding.
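The general idea of block selection over real key-values can be sketched in a few lines of plain Python. This is not MiniMax’s implementation: the article does not describe how MSA scores blocks, so scoring each block by the query’s dot product with the block’s mean key is an assumption made purely for illustration, as are all names below.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend to the top_k highest-scoring KV blocks only (illustrative sketch)."""
    # Split the KV cache into contiguous blocks of token indices.
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    # ASSUMPTION: score a block by the query's dot product with its mean key.
    def block_score(block):
        dim = len(query)
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Full softmax attention, but only over the real (uncompressed) keys and
    # values that live inside the selected blocks.
    token_ids = [i for b in selected for i in blocks[b]]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim_v = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
              for d in range(dim_v)]
    return output, selected


# Toy cache: 8 tokens, 2-dim keys; block 2 (tokens 4-5) aligns with the query.
keys = [[0.0, 1.0]] * 4 + [[5.0, 0.0]] * 2 + [[0.0, -1.0]] * 2
values = [[float(i)] for i in range(8)]
output, selected = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1)
print(selected)  # [2]
print(output)    # [4.5]
```

With `top_k` equal to the number of blocks, the function reduces to ordinary dense attention, which is a convenient way to measure how much a given selection rule changes the output.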

The performance numbers

MiniMax reports three headline efficiency figures at 1M-token context, all relative to its previous generation:

  • Per-token compute reduced to roughly 1/20th
  • Prefill (reading your prompt) more than 9x faster
  • Decoding (generating the response) more than 15x faster

A useful analogy: full attention is like rereading an entire textbook every time you answer one exam question. Sparse attention is flipping straight to the three chapters you bookmarked because you know the answer lives there. You do dramatically less work, and as long as your bookmarking is good, your answers stay just as accurate. The open question for any sparse method — and the thing to verify in your own testing — is how often the block selector misses a chapter it actually needed.

The big idea is not that M3 can handle 1M tokens — several models can. It is that M3 makes 1M tokens cheap enough and fast enough to use routinely instead of as a party trick.

MiniMax M3 Benchmarks: How It Stacks Up

MiniMax’s launch numbers place M3 in genuinely competitive territory with proprietary frontier models on agentic and coding tasks. Treat these as vendor-reported until independent evaluations accumulate, but the claims are specific enough to assess:

Benchmark | MiniMax M3 | Claude Opus 4.7 | GPT-5.5
SWE-Bench Pro (real-world software fixes) | 59.0% | 64.3% | 58.6%
BrowseComp (autonomous web research) | 83.5 | 79.3

Reading the table honestly: M3 edges out GPT-5.5 on SWE-Bench Pro while trailing Claude Opus 4.7, and it posts a strong autonomous web-research score. For an open-weight release priced at a fraction of either competitor, sitting between two proprietary flagships on real-world coding is the story.

MiniMax also published long-horizon demonstrations: M3 reproduced a research paper’s results autonomously over a 12-hour session (scoring 0.650 on the reproduction metric) and iterated on GPU kernel optimization for 24 hours, reaching 71.3% hardware utilization after 147 attempts. These showcase exactly the workload the million-token window enables — agents that keep their full working history in context across very long sessions.

One honest caveat from the launch materials: M3 scores under 12% on ARC-AGI-2, an abstract reasoning benchmark. It is a coding and agentic specialist, not a general-reasoning record-setter.

MiniMax M3 Pricing: What 1/20th the Compute Means for Your Bill

Architecture efficiency only matters to you if it shows up in the per-token price. It does. At launch, M3 is available with promotional pricing of $0.30 per million input tokens and $1.20 per million output tokens (standard rates: $0.60 and $2.40) through the MiniMax API and aggregators like OpenRouter’s MiniMax M3 listing.

For comparison, Claude Opus 4.7 charges $5.00 per million input tokens and $25.00 per million output tokens, making M3 roughly 17x cheaper on input and 21x cheaper on output at promotional rates. A concrete scenario: analyzing a 500,000-token codebase with a 5,000-token response costs about $0.16 on M3’s promo pricing versus roughly $2.63 on Opus 4.7. Run that analysis a hundred times a day in an agent loop and the gap is about $247 a day, or roughly $90,000 a year, which is in the range of an engineer’s salary.
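The arithmetic behind that scenario is easy to reproduce. The sketch below uses only the per-million rates quoted in this section; the function name is illustrative, and it ignores any long-context tier.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Cost in USD given token counts and per-million-token rates."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m


# Rates quoted in this article (USD per million tokens).
m3_promo = request_cost(500_000, 5_000, input_per_m=0.30, output_per_m=1.20)
opus = request_cost(500_000, 5_000, input_per_m=5.00, output_per_m=25.00)

print(f"M3 promo:  ${m3_promo:.3f}")    # $0.156
print(f"Opus 4.7:  ${opus:.3f}")        # $2.625
print(f"Gap over 100 runs: ${(opus - m3_promo) * 100:.2f}")
```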

Two pricing details to plan around:

  • The listed per-token rates apply to prompts up to 512K tokens; longer contexts are billed at a higher tier, which is common practice for long-context models.
  • The 512K context is the guaranteed minimum across providers; the full 1M window depends on which provider and tier you use.

Getting Started with the MiniMax M3 API

M3 is exposed through OpenAI-compatible endpoints, so if you have ever called a chat completions API, you already know the workflow. Here is a minimal Python example using OpenRouter:

import os

from openai import OpenAI

# OpenRouter exposes MiniMax M3 via an OpenAI-compatible API.
# Read the key from the environment rather than hardcoding it in source.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Load a large document — M3 can take up to 1M tokens of context
with open("entire_codebase_dump.txt", "r", encoding="utf-8") as f:
    codebase = f.read()

response = client.chat.completions.create(
    model="minimax/minimax-m3",
    messages=[
        {
            "role": "system",
            "content": "You are a senior code reviewer. Be specific and cite file names.",
        },
        {
            "role": "user",
            # The full codebase goes directly into context — no chunking needed
            "content": f"Review this codebase for security issues:\n\n{codebase}",
        },
    ],
    max_tokens=4000,
)

print(response.choices[0].message.content)

This script reads a large file (a concatenated codebase dump, for example), sends it in a single request, and asks for a security review. The point of the example is what is missing: no chunking logic, no embedding pipeline, no retrieval step. With a million-token window, “context engineering” for many tasks collapses to “put the whole thing in the prompt.”
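If you do not already have a concatenated dump, one way to produce it is sketched below. The suffix list, skipped directories, header format, and function name are all illustrative choices, not a convention that M3 or OpenRouter requires.

```python
from pathlib import Path

# Illustrative defaults; adjust to your project.
SOURCE_SUFFIXES = {".py", ".js", ".ts", ".go", ".rs", ".java", ".md"}
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}


def build_codebase_dump(root: str, out_path: str) -> int:
    """Concatenate source files under root into one text file; return file count."""
    root_path = Path(root)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(root_path.rglob("*")):
            if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
                continue
            if any(part in SKIP_DIRS for part in path.relative_to(root_path).parts):
                continue
            # A header line per file lets the model cite file names in its review.
            out.write(f"\n===== FILE: {path.relative_to(root_path).as_posix()} =====\n")
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
            count += 1
    return count
```

Calling `build_codebase_dump(".", "entire_codebase_dump.txt")` from a project root would write a file in the shape the review script expects. Check the dump for secrets and credentials before sending it to any third-party API.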

Before you ship anything, estimate costs. A quick back-of-envelope helper:

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate a MiniMax M3 request cost at promotional rates."""
    INPUT_PER_M = 0.30   # USD per million input tokens (promo)
    OUTPUT_PER_M = 1.20  # USD per million output tokens (promo)
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# A full 1M-token prompt with a 4K-token answer, at the base promo rate
# (ignores the higher tier that applies above 512K tokens)
print(f"${estimate_cost(1_000_000, 4_000):.4f}")  # ≈ $0.3048

The function multiplies token counts by the per-million rates and sums them. At the base promotional rate, even a maxed-out million-token request works out to about thirty cents, which is what “1/20th the compute” looks like when it reaches your invoice. Treat that figure as a lower bound: contexts above 512K tokens bill at a higher tier, so check your provider’s current rate card.
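Because the long-context tier’s actual price is not given here, any tier-aware estimate has to take it as an input. The sketch below assumes the simplest possible scheme: once the prompt exceeds the threshold, the whole request is billed at a multiple of the base rates. The multiplier, the exact threshold value, and the billing rule are all hypothetical, so substitute your provider’s real numbers.

```python
def tiered_cost(input_tokens: int, output_tokens: int,
                input_per_m: float, output_per_m: float,
                long_context_multiplier: float,
                tier_threshold: int = 512_000) -> float:
    """Estimate cost when prompts above tier_threshold bill at a higher rate.

    ASSUMPTION: the whole request is billed at the higher tier once the prompt
    exceeds the threshold. Providers may price tiers differently.
    """
    factor = long_context_multiplier if input_tokens > tier_threshold else 1.0
    base = input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m
    return base * factor


# Hypothetical 2x long-context multiplier, standard (non-promo) rates.
print(f"${tiered_cost(400_000, 4_000, 0.60, 2.40, long_context_multiplier=2.0):.4f}")
print(f"${tiered_cost(800_000, 4_000, 0.60, 2.40, long_context_multiplier=2.0):.4f}")
```

Making the multiplier a required argument is deliberate: it forces whoever runs the estimate to look up the real tier price instead of inheriting a guessed default.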

Limitations and Mistakes to Avoid

M3 is impressive, but treating any launch-week model as a drop-in replacement for your current stack is how production incidents happen. Watch for these pitfalls:

  • Do not assume long context equals perfect recall. Sparse attention bets that the block selector finds the right context. For most tasks it does; for needle-in-a-haystack retrieval where a single sentence buried at token 700,000 decides the answer, test before trusting. Build a small evaluation set from your real documents.
  • Vendor benchmarks are vendor benchmarks. The SWE-Bench Pro and BrowseComp numbers come from MiniMax’s launch materials. Independent reproductions usually land close-ish but rarely identical. Pilot the model on your own workload.
  • Do not skip the cost-tier fine print. The attractive rate covers contexts up to 512K tokens. If your pipeline routinely sends 800K-token prompts, model your costs at the higher tier, not the headline price.
  • Mind data governance. MiniMax is a China-based lab. For regulated industries, review where API traffic is processed and whether self-hosting the open weights (once published on MiniMax’s Hugging Face organization) better fits your compliance requirements.
  • Promotional pricing ends. Architect your budget around the standard $0.60/$2.40 rates so the promo expiring is a pleasant memory, not a budget crisis.
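For the recall check in the first pitfall, a small needle-in-a-haystack harness is usually enough to get started. The sketch below only builds prompts and scores replies; the helper names, the needle text, and the substring-match scoring are illustrative assumptions, and the model call itself is left to you.

```python
def build_needle_prompt(filler_paragraphs, needle, depth, question):
    """Insert needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_paragraphs))
    paragraphs = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(paragraphs) + f"\n\nQuestion: {question}"


def recall_rate(answers, expected):
    """Fraction of model answers that contain the expected string."""
    hits = sum(expected.lower() in answer.lower() for answer in answers)
    return hits / len(answers)


filler = [f"Filler paragraph {i} from your real documents." for i in range(10)]
needle = "The deployment passphrase is heliotrope-42."
prompts = [
    build_needle_prompt(filler, needle, depth, "What is the deployment passphrase?")
    for depth in (0.0, 0.25, 0.5, 0.7, 1.0)
]
# Send each prompt to the model, collect the replies, then score them:
# recall_rate(replies, "heliotrope-42")
```

In a real evaluation, replace the filler with paragraphs from your own documents, scale the prompt up to the context lengths you plan to use, and compare recall across depths to see whether deep positions degrade.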

Frequently Asked Questions About MiniMax M3

What is the context window of MiniMax M3?

MiniMax M3 supports up to 1 million tokens of context, with 512K tokens guaranteed across serving providers. One million tokens is roughly 750,000 English words — enough for a large codebase or several books in a single prompt.

How does MiniMax M3 achieve 1/20th the compute cost?

Through MiniMax Sparse Attention, which selects the most relevant blocks of the key-value cache for each token instead of attending to every token pair. Because it skips most of the quadratic attention work while keeping selected context uncompressed, per-token compute at 1M context drops to about 1/20th of the previous generation, with prefill more than 9x faster and decoding more than 15x faster.

Is MiniMax M3 open source?

It is open weight: MiniMax committed to publishing the model weights and technical report on Hugging Face and GitHub shortly after the June 1, 2026 API launch. Open weight means you can download and self-host the model, though “open source” in the strict licensing sense depends on the final license terms — check them before commercial self-hosting.

How much does the MiniMax M3 API cost?

Launch promotional pricing is $0.30 per million input tokens and $1.20 per million output tokens, with standard rates of $0.60 and $2.40. That applies to contexts up to 512K tokens; longer prompts bill at a higher tier. Even at standard rates, M3 undercuts Claude Opus 4.7 ($5.00 and $25.00) by roughly 8x on input and 10x on output.

Is MiniMax M3 better than GPT-5.5 or Claude Opus 4.7?

On vendor-reported numbers, M3 slightly beats GPT-5.5 on SWE-Bench Pro (59.0% vs 58.6%) and trails Claude Opus 4.7 (64.3%), while leading on the BrowseComp web-research benchmark. “Better” depends on your task: for cost-sensitive, long-context agentic work, M3 is extremely competitive; for abstract reasoning (it scores under 12% on ARC-AGI-2) or maximum coding accuracy regardless of price, proprietary flagships still hold the edge.

Can MiniMax M3 process images and video?

Yes. M3 accepts text, image, and video inputs and produces text output. MiniMax trained it on mixed-modality data from the start rather than retrofitting vision onto a text model, which generally yields stronger cross-modal understanding.
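On OpenAI-compatible endpoints, images are typically passed as `image_url` content parts inside a chat message. The sketch below only builds such a message; whether a given provider accepts this format for M3, and how video input is passed, are assumptions to verify against that provider’s documentation. The helper name is illustrative.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build an OpenAI-style user message with text plus one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for a real image file read in binary mode.
message = image_message("Describe this diagram.", b"\x89PNG placeholder bytes")
print(message["content"][1]["image_url"]["url"][:22])  # data:image/png;base64,
```

The resulting dictionary can be placed in the `messages` list of a chat completions request in place of a plain-text user message.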

Conclusion

MiniMax M3 matters less for any single benchmark score and more for the constraint it removes. A 1 million token context window at 1/20th the compute cost turns long-context AI from a premium feature into a default tool — one where loading an entire repository, contract archive, or day-long agent history into a prompt costs cents, not dollars.

The key takeaways: MiniMax Sparse Attention delivers the efficiency by attending only to selected, uncompressed KV blocks; vendor benchmarks place M3 between GPT-5.5 and Claude Opus 4.7 on real-world coding while costing a small fraction of either; and the model ships with native multimodality and a planned open-weight release. The sensible next step is a low-stakes pilot — point the OpenAI-compatible API at one of your real long-context workloads, run your own recall and quality checks, and let your own numbers, not launch-week headlines, decide whether MiniMax M3 earns a place in your stack.