
Qwen3-Max-Thinking is Alibaba's new flagship reasoning model. It does not only scale parameters; it also changes how inference is run, with explicit control over thinking depth and built-in tools for search, memory, and code execution.

Model scale, data, and deployment
Qwen3-Max-Thinking is a trillion-parameter MoE flagship LLM, pretrained on 36T tokens and positioned as the top-tier reasoning model in the Qwen3 family. The model targets long-horizon reasoning and code, not only casual chat. It runs with a context window of 262,144 tokens, which supports repository-scale code, long technical reports, and multi-document analysis within a single prompt.
Qwen3-Max-Thinking is a closed model served through Qwen-Chat and Alibaba Cloud Model Studio with an OpenAI-compatible HTTP API. The same endpoint can be called with a Claude-style tool schema, so existing Anthropic or Claude Code flows can swap in Qwen3-Max-Thinking with minimal changes. There are no public weights, so usage is API based, which matches its positioning as a hosted flagship service.
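To make access concrete, here is a minimal sketch of a chat call through the OpenAI-compatible endpoint using the openai Python SDK. The base URL follows Model Studio's documented compatible mode; the exact model id ("qwen3-max-thinking" below) is an assumption and should be checked against the Model Studio catalog.

```python
# Minimal sketch: Qwen3-Max-Thinking over the OpenAI-compatible HTTP API.
# The model id is an assumption; verify it in the Model Studio catalog.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3-max-thinking",  # assumed model id
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)
```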
Smart Test-Time Scaling and experience-cumulative reasoning
Most large language models improve reasoning through simple test-time scaling, for example best-of-N sampling with several parallel chains of thought. That approach raises quality, but cost grows roughly linearly with the number of samples. Qwen3-Max-Thinking instead introduces an experience-cumulative, multi-round test-time scaling strategy.
Instead of only sampling more in parallel, the model iterates within a single conversation, reusing intermediate reasoning traces as structured experience. After each round, it extracts useful partial conclusions, then focuses subsequent computation on unresolved parts of the question. This process is controlled by an explicit thinking budget that developers can adjust via API parameters such as enable_thinking and additional configuration fields.
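In request terms, that budget surfaces as extra parameters on the same endpoint. A hedged sketch, assuming enable_thinking is passed through the SDK's extra_body mechanism and that a thinking_budget field caps reasoning tokens (every field name here except enable_thinking is an assumption):

```python
# Sketch of per-request thinking control, reusing the client from above.
# enable_thinking is the parameter named in the article; thinking_budget is
# an assumed companion field and may be named differently in production.
stream = client.chat.completions.create(
    model="qwen3-max-thinking",      # assumed model id
    messages=[{"role": "user", "content": "How many primes are below 1000?"}],
    extra_body={
        "enable_thinking": True,     # turn explicit reasoning on
        "thinking_budget": 4096,     # assumed cap on reasoning tokens
    },
    stream=True,                     # stream so reasoning arrives incrementally
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Some OpenAI-compatible servers expose reasoning as a separate field;
    # fall back to the regular content otherwise.
    print(getattr(delta, "reasoning_content", None) or delta.content or "", end="")
```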
The reported effect is that accuracy rises without a proportional increase in token count. For example, Qwen's own ablations show GPQA Diamond improving from roughly 90 to about 92.8, and LiveCodeBench v6 rising from about 88.0 to 91.4 under the experience-cumulative strategy at similar token budgets. This matters because higher reasoning quality comes from more efficient scheduling of compute, not only from more samples.
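To make the control flow concrete, the sketch below approximates the multi-round idea from the client side: each round feeds the model its own partial conclusions and asks it to spend effort only on what remains open. This is a didactic approximation, not Qwen's internal mechanism, which runs inside a single conversation without client orchestration.

```python
# Didactic, client-side approximation of experience-cumulative rounds.
# Qwen3-Max-Thinking does this internally; this loop only mirrors the idea:
# keep partial conclusions, spend the next round on unresolved parts.
def cumulative_rounds(client, question: str, rounds: int = 3) -> str:
    experience = ""   # accumulated partial conclusions across rounds
    answer = ""
    for i in range(rounds):
        prompt = (
            f"Question: {question}\n"
            f"Established so far:\n{experience or '(nothing yet)'}\n"
            "List any new conclusions you are confident in, "
            "then give your current best answer."
        )
        reply = client.chat.completions.create(
            model="qwen3-max-thinking",   # assumed model id
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        experience += f"\n--- round {i + 1} ---\n{reply}"
        answer = reply
    return answer
```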
Native agent stack with Adaptive Tool Use
Qwen3-Max-Thinking integrates three tools as first-class capabilities: Search, Memory, and a Code Interpreter. Search connects to web retrieval so the model can fetch fresh pages, extract content, and ground its answers. Memory stores user- or session-specific state, which supports personalized reasoning across longer workflows. The Code Interpreter executes Python, which enables numeric verification, data transforms, and program synthesis with runtime checks.
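How the native tools are switched on depends on the final API surface. The sketch below assumes server-side toggles in the style of Model Studio's existing enable_search flag; the code-interpreter and memory switches are placeholder names, not confirmed parameters:

```python
# Hedged sketch: asking the hosted model to use its built-in tools.
# enable_search mirrors an existing Model Studio flag; the other two
# field names are illustrative placeholders, not confirmed parameters.
response = client.chat.completions.create(
    model="qwen3-max-thinking",            # assumed model id
    messages=[{"role": "user", "content": "Summarize this week's Qwen releases."}],
    extra_body={
        "enable_search": True,             # let the model browse the web
        "enable_code_interpreter": True,   # placeholder: run Python server-side
        "enable_memory": True,             # placeholder: persist session state
    },
)
print(response.choices[0].message.content)
```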
The model uses Adaptive Tool Use to decide when to invoke these tools during a conversation. Tool calls are interleaved with internal thinking segments, rather than being orchestrated by an external agent. This design reduces the need for separate routers or planners and tends to reduce hallucinations, because the model can explicitly fetch missing information or verify calculations instead of guessing.
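For developer-defined tools, the same interleaving appears as an ordinary function-calling round trip over the OpenAI-compatible API. A minimal sketch with one hypothetical local tool (lookup_price is invented for illustration), reusing the client from the earlier sketch:

```python
import json

# Hypothetical local tool: name, schema, and behavior are ours, not Qwen's.
def lookup_price(symbol: str) -> str:
    prices = {"ACME": 41.2}   # stand-in for a real data source
    return json.dumps({"symbol": symbol, "price": prices.get(symbol)})

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_price",
        "description": "Return the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"symbol": {"type": "string"}},
            "required": ["symbol"],
        },
    },
}]

messages = [{"role": "user", "content": "Is ACME trading above 40?"}]
while True:
    msg = client.chat.completions.create(
        model="qwen3-max-thinking",   # assumed model id
        messages=messages,
        tools=tools,
    ).choices[0].message
    if not msg.tool_calls:            # model answered directly; done
        print(msg.content)
        break
    messages.append(msg)              # keep the assistant turn with its tool calls
    for call in msg.tool_calls:       # execute each requested tool locally
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": lookup_price(**args),
        })
```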
Tool use is also benchmarked: on Tau² Bench, which measures function calling and tool orchestration, Qwen3-Max-Thinking reports a score of 82.1, comparable with other frontier models in this category.
Benchmark profile across knowledge, reasoning, and search
Across 19 public benchmarks, Qwen3-Max-Thinking is positioned at or near the level of GPT 5.2 Thinking, Claude Opus 4.5, and Gemini 3 Pro. For knowledge tasks, reported scores include 85.7 on MMLU-Pro, 92.8 on MMLU-Redux, and 93.7 on C-Eval, where Qwen leads the group on Chinese-language evaluation.
For hard reasoning, it records 87.4 on GPQA, 98.0 on HMMT Feb 25, 94.7 on HMMT Nov 25, and 83.9 on IMOAnswerBench, which puts it in the top tier of current math and science models. On coding and software engineering, it reaches 85.9 on LiveCodeBench v6 and 75.3 on SWE-Bench Verified.
On Humanity's Last Exam (HLE) in the base configuration, Qwen3-Max-Thinking scores 30.2, below Gemini 3 Pro at 37.5 and GPT 5.2 Thinking at 35.5. In the tool-enabled HLE setup, the official comparison table that includes web search integration shows Qwen3-Max-Thinking at 49.8, ahead of GPT 5.2 Thinking at 45.5 and Gemini 3 Pro at 45.8. With its most aggressive experience-cumulative test-time scaling configuration on HLE with tools, Qwen3-Max-Thinking reaches 58.3 while GPT 5.2 Thinking remains at 45.5, although that higher number comes from a heavier inference mode than the standard comparison.
Key Takeaways
Qwen3-Max-Thinking is a closed, API-only flagship reasoning model from Alibaba, built on a backbone with more than one trillion parameters, trained on about 36 trillion tokens, with a 262,144-token context window.
The model introduces experience-cumulative test-time scaling, where it reuses intermediate reasoning across multiple rounds, improving benchmarks such as GPQA Diamond and LiveCodeBench v6 at similar token budgets.
Qwen3-Max-Thinking integrates Search, Memory, and a Code Interpreter as native tools and uses Adaptive Tool Use so the model itself decides when to browse, recall state, or execute Python during a conversation.
On public benchmarks it reports scores competitive with GPT 5.2 Thinking, Claude Opus 4.5, and Gemini 3 Pro, including strong results on MMLU-Pro, GPQA, HMMT, IMOAnswerBench, LiveCodeBench v6, SWE-Bench Verified, and Tau² Bench.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

