Something about Qwen3.6 stuck with me when I read through the release notes: it has a switch. Not a metaphorical one. A literal toggle between thinking mode and non-thinking mode. You decide, per conversation, whether you want the model to reason through a problem or just respond.
That sounds like a minor feature. It’s not.
Most local AI models have a single gear. You give them a prompt, they generate tokens, done. Qwen3.6 introduces a hybrid reasoning mode that works through complex problems step by step, then carries that reasoning trace into the next turn if you want it to. Or you can turn it off entirely for faster, more conversational responses.
The mechanics are cleaner than they sound. Thinking mode is for hard problems: code, math, reasoning chains where you want the model to show its work before committing to an answer. Non-thinking mode is for everything else, where speed matters more than depth. And there’s a third option called “Preserve Thinking” that keeps the reasoning trace alive across the entire conversation, so the model builds on what it already reasoned through rather than starting cold each turn. In practical terms, that means fewer tokens spent re-deriving context — reportedly around a 40% reduction in token usage on complex agentic workflows, with no measurable accuracy loss.
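If you want to see what the toggle looks like in code, here’s a minimal sketch. It assumes Qwen3.6 keeps the interface earlier Qwen3 releases used in Hugging Face transformers, where the chat template accepts an `enable_thinking` flag; the model id below is a placeholder, not a confirmed repo name.

```python
# Minimal sketch, assuming Qwen3.6 keeps the enable_thinking flag that
# earlier Qwen3 releases expose through the transformers chat template.
# The model id is a placeholder, not a confirmed repo name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.6-27B"  # hypothetical repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Why does this recursion blow the stack?"}]

# Thinking mode: the template inserts the reasoning scaffold before the answer.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # flip to False for the fast, non-thinking path
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```

Same messages, same model; the only thing that changes per turn is that one flag.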
The architecture behind this is worth a brief look. Qwen3.6 combines linear attention with sparse mixture-of-experts routing. What that means in practice: it retains context more efficiently than standard attention models, which is why you get 256K context without the usual degradation at longer sequences. The thinking toggle sits on top of this architecture rather than requiring a separate model endpoint. One model, one download, two modes. That keeps the implementation clean in a way that matters when you’re running this locally.
I’ve been setting up local models on the Mac Mini M4 Pro, and the 27B variant is the one worth paying attention to here. At 4-bit quantization it weighs 18GB, which fits comfortably in 24GB of unified memory. On the M4 Max, the closest chip to the M4 Pro with published numbers, Q4_K_M quantization runs at around 16 tokens per second. Fast enough for real work. One practical note if you’re pulling GGUFs manually: use Q4_K_M, not IQ4_XS. There’s a known llama.cpp/Metal regression that drops IQ4_XS performance to around 5 tokens per second on Apple Silicon. Q4_K_M sidesteps it entirely.
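For the setup itself, here’s roughly what pulling the Q4_K_M quant and loading it looks like with `huggingface_hub` and `llama-cpp-python` (which wraps llama.cpp with Metal support). The repo id and filename are placeholders; substitute whatever GGUF repo you actually end up using.

```python
# Sketch of pulling a Q4_K_M GGUF and loading it with llama-cpp-python.
# Repo id and filename are placeholders, not confirmed names.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="someone/Qwen3.6-27B-GGUF",   # hypothetical quant repo
    filename="qwen3.6-27b-Q4_K_M.gguf",   # Q4_K_M, not IQ4_XS
)

llm = Llama(
    model_path=gguf_path,
    n_gpu_layers=-1,  # offload every layer to Metal on Apple Silicon
    n_ctx=32768,      # raise toward 256K only if you have the memory headroom
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what Q4_K_M trades away."}],
    max_tokens=256,
)
print(reply["choices"][0]["message"]["content"])
```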
The benchmark numbers are also unusual for a 27B model. On SWE-bench Verified, a coding benchmark built around solving real GitHub issues, Qwen3.6-27B scores 77.2%. That matches or beats Qwen’s own 397B-parameter model on major agentic coding benchmarks, despite having 14 times fewer total parameters. The architecture is doing real work there, not just compressing the same capability into a smaller shell.
One honest caveat: Qwen3.6 doesn’t run in Ollama yet. The multimodal components use separate projection files that Ollama’s current architecture doesn’t handle. You’ll need llama.cpp or Unsloth Studio. Unsloth Studio installs with a single curl command and auto-configures inference parameters when you select the model. MLX quants are also available for Apple Silicon if you want a more native Mac experience, though llama.cpp with Metal support is fast enough that the practical difference is small.
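Once either backend is serving the model, talking to it is ordinary OpenAI-compatible traffic; llama.cpp’s `llama-server` exposes `/v1/chat/completions` out of the box, and I’d expect Unsloth Studio to do something similar, though I haven’t checked. The port and model label below are whatever your local server reports, not fixed values.

```python
# Talking to a locally served model over the OpenAI-compatible endpoint
# that llama-server exposes. Port and model label depend on your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # label is server-dependent, not a fixed id
    messages=[{"role": "user", "content": "Give me three edge cases for date parsing."}],
)
print(resp.choices[0].message.content)
```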
The question that interests me more than the setup: when do you actually want an AI to think? Not as a philosophical exercise. As a practical decision you make before sending each message. Thinking mode takes longer and burns more tokens. Non-thinking mode is faster but shallower. Most of the time the right answer is obvious in retrospect, but Qwen3.6 forces you to have an opinion about it upfront.
Most AI tools don’t ask you to think about how they think. This one does. You’re not just prompting — you’re choosing whether to engage the model’s reasoning machinery or route around it. That’s an unfamiliar position if you’re used to treating local models as fast autocomplete. For structured tasks with clear requirements, thinking mode earns its slower response time by reasoning through edge cases the model would otherwise skip. For quick Q&A, non-thinking mode is the right call. The sweet spot I’m most curious about is agentic workflows, where toggling thinking mode per subtask could cut token costs significantly without sacrificing quality on the steps that actually need careful reasoning.
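Here’s a rough sketch of what that per-subtask routing could look like, with a stub backend standing in for whichever client you actually use. The task categories and the `run_model` callable are hypothetical; how the mode flips underneath (the `enable_thinking` flag above, or a per-message directive) depends on your serving stack.

```python
# Hypothetical per-subtask router: only the steps that need careful reasoning
# pay the thinking-mode tax. `run_model` stands in for whatever client you use.
from typing import Callable

REASONING_TASKS = {"plan", "debug", "review_diff"}  # illustrative categories

def route(task_kind: str, prompt: str, run_model: Callable[[str, bool], str]) -> str:
    think = task_kind in REASONING_TASKS  # heavy steps get thinking mode
    return run_model(prompt, think)       # everything else stays fast

# Stub backend for illustration; swap in a real client from the sketches above.
def fake_backend(prompt: str, think: bool) -> str:
    return f"[thinking={'on' if think else 'off'}] {prompt[:40]}"

for kind, prompt in [("plan", "Outline the migration steps."),
                     ("format", "Reformat this JSON blob."),
                     ("debug", "Why does the retry loop never exit?")]:
    print(kind, "->", route(kind, prompt, fake_backend))
```

The routing rule is the point, not the stub: a plan or debugging step gets the slower, deeper pass, while formatting and boilerplate stay in the fast lane.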
Worth installing. Worth testing. Particularly if you have a 24GB Mac and you’ve been looking for a local model that doesn’t feel like a compromise.

