Google released Gemma 4 this week with four model sizes: a 5B dense model, an 8B dense model, a 26B mixture-of-experts variant with 4B active parameters, and a 31B dense model that reportedly matches the Qwen 3.5 27B. Nathan Lambert at Interconnects wrote the most useful analysis of the release, and it barely touches benchmark scores. His argument is that what makes an open model succeed in practice comes down to five things: performance relative to size, country of origin, licensing terms, tooling support at release, and how easily the model can be fine-tuned. Raw benchmark numbers fall somewhere below all of them in the ranking that matters.
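The efficiency claim behind the MoE variant is that only a subset of experts fires per token, so the per-token compute tracks the active count, not the total. The arithmetic is easy to sketch; the expert count, routing top-k, and parameter split below are illustrative assumptions chosen to land on 26B/4B, not Gemma 4's published configuration:

```python
def moe_param_counts(shared, n_experts, expert_size, top_k):
    """Total vs. per-token active parameters for a mixture-of-experts model.

    shared      -- parameters active for every token (attention, embeddings, routers)
    n_experts   -- number of experts (aggregated across layers for simplicity)
    expert_size -- parameters per expert
    top_k       -- experts each token is routed to
    """
    total = shared + n_experts * expert_size
    active = shared + top_k * expert_size
    return total, active

# Hypothetical split: 2B shared, 96 experts of 250M each, top-8 routing
total, active = moe_param_counts(2e9, 96, 0.25e9, 8)
print(f"{total / 1e9:.0f}B total, {active / 1e9:.0f}B active")  # 26B total, 4B active
```

The point of the sketch is why the 26B model can be priced, latency-wise, closer to a dense ~4B model: inference cost scales with the active figure, while memory footprint still scales with the total.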
The licensing shift is the most concrete change in Gemma 4. Previous Gemma models shipped with restrictive terms that limited commercial use in ways that made enterprise adoption difficult. Gemma 4 uses Apache 2.0 across the board. That is not a minor detail. It removes the legal friction that has pushed organisations needing clean provenance and no usage caps toward Chinese open labs like Alibaba (Qwen) and DeepSeek. Google has been giving ground to those competitors on licensing for over a year, and this closes some of that gap.
Lambert’s most honest observation is about tooling: even after a well-received release, production-ready support in frameworks like vLLM or Hugging Face Transformers can lag by weeks or months. The 26B MoE architecture in particular will take time to stabilise across inference stacks. This is a recurring pattern with architecturally novel models. A model that scores well on MMLU but takes six weeks before you can run it reliably in a serving environment is not production-ready, regardless of what the leaderboard says. Lambert puts the gap at 1.5 months for complex architectures, which feels accurate based on how past releases have played out.
The fine-tunability question is the least resolved of the five factors, and Lambert acknowledges as much. There is no systematic way to measure how well a base model responds to instruction tuning or domain adaptation before you try it. The community will work this out over the coming months through sheer experimentation. If Gemma 4 fine-tunes well in the 26B–31B range, that is the tier where the interesting production use cases live: large enough to be capable, small enough to be economically sensible to run. That is where the real test happens.
If you are evaluating open models right now, Gemma 4 belongs on your list for the same reasons Lambert identifies: U.S. provenance, Apache 2.0, credible performance numbers, and a team at Google with the incentive to keep tooling maintained. The benchmark comparisons will resolve themselves. The factors that determine whether you can actually deploy it in six months are mostly already known.