
Inside Gemma 4: What Google's Open Models Can Do

April 3, 2026 · 6 min read · 1,194 words
Google DeepMind · Open Source · Machine Learning · Generative AI
Image: Sam Witteveen presenting Gemma 4 with model architecture details on screen. Screenshot from YouTube.

Key insights

  • The Apache 2.0 license may matter more than the models themselves. Previous Gemma versions had restrictions that pushed developers toward Llama or Qwen. That barrier is gone.
  • The MoE workstation model delivers roughly 27B-class intelligence at the compute cost of a 4B model. 26B total parameters, but only 3.8B active at any one time.
  • Built-in multimodal capabilities mean no more bolting on Whisper for audio or wiring up separate vision tools. Everything is native, from the architecture up.
  • The E2B edge model runs on a T4 GPU with 128K context, vision, audio, and thinking. Real AI on small hardware is becoming practical.
Source: YouTube · Published April 2, 2026
Host: Sam Witteveen

This is an AI-generated summary. The source video may include demos, visuals and additional context.


In Brief

Google released Gemma 4 on April 2: four new open models with native reasoning, vision, audio, and function calling. Sam Witteveen, co-founder of Red Dragon AI and a Google Developer Expert in machine learning, walked through the architecture and tested the smallest model live. The story has two layers: impressive technical specs, and a licensing change that removes the last excuse for not using Google's open models commercially.

The license is the headline

Multimodality and thinking are expected at this point. What caught Witteveen's attention immediately was something more mundane-sounding: the license.

"This is an actual real Apache 2.0 license, which means for the first time you can take Google's best open model, modify it, fine-tune it, deploy it commercially, do whatever you want with it. No strings attached."

Apache 2.0 is the kind of permission slip that lawyers and developers both love. It says: use this however you want, in any product, for any profit. Previous Gemma versions came with custom licenses that included "open weights but don't compete with us" clauses. Witteveen noted that many developers chose Llama or Qwen simply because of those restrictions. That calculation just changed.

The timing is not accidental. Some Chinese open model providers have recently pulled back their latest releases and stopped making them open. Google is moving in the opposite direction.

Two tiers, four models

Gemma 4 comes in two families, built for different hardware realities.

Workstation models are designed for developer machines, servers, and coding assistants:

  • Gemma 4 MoE (26B total): 26 billion total parameters, but only 3.8 billion active at any time. The architecture uses 128 tiny specialists internally, with 8 activated per token plus one shared expert that's always on. Context window: 256K tokens.
  • Gemma 4 31B (Dense): A traditional model where all parameters work on every response. Fewer layers than Gemma 3, but with architectural upgrades including value normalization and an improved attention mechanism. Also 256K context window.

Edge models are designed to run locally on small devices:

  • E2B and E4B: Small enough for phones, Raspberry Pis, and Jetson Nanos. Both support audio natively, which the workstation models do not. Context window: 128K tokens.

A context window is how much the model can "remember" in a single conversation. 128K tokens is roughly one full book. 256K is two. If you've ever chatted with an AI and noticed it starting to forget what you talked about, the context window was full.
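The "one book" comparison holds up to back-of-envelope arithmetic. A minimal sketch, assuming a typical novel runs about 90,000 words and a common rule of thumb of roughly 1.3 tokens per English word (both figures are my assumptions, not from the video):

```python
# Back-of-envelope check of the "one book per 128K tokens" claim.
# Assumed constants (not from the source): ~90,000 words per novel,
# ~1.3 tokens per word, a common heuristic for BPE-style tokenizers.
WORDS_PER_NOVEL = 90_000
TOKENS_PER_WORD = 1.3

def novels_that_fit(context_tokens: int) -> float:
    """Rough number of novels that fit in a given context window."""
    return context_tokens / (WORDS_PER_NOVEL * TOKENS_PER_WORD)

print(round(novels_that_fit(128_000), 2))  # about one novel in 128K
print(round(novels_that_fit(256_000), 2))  # about two novels in 256K
```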

The benchmarks for the 31B dense model are strong: 85.2% on MMLU (broad knowledge), 89.2% on AIME 2026 (math), and 80.0% on LiveCodeBench v6 (code generation).

What "mixture of experts" actually means

The MoE architecture deserves a proper explanation, because it is genuinely clever.

Imagine a school with 128 specialist teachers. Every time a student asks a question, the principal decides which 8 teachers are most relevant and routes the question to them. The other 120 stay quiet and consume no energy. The student still gets expert-level answers, but the school runs on a fraction of the cost.

That is roughly how the Gemma 4 MoE model works. 26 billion parameters exist in total, but only 3.8 billion are active for any given token. Witteveen put it plainly:

"Roughly this is giving you sort of the intelligence of a 27B model with the compute costs of something around a 4B model."

You can run this on a consumer graphics processing unit (GPU). Google is also releasing QAT (Quantization-Aware Training) checkpoints, which keep quality high even when the model is compressed to use less memory.
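QAT itself means training with the compression step in the loop so the model learns to tolerate it. A minimal sketch of the underlying round-trip, symmetric int8 quantization of a weight tensor, shows why the memory savings are so large and the error so bounded (the tensor and scheme here are illustrative; Google has not published Gemma 4's exact quantization recipe in this video):

```python
import numpy as np

# Minimal sketch of symmetric int8 quantization, the kind of compression
# QAT checkpoints are trained to tolerate. Illustrative only.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0              # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal(1024).astype(np.float32) * 0.02   # fake weight tensor
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes, w.nbytes)   # 1024 vs 4096 bytes: 4x smaller
print(err <= scale)         # reconstruction error bounded by one quantization step
```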

Everything built in, nothing bolted on

Before Gemma 4, building a local AI assistant that could listen, see, and use tools required assembling multiple systems. You would run a model for text, add Whisper for audio transcription, wire up a separate vision model, and then hope the whole thing stayed in sync.

Gemma 4 ships all of this natively. Four capabilities are built in from the architecture level:

Reasoning: Long chain-of-thought reasoning across text, images, and audio. You can switch it on or off per request with a single flag in the application programming interface (API) call.

Function calling: The model can invoke external tools mid-conversation. Earlier approaches trained models to follow instructions and hoped they would cooperate with the tool format. Gemma 4 has function calling baked in from scratch, optimized for multi-turn agentic flows with multiple tools running in sequence.
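A multi-turn agentic flow boils down to a dispatch loop: the model either emits a tool call or a final answer, and the application executes calls and feeds results back. The sketch below illustrates that loop shape only; the tool schema, the JSON format, and `fake_model` are hypothetical stand-ins, since real Gemma 4 client APIs define their own conventions:

```python
import json

# Illustrative shape of a multi-turn function-calling loop. The tool
# registry, JSON call format, and fake_model below are hypothetical
# stand-ins, not a real Gemma 4 API.
TOOLS = {
    "get_weather": lambda city: f"18C and clear in {city}",
}

def fake_model(messages):
    """Stand-in for a model turn: asks for a tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant",
                "tool_call": json.dumps({"name": "get_weather",
                                         "args": {"city": "Singapore"}})}
    return {"role": "assistant", "content": "It is 18C and clear in Singapore."}

messages = [{"role": "user", "content": "Weather in Singapore?"}]
turn = fake_model(messages)
while "tool_call" in turn:                     # dispatch until the model answers
    call = json.loads(turn["tool_call"])
    result = TOOLS[call["name"]](**call["args"])
    messages += [turn, {"role": "tool", "content": result}]
    turn = fake_model(messages)
print(turn["content"])
```

"Baked in from scratch" means the model is trained to emit well-formed calls in this kind of loop, rather than being coaxed into a format through prompting.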

Vision: Native support for multiple images in a single conversation, with proper handling of different aspect ratios. The new vision encoder understands image dimensions correctly, which makes it far more useful for document understanding and OCR.

Audio (edge models only): The E2B and E4B models include a built-in speech recognition encoder. The encoder is 50% smaller than in Gemma 3N, dropping from 681 million to 305 million parameters, and from 390 MB to 87 MB on disk. Frame duration improved from 160ms to 40ms, making transcription noticeably more responsive.
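Quick arithmetic on the figures quoted above (the parameter cut works out closer to 55%, consistent with the rounded "50%" in the talk):

```python
# Sanity-check arithmetic on the audio encoder figures quoted above.
params_g3n, params_g4 = 681e6, 305e6   # encoder parameters, Gemma 3N vs 4
size_g3n_mb, size_g4_mb = 390, 87      # on-disk size in MB
frame_g3n_ms, frame_g4_ms = 160, 40    # audio frame duration in ms

print(f"{1 - params_g4 / params_g3n:.0%} fewer parameters")      # ~55%
print(f"{size_g3n_mb / size_g4_mb:.1f}x smaller on disk")        # ~4.5x
print(f"{frame_g3n_ms // frame_g4_ms}x more frames per second")  # 4x
```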

All four capabilities sit on top of a multilingual base: 140 languages in pre-training, 35 languages in instruction fine-tuning.

What the smallest model can do

Witteveen tested the E2B, the smallest model in the family, running on a T4 GPU — consumer-level hardware.

First, he passed an image of a girl on a beach with a dog and asked what was happening. The model described the scene accurately: "The image captures a warm lovely moment between a person and a dog." It also worked quickly.

Then he fed in an audio file with two different speakers: a mix of two voices singing. The model transcribed both voices correctly, picking up each one separately. Witteveen noted he would not necessarily replace a dedicated speech recognition model with this, but for a pipeline where you want transcription and reasoning in a single model, it works well.

The most striking demo was speech-to-translation. He gave the model an English audio clip, defined Japanese as the target language, and asked it to transcribe and translate in one step. The model transcribed the English, then produced a Japanese translation. A check against Google Translate confirmed it was roughly correct. This is the E2B, the smallest model.

"Don't forget this is just the E2B model. This is a very small model."

The vision encoder for the edge models also shrank dramatically: from 300-350 million parameters in earlier models down to 150 million, making it faster as well as lighter.

Where you can run it

The models are available on Hugging Face and Google Cloud. Small edge models run on a T4 GPU. Larger workstation models without quantization need something in the H100 or RTX Pro 6000 range.

For serverless deployment, Google Cloud Run now supports a G4 GPU — an Nvidia RTX Pro 6000 with 96 GB of VRAM. That means you can serve the full-size workstation models in a serverless configuration that scales to zero when idle. No server running up a bill when nobody is using it.

Witteveen expects Ollama and LM Studio support to follow quickly. Base models are available alongside the instruction-tuned versions, which will interest developers who want to fine-tune for specific use cases.

Glossary

  • Mixture of Experts (MoE): An architecture with many internal specialist networks. Only a small subset activates for each piece of text processed. Delivers large-model quality at smaller-model compute cost.
  • Dense model: A traditional model where all parameters work simultaneously for every response. More thorough, but uses more compute than MoE.
  • Apache 2.0 license: An open license that permits any use, including commercial deployment and modification, with no restrictions.
  • Function calling: A feature where the model can invoke external tools or APIs mid-conversation. Built in from the architecture, not added through prompt tricks.
  • Context window: How much text the model can hold in one conversation. 128K tokens is roughly one full novel; 256K is roughly two.
  • Edge model: A small AI model designed to run locally on a device like a phone or Raspberry Pi, without sending data to the cloud.
  • QAT (Quantization-Aware Training): A technique that compresses a model to use less memory while preserving output quality. Think of it as packing a suitcase more efficiently without leaving anything behind.
  • ASR (Automatic Speech Recognition): Technology that converts spoken words into text. What runs when you dictate a message on your phone.
  • Multimodal: A model that processes more than just text. Gemma 4 handles text, images, and audio in a single model family.
  • Parameters: The numerical values that encode an AI model's knowledge. More parameters generally means more capability, but also more compute required.
