GotGemini

Google ships Gemma 4 12B for local multimodal agents

Gemma 4 12B is a dense encoder-free multimodal open model for laptops. It supports text, image, audio and video inputs with a 256K context window.

v1· gemma-4-12b / ai-studio v1· June 4, 2026

Google introduced Gemma 4 12B, a dense multimodal open model designed to run locally on laptops. It is the first mid-sized Gemma model with native audio input and uses a unified encoder-free architecture for image and audio inputs instead of separate multimodal encoders.

The model suits local agents, multimodal assistants, coding tools, transcription workflows, and edge applications that need open weights rather than a hosted Gemini API call.

Availability: Google says the pre-trained and instruction-tuned checkpoints are available from Hugging Face and Kaggle. The model is released under Apache 2.0, with ecosystem support listed for LM Studio, Ollama, Hugging Face Transformers, llama.cpp, MLX, vLLM, Unsloth, Gemini Enterprise Agent Platform Model Garden, Cloud Run and GKE.

```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"

# Load the processor (tokenizer plus input preprocessing) and the model weights.
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(MODEL_ID, device_map="auto")

# A single user turn that combines a text prompt with one image URL.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this image and list any visible text."},
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"},
        ],
    }
]

# Apply the chat template and tokenize in one step, then move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly produced tokens (everything after the prompt).
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=True))
```
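For repeated calls with different prompts, the `messages` structure from the example above can be wrapped in a small helper. This is a minimal sketch: `build_user_message` is a hypothetical name, not part of Transformers, and passing more than one image per turn is an assumption that should be checked against the model card.

```python
def build_user_message(prompt: str, image_urls: list[str]) -> list[dict]:
    """Return a one-turn `messages` list in the shape used by the example above.

    Hypothetical helper for illustration; it only builds plain Python data.
    """
    # Text part first, then one entry per image URL, mirroring the original example.
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image", "url": url} for url in image_urls]
    return [{"role": "user", "content": content}]


messages = build_user_message(
    "Describe what is in this image and list any visible text.",
    ["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"],
)
print(messages[0]["content"][1]["type"])
```

The returned list can be passed to `processor.apply_chat_template` exactly as in the original snippet.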

Gemma 4 12B is an open-weight model released under Apache 2.0, so there is no per-token Gemini API price for the weights themselves. Actual cost depends on where it runs: local hardware, Hugging Face/Kaggle notebooks, Ollama/LM Studio, vLLM infrastructure, Cloud Run, GKE, or Gemini Enterprise Agent Platform Model Garden.
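Because the cost comes from the runtime rather than a token price, one way to compare surfaces is to convert an hourly cost and a measured throughput into an effective per-token figure. The sketch below is only that arithmetic. The function name and both inputs are assumptions supplied by the reader; the example values are placeholders, not prices or benchmarks from Google.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Effective USD cost of generating one million tokens.

    hourly_cost_usd: what the chosen surface costs per hour (your own figure).
    tokens_per_second: sustained throughput you measured on that surface.
    """
    if hourly_cost_usd < 0:
        raise ValueError("hourly_cost_usd must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Placeholder inputs for illustration only.
example = cost_per_million_tokens(hourly_cost_usd=1.80, tokens_per_second=50.0)
print(f"{example:.2f} USD per million tokens")
```

The same function works for a laptop if the hourly figure is replaced by an estimate of power and hardware amortization.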

Limits Google disclosed: a 256K context window for the medium Gemma 4 models, including 12B; audio input of up to 30 seconds; and video input handled as sampled frames, which the Hugging Face card puts at up to 60 seconds at one frame per second. Google says the model is small enough for laptops with 16GB of VRAM or unified memory.
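These limits can be turned into a small pre-flight helper that runs before any input reaches the model. The sketch below uses only the figures quoted above. The function names are hypothetical, "12B" is taken at face value as 12 billion parameters, and the memory estimate covers the weights alone, not the KV cache or activations.

```python
import math

# Figures quoted in the article; treat them as reported limits, not guarantees.
MAX_AUDIO_SECONDS = 30   # audio input limit
MAX_VIDEO_SECONDS = 60   # Hugging Face card figure at one frame per second
NOMINAL_PARAMS = 12e9    # assumption: "12B" read as 12 billion parameters


def weights_gb(bits_per_param: int, params: float = NOMINAL_PARAMS) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


def audio_chunks(duration_s: float, limit_s: float = MAX_AUDIO_SECONDS) -> list[tuple[float, float]]:
    """Split a recording into consecutive (start, end) windows within the limit."""
    if duration_s <= 0:
        return []
    count = math.ceil(duration_s / limit_s)
    return [(i * limit_s, min((i + 1) * limit_s, duration_s)) for i in range(count)]


def frame_timestamps(duration_s: float) -> list[int]:
    """Whole-second timestamps of frames sampled at 1 fps, capped at the video limit."""
    usable = min(duration_s, MAX_VIDEO_SECONDS)
    return list(range(int(usable)))


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weights_gb(bits):.0f} GB")
print(audio_chunks(75))
print(len(frame_timestamps(90)))
```

Under these assumptions the weights alone come to roughly 24 GB at 16 bits per parameter, 12 GB at 8 bits and 6 GB at 4 bits, which suggests the 16GB figure presumes a reduced-precision build. The launch details quoted here do not state which precision Google had in mind.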

What it doesn't do (yet)

  • The model card describes Gemma 4 models as producing text output only; they do not generate images, audio or video.
  • Google did not publish a hosted Gemini API model ID for Gemma 4 12B in the launch post. This is primarily an open-weight/local and platform-deployment release.
  • Pricing for managed deployments is not in the launch post; Cloud Run, GKE, Hugging Face, Kaggle or Agent Platform costs need to be verified for the chosen surface before making any cost claim.
