Gemma 3N
Google's Next-Generation Open Model

The most advanced lightweight multimodal AI model designed for mobile and edge devices

What is Gemma 3N?

Gemma 3N is Google's latest open multimodal AI model, optimized specifically for on-device applications. It natively accepts image, audio, video, and text input, produces text output, and is designed for real-time processing on smartphones, tablets, and edge devices.

E2B: 5B parameters → 2GB memory
E4B: 8B parameters → 3GB memory
140+ languages supported
LMArena Score: 1300+ (E4B)

🚀 Latest Release: Part of the Gemma family, with 160+ million downloads worldwide

Why Choose Gemma 3N?

Revolutionary technical advantages that make it the most advanced mobile AI model

🎯 Core Advantages

Native Multimodal

Natively accepts image, audio, video, and text input; outputs text

Two Model Sizes

E2B (5B→2GB) and E4B (8B→3GB) memory efficiency

Mobile-First

Runs on Android, TPUs, and edge devices (e.g., NVIDIA Jetson)

Mix-n-Match

Custom model slicing for specific hardware

Multimodal Support

Native support for image, audio, video, and text input, with text output. Process diverse content types seamlessly.

Efficient & Lightweight

Run 5B/8B parameter models with just 2GB/3GB memory. Optimized for mobile and edge devices.
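
As a rough illustration of the memory figures above, the resident footprint of the weights scales with effective parameter count and precision. This is a back-of-the-envelope sketch, not an official formula; the quoted 2GB/3GB figures also reflect PLE offloading and quantization choices.

```python
# Rough memory-footprint estimator (illustrative only; real usage
# depends on quantization, PLE caching, and runtime overhead).

def weight_memory_gb(effective_params_billions: float, bits_per_weight: int) -> float:
    """Approximate GB needed to hold the model weights alone."""
    bytes_total = effective_params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# E2B behaves like a ~2B-effective-parameter model; at 8-bit weights
# that lines up roughly with the 2GB figure quoted for the variant.
print(weight_memory_gb(2.0, 8))  # 2.0
print(weight_memory_gb(4.0, 4))  # 2.0 (hypothetical 4-bit case)
```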

Mobile-First Architecture

Supports Android, TPUs, and edge devices (e.g., NVIDIA Jetson), with performance optimized for mobile deployment.

Custom Model Slicing

Mix-n-Match architecture allows for flexible model configurations tailored to your specific needs.

High Accuracy

The E4B model scores over 1300 on LMArena, the first model under 10B parameters to do so. Superior performance in real-world tasks.

Fast Response

KV cache sharing supports streaming input with 2x faster prefill. Optimized for real-time applications.

Performance Benchmarks

Industry-leading performance metrics

  • 1303: LMArena Elo score (E4B model)
  • 50.1%: WMT24++ score (multilingual ChrF)
  • 1.5x: faster response on mobile devices
  • 60fps: video processing on Google Pixel

Core Technical Architecture

Revolutionary innovations that make Gemma 3N the most advanced mobile AI model

🧬 MatFormer Architecture

Matryoshka Transformer - like "Russian nesting dolls", one model contains multiple sub-models with nested inference capabilities.

Pre-extracted Models: E2B and E4B are available as standalone, ready-to-use models
Mix-n-Match: Customize model size by adjusting hidden dimensions and skipping layers
Elastic Inference: Dynamic switching between sub-model paths during deployment
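
The Mix-n-Match idea above can be sketched as a simple selector: given a device memory budget, pick the largest nested sub-model that fits. The configuration names and footprint numbers here are assumptions for illustration, not an official API.

```python
# Illustrative Mix-n-Match selector (names and numbers are assumptions):
# pick the largest nested sub-model that fits a device memory budget.

CONFIGS = [
    # (name, effective_params_billions, approx_memory_gb)
    ("E2B", 2.0, 2.0),
    ("custom-3B", 3.0, 2.5),   # hypothetical Mix-n-Match slice
    ("E4B", 4.0, 3.0),
]

def pick_config(budget_gb: float):
    """Return the largest configuration whose footprint fits the budget."""
    fitting = [c for c in CONFIGS if c[2] <= budget_gb]
    return max(fitting, key=lambda c: c[1]) if fitting else None

print(pick_config(2.6))  # ('custom-3B', 3.0, 2.5)
print(pick_config(3.5))  # ('E4B', 4.0, 3.0)
```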

📦 PLE Technology

Per-Layer Embeddings - independent loading for each layer embedding, dramatically reducing memory pressure.

Core Transformer: Only core weights loaded into VRAM
CPU Computing: Other computations performed on CPU
Memory Efficiency: Significant VRAM usage reduction

🧠 KV Cache Sharing

Context cache sharing designed for audio/video long sequence input scenarios, significantly accelerating first response time.

Prefill Stage: 2x performance improvement
Long Sequences: Optimized for audio/video processing
Streaming: Faster initial response times
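
The effect of the 2x prefill improvement on time-to-first-token can be estimated with back-of-the-envelope arithmetic. The prompt length and throughput numbers below are hypothetical; only the 2x speedup factor comes from the text above.

```python
# Illustrative time-to-first-token estimate: a 2x prefill speedup
# halves the prompt-processing portion of the latency.

def time_to_first_token(prompt_tokens: int, prefill_tok_per_s: float,
                        prefill_speedup: float = 2.0) -> float:
    """Seconds to finish prefill with the given speedup applied."""
    return prompt_tokens / (prefill_tok_per_s * prefill_speedup)

# Hypothetical: 4096-token audio/video prompt at 512 tok/s baseline.
print(time_to_first_token(4096, 512, prefill_speedup=1.0))  # 8.0
print(time_to_first_token(4096, 512))                       # 4.0
```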

🎵 Audio Intelligence

Built-in speech understanding capabilities based on USM (Universal Speech Model) with fine-grained context awareness.

ASR: Automatic Speech Recognition
AST: Automatic Speech Translation
Languages: 140+ languages with enhanced Japanese, German, Korean, Spanish, French
Token Rate: One token per 160ms for fine-grained processing
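
At one token per 160ms, audio sequence length grows linearly with clip duration. A quick sketch of that arithmetic:

```python
import math

# One audio token is emitted per 160 ms of input audio.
TOKEN_MS = 160

def audio_tokens(duration_s: float) -> int:
    """Number of audio tokens produced for a clip of the given length."""
    return math.ceil(duration_s * 1000 / TOKEN_MS)

print(audio_tokens(10))  # 63 tokens for a 10-second clip
print(audio_tokens(30))  # 188 tokens for 30 seconds
```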

👁️ Vision Processing: MobileNet-V5

Revolutionary MobileNet-V5-300M encoder designed for mobile-first vision understanding.

Native Resolutions

Low Resolution 256×256
Medium Resolution 512×512
High Resolution 768×768

Performance Improvements

Parameters: 46% reduction vs. SoViT
Memory: 4x less VRAM usage
Speed: 13x faster (with quantization)
Real-time: Up to 60fps on Google Pixel
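
Given the three native resolutions above, a preprocessor might choose the smallest one that covers the source image. This picker is an illustration of that idea, not the official Gemma 3N preprocessing logic.

```python
# MobileNet-V5 supports three native input resolutions (256, 512, 768).
# Illustrative picker: smallest native resolution covering the image.

NATIVE = (256, 512, 768)

def pick_resolution(width: int, height: int) -> int:
    """Smallest native resolution >= the image's longer side, else 768."""
    longest = max(width, height)
    for r in NATIVE:
        if longest <= r:
            return r
    return NATIVE[-1]  # anything larger gets downscaled to the top size

print(pick_resolution(300, 200))    # 512
print(pick_resolution(1920, 1080))  # 768
```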

Download & Deploy

Choose the Gemma 3N model version that fits your project needs

Google AI Studio

Try Online

No setup required
Instant access
Web interface
Try Now

HuggingFace

Model Hub

Complete model files
Multiple formats
Pre-trained weights
Download

Ollama

Local Deploy

One-click install
Local execution
CLI interface
ollama pull gemma3n
Install Guide

Kaggle

Data Science

Research platform
Notebook ready
Community driven
Explore

Model Specifications

Gemma 3N E2B

5B
Parameters
2GB
Memory Usage
30
Layers (6 blocks)
4x
FFN Multiplier

Gemma 3N E4B

8B
Parameters
3GB
Memory Usage
35
Layers (7 blocks)
8x
FFN Multiplier

Platform Support

Hugging Face, llama.cpp, Ollama, MLX, Unsloth, Android, TPU, Jetson

🧬 Core Technology Architecture

A deeper look at the core technologies behind Gemma 3N

🏆 Performance Leadership

Figure: LMArena Elo score comparison (Gemma 3N E4B: 1303)

Gemma 3N E4B scores 1303 on LMArena, the first model under 10B parameters to cross the 1300 mark, competing with much larger proprietary models like Gemini 1.5 Pro and GPT-4.1-nano.

🔧 MatFormer: Matryoshka Transformer

Figure: MatFormer architecture diagram showing flexible depth and FFN structure

🪆 Russian Doll Architecture

  • Nested Model Structure: One model contains multiple sub-models with different complexities
  • Flexible Depth: E2B (6 blocks) and E4B (7 blocks) with adaptive FFN multipliers
  • Dynamic Switching: Future support for elastic inference with runtime model selection

🎛️ Mix-n-Match Performance Optimization

📊 Scalable Performance

  • Custom Configuration: Adjust layers and parameters for specific hardware
  • Performance Range: MMLU scores from 50% to 62% across different configurations
  • Efficiency Sweet Spot: Optimal performance/parameter ratio at 2-4B effective parameters

Figure: MMLU performance vs. model size across Mix-n-Match configurations

💾 PLE: Per-Layer Embeddings

Figure: PLE loading requirements comparison showing memory optimization

🧠 Memory Revolution

  • Dramatic Memory Reduction: From 5.44B to 1.91B parameters loaded (65% reduction)
  • Smart Caching: Only core Transformer weights in GPU memory, embeddings cached to fast storage
  • Optional Loading: Audio and vision parameters loaded only when needed

🚀 Innovation Summary

1303
LMArena Score
First sub-10B model at 1300+
65%
Memory Reduction
With PLE caching
6+7
Flexible Blocks
MatFormer architecture

Partner Ecosystem

Built in collaboration with industry leaders for comprehensive platform support

🏭 Hardware Partners

Qualcomm Technologies
Mobile chipset optimization
MediaTek
System-on-chip integration
Samsung System LSI
Mobile processor optimization

⚙️ Platform Partners

AMD
Axolotl
Docker
Hugging Face
llama.cpp
LMStudio
MLX
NVIDIA
Ollama
RedHat
SGLang
Unsloth
vLLM

Day One Launch

Gemma 3N launched with support across 13+ platforms, making it the most comprehensive model release to date.

Community & Resources

Join the growing community and explore additional resources

HuggingFace Hub

Access pre-trained models, code examples, and community contributions.

Explore Models →

Reddit Community

Join discussions, share experiences, and get help from the LocalLLaMA community.

Join Discussion →

OpenRouter API

Free API access to Gemma 3N E4B model for developers and researchers.

Get Free API →

Tech Analysis

In-depth technical analysis and insights from Simon Willison.

Read Analysis →

Getting Started

New to Gemma 3N? Start with our comprehensive getting started guide.

Get Started →

Frequently Asked Questions

Everything you need to know about Gemma 3N

What is Gemma 3N?
Gemma 3N is Google's latest lightweight multimodal AI model designed for mobile and edge devices. It accepts text, image, audio, and video input and produces text output with exceptional efficiency and performance.

How do I get started with Gemma 3N?
You can use Gemma 3N through multiple platforms including Ollama, HuggingFace, or Google AI Studio. The easiest method is Ollama: install Ollama, then run ollama pull gemma3n to download the model.

Is Gemma 3N free to use?
Yes, Gemma 3N is completely free to use. You can download the model for free, run it locally, or use Google AI Studio's free tier for online access.

What are the system requirements?
Gemma 3N supports Windows, macOS, and Linux. The E2B model requires only 2GB of RAM, while E4B needs 3GB. Both CPU and GPU inference are supported, with mobile-first optimization for Android and edge devices.