Four Paths to On-Device AI: An Update
TL;DR: Last month I wrote about three paths to local AI (BitNet, quantization, engine optimization). Gemma 4 just added a fourth: architecture efficiency. Here’s how all four fit together — and what the optimal combination looks like.
A month ago, I wrote about three paths to local AI: BitNet’s 1-bit training, 4-bit post-training quantization, and hardware-specific engine optimization. The thesis was that these three approaches weren’t competing — they were complementary layers.
Then Google shipped Gemma 4, and I realized I was missing a path.
The Missing Path: Architecture Efficiency
Gemma 4 doesn’t shrink weights (that’s quantization). It doesn’t train with 1-bit weights (that’s BitNet). It doesn’t optimize the inference engine (that’s MetalRT). Instead, it redesigns the transformer itself to do more with less.
Three techniques, stacked:
Per-Layer Embeddings (PLE). A standard transformer looks up token embeddings once, from a single table at the input. Gemma 4 gives each decoder layer its own table. These embeddings are static lookup tables: read once, inject into the decoder, done. Because they’re read-only, they can live on flash storage instead of eating VRAM. Result: the E2B variant has 5.1B total parameters but only 2.3B resident in memory (~2.54 GB).
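To make the "read once, inject, done" idea concrete, here is a toy sketch in Python. Everything in it is hypothetical: the function names, the tiny table sizes, and especially the injection step, which simply adds the looked-up row to the hidden state. Gemma 4’s real injection mechanism may well differ. The point is only that each layer’s table is indexed, never written to, which is what makes it a candidate for slow read-only storage.

```python
import random


def make_ple_tables(num_layers, vocab_size, dim, seed=0):
    """Build one small read-only embedding table per decoder layer.

    Toy stand-in for tables that could live on flash; nested tuples make
    the read-only nature explicit.
    """
    rng = random.Random(seed)
    return tuple(
        tuple(
            tuple(rng.uniform(-1.0, 1.0) for _ in range(dim))
            for _ in range(vocab_size)
        )
        for _ in range(num_layers)
    )


def inject_layer_embedding(hidden, tables, layer, token_id):
    """Look up this layer's row for the token and add it to the hidden state.

    Assumption: injection is a plain element-wise add. This is a
    simplification chosen for illustration.
    """
    row = tables[layer][token_id]  # pure indexing, no matrix multiply
    return [h + e for h, e in zip(hidden, row)]


def run_decoder_stub(token_id, tables, dim):
    """Walk the layers, injecting each layer's own embedding for the token."""
    hidden = [0.0] * dim
    for layer in range(len(tables)):
        hidden = inject_layer_embedding(hidden, tables, layer, token_id)
        # A real decoder layer (attention + MLP) would transform `hidden` here.
    return hidden


tables = make_ple_tables(num_layers=3, vocab_size=5, dim=4)
print(run_decoder_stub(token_id=2, tables=tables, dim=4))
```

Only one row per layer is touched for each token, so the bulk of every table can stay on storage until its row is needed.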
Mixture of Experts (MoE). 26B total parameters, but only 4B activate per token. A router picks the right experts for each input. Knowledge capacity of a 26B model, compute cost of a 4B model. The catch: all 26B parameters still need to be loaded into memory.
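A minimal sketch of the routing step, assuming a softmax router with top-k selection and renormalized weights. That is a common MoE design, but I have not confirmed it is what Gemma 4 uses. The experts here are trivial scalar functions and every name is made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the chosen experts and mix their outputs by router weight."""
    chosen = route_top_k(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in chosen)
    return output, [i for i, _ in chosen]


# Four toy "experts": each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical router scores for one token; experts 1 and 3 score highest.
output, used = moe_forward(1.0, experts, router_logits=[0.0, 5.0, 1.0, 5.0], k=2)
print(f"experts used: {used}, experts held in memory: {len(experts)}")
```

Only two of the four experts run for this token, but all four still have to exist in the `experts` list. That is the catch in miniature: compute scales with k, memory scales with the total.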
Hybrid Attention. Most layers use sliding-window attention over a 512-token window. A few layers get global attention over the full 128K context. Plus a trick: in the global layers Key = Value, which cuts the KV cache for those layers in half.
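A back-of-envelope sketch of why this matters for the KV cache. Only the 512-token window and the 128K context come from the description above. The layer count, the ratio of global to local layers, the KV width, and the 2-byte values are hypothetical placeholders, not Gemma 4’s real configuration.

```python
def kv_cache_bytes(context_len, num_layers, global_every, window, kv_dim,
                   bytes_per_value=2, share_kv_in_global=True):
    """Estimate KV cache size for a mix of sliding-window and global layers.

    Every `global_every`-th layer is treated as global; the rest are local.
    Local layers cache at most `window` tokens. Global layers cache the whole
    context, and store one tensor instead of two when Key = Value.
    """
    total = 0
    for layer in range(num_layers):
        is_global = (layer + 1) % global_every == 0
        if is_global:
            tokens = context_len
            tensors = 1 if share_kv_in_global else 2
        else:
            tokens = min(context_len, window)
            tensors = 2  # separate K and V
        total += tokens * tensors * kv_dim * bytes_per_value
    return total


CONTEXT = 128 * 1024  # assuming "128K" means 131,072 tokens

# Hypothetical model shape, chosen only to make the comparison visible.
hybrid = kv_cache_bytes(CONTEXT, num_layers=30, global_every=6,
                        window=512, kv_dim=1024)
full = kv_cache_bytes(CONTEXT, num_layers=30, global_every=1,
                      window=512, kv_dim=1024, share_kv_in_global=False)

print(f"hybrid:         {hybrid / 2**30:.2f} GiB")
print(f"full attention: {full / 2**30:.2f} GiB")
```

Under these assumptions almost all of the remaining cache sits in the few global layers, which is why halving them with Key = Value is worth the trouble.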
The Updated Map
Here’s how all four paths compare:
| Path | Representative | Solves | Doesn’t Solve |
|---|---|---|---|
| 1. Train-time compression | BitNet 100B | Reasoning depth — fit more decoder layers per GB | Speed (5-7 tok/s) |
| 2. Post-training compression | 4-bit GGUF/MLX | Run existing models on small devices | Quality loss in edge cases |
| 3. Architecture efficiency | Gemma 4 (PLE + MoE + Hybrid Attention) | Do more per parameter, per byte, per FLOP | Reasoning depth (still 4B-class at its core) |
| 4. Engine optimization | MetalRT (550 tok/s) | Maximum speed on specific hardware | Intelligence ceiling (0.6-4B models) |
What Each Path Actually Buys You
The key insight is that each path attacks a different bottleneck:
BitNet pushes on the hardest problem: reasoning depth. A 100B-parameter model thinks deeper than a 4B model, period. BitNet lets you fit 100B parameters in the memory footprint of a 40B 4-bit model. The cost is speed — 5-7 tok/s isn’t great for real-time conversation.
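The footprint claim is easy to check with rough arithmetic. This sketch assumes ternary weights at about 1.58 bits each (log2 of 3), and counts weights only: no activations, no KV cache, no quantization metadata. A strict 1-bit figure is included for comparison.

```python
import math


def weight_bytes(num_params, bits_per_weight):
    """Raw storage for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


TERNARY_BITS = math.log2(3)  # about 1.58 bits for weights in {-1, 0, +1}

bitnet_100b = weight_bytes(100e9, TERNARY_BITS)
one_bit_100b = weight_bytes(100e9, 1)
int4_40b = weight_bytes(40e9, 4)

for label, size in [
    ("100B at ~1.58 bits", bitnet_100b),
    ("100B at 1 bit", one_bit_100b),
    ("40B at 4 bits", int4_40b),
]:
    print(f"{label:>20}: {size / 1e9:.1f} GB")
```

At about 1.58 bits per weight the 100B model and the 40B 4-bit model land within a few percent of each other, which is where the comparison comes from.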
Quantization is the pragmatist’s choice. Take any model, compress it, ship it. Quality degrades at the margins but works for 80% of use cases. It’s the reason most people run local models today.
Architecture efficiency is what Gemma 4 brought to the table. PLE saves memory (embeddings on flash). MoE saves compute (4B active out of 26B). They’re complementary: PLE saves memory but not compute, MoE saves compute but not memory, so using both improves both. But neither adds reasoning depth. A model that activates 4B parameters per token is still thinking 4B-deep, MoE routing or not.
Engine optimization is pure speed. MetalRT hits 550 tok/s on M3 Max with small models. The models aren’t smarter — they’re just faster. Great for voice UIs where latency matters more than intelligence.
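To put the speed numbers in human terms, here is how long a reply takes at each rate. The 100-token reply length is a made-up example, and the sketch counts decoding only, ignoring prompt processing. Keep in mind the rates come from very different model sizes, so this compares experiences, not engines.

```python
def reply_seconds(num_tokens, tokens_per_second):
    """Time to generate a reply at a steady decode rate."""
    return num_tokens / tokens_per_second


REPLY_TOKENS = 100  # hypothetical reply length

rates = [
    ("BitNet 100B, low end", 5),
    ("BitNet 100B, high end", 7),
    ("MetalRT, small model", 550),
]

for label, rate in rates:
    seconds = reply_seconds(REPLY_TOKENS, rate)
    print(f"{label:>22}: {seconds:.2f} s")
```

A wait of many seconds versus a fraction of a second is the difference between reading a reply and holding a conversation, which is why the fast path suits voice UIs.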
The Ranking
Sorted by how hard the target problem is, hardest first (with architecture efficiency split into its MoE and PLE parts):
- BitNet → Reasoning depth (hardest)
- MoE → Knowledge breadth
- PLE → Token representation richness
- Engine optimization → Speed (easiest, relatively)
The Theoretical Optimum
Stack them all: BitNet’s massive decoder (for depth) + PLE embeddings stored on flash (for memory efficiency) + an engine like MetalRT for hardware-level speed.
Nobody’s built this yet. But the pieces exist independently. The real question is whether the combination compounds or conflicts.
What Changed From March to April
In March, I was asking: can models run locally at all?
In April, Gemma 4 shifted the question to: how do models get to users?
Google’s move with Gemma 4 isn’t just architecture innovation. It’s a distribution play. A small, efficient model with official cross-platform support — iOS, Android, web — shipped through app stores. One-click install. No pip install, no GGUF hunting, no llama.cpp config.
That’s a fundamentally different approach from the BitNet/quantization community, which optimizes for tinkerers running models in terminals.
Distribution might matter more than architecture. The best model that nobody can install loses to a decent model pre-loaded on every phone.
Where This Leaves Us
The four paths aren’t a menu where you pick one. They’re layers that stack. The question for 2026 isn’t “which path wins” — it’s “who combines them first.”
My bet: the first practical combination will be architecture efficiency (Gemma 4-style) + engine optimization (MetalRT-style), because both are ready now. BitNet at scale is still a research project. But when it graduates, the combination of all four will be genuinely transformative.
Until then, most of us will run 4-bit quantized models on whatever hardware we have, and that’s fine. The 80% solution is already here. We’re arguing about the last 20%.
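One-sentence summary: Last month I wrote about three paths (BitNet, quantization, engine optimization). Gemma 4 brings a fourth: architecture efficiency. Each of the four attacks a different bottleneck, and the theoretical optimum is to stack them all.
A month ago I wrote about three paths and argued that BitNet, 4-bit quantization, and engine optimization were three complementary layers. At the time, that felt fairly complete.
Then Google released Gemma 4, and I realized I had missed one.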
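The New Path: Architecture Efficiency
Gemma 4 doesn’t compress weights (that’s quantization), doesn’t train with 1-bit weights (that’s BitNet), and doesn’t optimize the engine (that’s MetalRT). Instead, it redesigns the transformer itself so that every unit of resource goes further.
Three moves, stacked together:
PLE (Per-Layer Embeddings). In a traditional transformer, all layers work from one embedding table. Gemma 4 gives each decoder layer its own. These embeddings are static lookups: look up once, inject into the decoder, done. Because they are read-only, they can sit on flash storage and stay out of VRAM. Result: the E2B variant has 5.1B total parameters, but only 2.3B are resident in memory (about 2.54 GB).
MoE (Mixture of Experts). 26B total parameters, with only 4B activated per token. A router assigns experts based on the input. Knowledge capacity is 26B-class, compute cost is 4B-class. There is a catch: all 26B parameters still have to be loaded into memory.
Hybrid Attention. Most layers use a 512-token sliding window, and a few layers use global attention over the full 128K context. The global layers add one more trick: Key = Value, which cuts their KV cache in half.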
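The Full Map of Four Paths
| Path | Representative | Solves | Doesn’t Solve |
|---|---|---|---|
| 1. Train-time compression | BitNet 100B | Reasoning depth: more decoder layers in the same memory | Speed (5-7 tok/s) |
| 2. Post-training compression | 4-bit GGUF/MLX | Running existing models on small devices | Quality loss in edge cases |
| 3. Architecture efficiency | Gemma 4 (PLE + MoE + Hybrid Attention) | Doing more per parameter, per byte, per FLOP | Reasoning depth (still 4B-class at its core) |
| 4. Engine optimization | MetalRT (550 tok/s) | Maximum speed on specific hardware | Intelligence ceiling (0.6-4B models) |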
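What Each Path Actually Buys
The key point is that the four paths are not attacking the same bottleneck:
BitNet takes on the hardest problem: reasoning depth. A 100-billion-parameter model simply thinks deeper than a 4-billion-parameter one, and there is no shortcut. BitNet lets you hold 100B parameters in the memory of a 40B 4-bit model. The price is speed: at 5-7 tok/s, real-time conversation is a stretch.
Quantization is the pragmatist’s choice. Take a model, compress it, run it. Edge cases suffer, but it is good enough for 80% of scenarios. It is how most people run local models today.
Architecture efficiency is the new dimension Gemma 4 brings. PLE saves memory (embeddings on flash), and MoE saves compute (only 4B active out of 26B). The two are complementary: PLE saves memory but not compute, MoE saves compute but not memory, and together they save both. Reasoning depth? Neither one touches it. A 4B-class model with MoE routing still thinks at 4B depth.
Engine optimization is purely about speed. MetalRT reaches 550 tok/s on an M3 Max. The model is no smarter, only faster. That is a great fit for voice interaction, where latency matters more than intelligence.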
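The Ranking
Ranked by the difficulty of the problem each one solves:
- BitNet → Reasoning depth (hardest)
- MoE → Knowledge breadth
- PLE → Token representation richness
- Engine optimization → Speed (relatively the easiest)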
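The Theoretical Optimum
Stack everything: BitNet’s giant decoder (for depth) + PLE embeddings on flash (for memory savings) + a MetalRT-class engine (for speed).
Nobody has built this yet, but the parts all exist. The real question is whether the combination multiplies or clashes.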
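What Changed From March to April
The question in March was: can models run locally at all?
In April, Gemma 4 turned the question into: how do models reach users?
Google’s move is not only architecture innovation. It is also a distribution strategy. A small, efficient model with official cross-platform support (iOS, Android, web), distributed through app stores. One-click install. No pip install, no hunting for GGUF files, no llama.cpp configuration.
That is a completely different mindset from the BitNet/quantization community, which optimizes for enthusiasts tinkering with models in a terminal.
Distribution may matter more than architecture. The best model in the world is worthless if nobody can install it. A decent model preinstalled on every phone has better odds.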
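Where We Are Now
The four paths are not a pick-one choice. They are four layers that stack. The question for 2026 isn’t “which path wins” but “who combines them first.”
My call: the first combination to ship will probably be architecture efficiency (the Gemma 4 approach) + engine optimization (the MetalRT approach), because both are usable today. BitNet at scale is still in the research stage. But the day it graduates, all four paths stacked together will genuinely change the picture.
Until then, most people will keep running 4-bit quantized models on whatever hardware is at hand, and that is enough. 80% of the problem is already solved. What we are arguing about is the last 20%.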