Four Paths to On-Device AI: An Update

TL;DR: Last month I wrote about three paths to local AI (BitNet, quantization, engine optimization). Gemma 4 just added a fourth: architecture efficiency. Here’s how all four fit together — and what the optimal combination looks like.


A month ago, I wrote about three paths to local AI: BitNet’s 1-bit training, 4-bit post-training quantization, and hardware-specific engine optimization. The thesis was that these three approaches weren’t competing — they were complementary layers.

Then Google shipped Gemma 4, and I realized I was missing a path.

The Missing Path: Architecture Efficiency

Gemma 4 doesn’t shrink weights (that’s quantization). It doesn’t train with 1-bit weights (that’s BitNet). It doesn’t optimize the inference engine (that’s MetalRT). Instead, it redesigns the transformer itself to do more with less.

Three techniques, stacked:

Per-Layer Embeddings (PLE). Traditional transformers use a single embedding table, applied once at the input. Gemma 4 gives each decoder layer its own. The embeddings are static lookup tables: read once, inject into the decoder, done. Because they’re read-only, they can live on flash storage instead of eating VRAM. Result: the E2B variant has 5.1B total parameters but only 2.3B resident in memory (~2.54 GB).
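To make the flash-storage argument concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration (the `FlashTable` class, the tiny sizes, the simple additive injection); it is not Gemma 4’s implementation. It only shows that a forward pass touches one row per token per layer, which is why the tables never have to sit in VRAM as a whole.

```python
# Toy sketch of per-layer embedding lookup. All names and sizes are
# hypothetical; real PLE tables are far larger and the injection is learned.

class FlashTable:
    """Stand-in for a read-only embedding table kept on flash storage."""

    def __init__(self, layer, vocab_size, dim):
        self.layer = layer
        self.vocab_size = vocab_size
        self.dim = dim
        self.rows_read = 0  # counts how many rows were fetched

    def lookup(self, token_id):
        # A real table would read one row from storage; here we fabricate
        # a deterministic row so the example stays self-contained.
        self.rows_read += 1
        return [((token_id + self.layer + j) % 7) / 7.0 for j in range(self.dim)]


def forward(token_ids, tables, dim):
    """Run a fake decoder stack: each layer adds its own embedding row."""
    hidden = [[0.0] * dim for _ in token_ids]
    for table in tables:  # one table per decoder layer
        for pos, tok in enumerate(token_ids):
            row = table.lookup(tok)
            hidden[pos] = [h + r for h, r in zip(hidden[pos], row)]
    return hidden


N_LAYERS, VOCAB, DIM = 4, 1000, 8
tables = [FlashTable(layer, VOCAB, DIM) for layer in range(N_LAYERS)]
hidden = forward([5, 42, 7], tables, DIM)

rows_touched = sum(t.rows_read for t in tables)
rows_total = N_LAYERS * VOCAB
print(f"rows read: {rows_touched} of {rows_total}")
```

Three tokens through four layers read 12 rows out of 4,000. The rest of each table can stay on slower storage without slowing the pass much, as long as single-row reads are cheap enough.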

Mixture of Experts (MoE). 26B total parameters, but only 4B activate per token. A router picks the right experts for each input. Knowledge capacity of a 26B model, compute cost of a 4B model. The catch: all 26B parameters still need to be loaded into memory.
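A minimal sketch of top-k routing, assuming a dot-product router and trivial stand-in experts. The function names, the four experts, and `top_k=2` are all hypothetical; Gemma 4’s actual router and expert counts are not described here. The point is the asymmetry: every expert is held in memory, but only the chosen ones run.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical expert: a trivial function standing in for an FFN."""
    return lambda x: [scale * v for v in x]


def moe_forward(x, router_weights, experts, top_k):
    """Route one token vector to its top_k experts and mix their outputs."""
    # Router: one score per expert (dot product with the token vector).
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])  # renormalize over chosen experts
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)  # only the chosen experts do any work
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen


# All four experts must be loaded, even though only two run per token.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
x = [0.5, 2.0]

out, chosen = moe_forward(x, router_weights, experts, top_k=2)
print(f"experts run: {chosen} of {len(experts)}")
```

Compute scales with `top_k`, while memory scales with `len(experts)`. That is the catch described above.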

Hybrid Attention. Most layers use sliding window attention, where each token attends only to a local window of 512 tokens. A few layers get global attention over the full 128K context. Plus a trick where Key = Value in the global layers, which halves the KV cache for those layers.
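The KV cache savings are easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical configuration (30 layers, 8 KV heads, head dimension 128, 2 bytes per value, a 25/5 split between local and global layers); none of these numbers come from Gemma 4’s published config. Only the 512-token window and the 128K context are taken from the text.

```python
def kv_cache_bytes(seq_len, n_local, n_global, window,
                   kv_heads, head_dim, bytes_per_value=2,
                   share_kv_in_global=False):
    """Estimate KV cache size in bytes for one sequence."""
    # Size of one K (or one V) entry for a single token in a single layer.
    per_vector = kv_heads * head_dim * bytes_per_value
    # Sliding-window layers only keep the most recent `window` tokens, K and V.
    local = n_local * min(seq_len, window) * per_vector * 2
    # Global layers keep every token; with Key = Value they store one tensor.
    tensors = 1 if share_kv_in_global else 2
    global_part = n_global * seq_len * per_vector * tensors
    return local + global_part


SEQ = 128 * 1024  # 128K-token context
CFG = dict(window=512, kv_heads=8, head_dim=128)  # hypothetical model shape

full = kv_cache_bytes(SEQ, n_local=0, n_global=30, **CFG)
hybrid = kv_cache_bytes(SEQ, n_local=25, n_global=5, **CFG)
hybrid_shared = kv_cache_bytes(SEQ, n_local=25, n_global=5,
                               share_kv_in_global=True, **CFG)

for name, size in [("global attention everywhere", full),
                   ("hybrid", hybrid),
                   ("hybrid with Key = Value", hybrid_shared)]:
    print(f"{name:>28}: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache drops from roughly 15 GiB to about 2.5 GiB with hybrid attention, and to about 1.3 GiB once the global layers share keys and values. Most of the saving comes from the sliding window; Key = Value then halves what remains in the global layers.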

The Updated Map

Here’s how all four paths compare:

| Path | Representative | Solves | Doesn’t Solve |
|---|---|---|---|
| 1. Train-time compression | BitNet 100B | Reasoning depth: fit more decoder layers per GB | Speed (5-7 tok/s) |
| 2. Post-training compression | 4-bit GGUF/MLX | Run existing models on small devices | Quality loss in edge cases |
| 3. Architecture efficiency | Gemma 4 (PLE + MoE + Hybrid Attention) | Do more per parameter, per byte, per FLOP | Reasoning depth (still 4B-class at its core) |
| 4. Engine optimization | MetalRT (550 tok/s) | Maximum speed on specific hardware | Intelligence ceiling (0.6-4B models) |

What Each Path Actually Buys You

The key insight is that each path attacks a different bottleneck:

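BitNet pushes on the hardest problem: reasoning depth. A 100B-parameter model thinks deeper than a 4B model, period. BitNet lets you fit 100B parameters in the memory footprint of a 40B 4-bit model: at roughly 1.58 bits per weight (BitNet b1.58’s ternary weights), 100B parameters come to about 20 GB, the same as 40B parameters at 4 bits. The cost is speed: 5-7 tok/s isn’t great for real-time conversation.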
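The footprint comparison can be checked in a few lines. This sketch assumes ternary weights at about 1.58 bits each (as in BitNet b1.58) and counts weights only; activations, KV cache, and any packing overhead are ignored, so real numbers will be somewhat higher.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes), ignoring all runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9


bitnet_100b = weight_gigabytes(100e9, 1.58)  # assumed ~1.58 bits per ternary weight
int4_40b = weight_gigabytes(40e9, 4)         # 4-bit post-training quantization
fp16_100b = weight_gigabytes(100e9, 16)      # the same 100B model at 16-bit

print(f"100B BitNet (1.58-bit): {bitnet_100b:.2f} GB")
print(f"40B at 4-bit:           {int4_40b:.2f} GB")
print(f"100B at 16-bit:         {fp16_100b:.2f} GB")
```

The first two land within a fraction of a gigabyte of each other, which is where the “100B in the footprint of a 40B 4-bit model” comparison comes from.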

Quantization is the pragmatist’s choice. Take any model, compress it, ship it. Quality degrades at the margins but works for 80% of use cases. It’s the reason most people run local models today.

Architecture efficiency is what Gemma 4 brought to the table. PLE saves memory (embeddings on flash). MoE saves compute (4B active out of 26B). They’re complementary: PLE saves memory but not compute, MoE saves compute but not memory. Combine them and both costs drop. But neither adds reasoning depth. A Gemma 4 model with 4B active parameters and MoE routing is still thinking 4B-deep.

Engine optimization is pure speed. MetalRT hits 550 tok/s on M3 Max with small models. The models aren’t smarter — they’re just faster. Great for voice UIs where latency matters more than intelligence.
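To see why throughput matters so much for voice, here is a rough latency estimate. The 100-token reply length is an assumption, and the calculation covers generation only; prompt processing and audio synthesis are not included.

```python
def seconds_for_reply(n_tokens, tokens_per_second):
    """Time to generate a reply, counting decode throughput only."""
    return n_tokens / tokens_per_second


REPLY_TOKENS = 100  # assumed length of a short spoken answer

for label, speed in [("MetalRT-class engine", 550),
                     ("BitNet 100B, upper bound", 7),
                     ("BitNet 100B, lower bound", 5)]:
    t = seconds_for_reply(REPLY_TOKENS, speed)
    print(f"{label:>26}: {t:.2f} s")
```

At 550 tok/s the reply is ready in well under a second, while at 5-7 tok/s the same reply takes roughly 14 to 20 seconds.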

The Ranking

If you sort these by which problem is hardest to solve:

  1. BitNet → Reasoning depth (hardest)
  2. MoE → Knowledge breadth
  3. PLE → Token representation richness
  4. Engine optimization → Speed (easiest, relatively)

The Theoretical Optimum

Stack them all: BitNet’s massive decoder (for depth) + PLE embeddings stored on flash (for memory efficiency) + an engine like MetalRT for hardware-level speed.

Nobody’s built this yet. But the pieces exist independently. The real question is whether the combination compounds or conflicts.

What Changed From March to April

In March, I was asking: can models run locally at all?

In April, Gemma 4 shifted the question to: how do models get to users?

Google’s move with Gemma 4 isn’t just architecture innovation. It’s a distribution play. A small, efficient model with official cross-platform support — iOS, Android, web — shipped through app stores. One-click install. No pip install, no GGUF hunting, no llama.cpp config.

That’s a fundamentally different approach from the BitNet/quantization community, which optimizes for tinkerers running models in terminals.

Distribution might matter more than architecture. The best model that nobody can install loses to a decent model pre-loaded on every phone.

Where This Leaves Us

The four paths aren’t a menu where you pick one. They’re layers that stack. The question for 2026 isn’t “which path wins” — it’s “who combines them first.”

My bet: the first practical combination will be architecture efficiency (Gemma 4-style) + engine optimization (MetalRT-style), because both are ready now. BitNet at scale is still a research project. But when it graduates, the combination of all four will be genuinely transformative.

Until then, most of us will run 4-bit quantized models on whatever hardware we have, and that’s fine. The 80% solution is already here. We’re arguing about the last 20%.



This post is licensed under CC BY 4.0 by the author.