Four Paths to On-Device AI: An Update

Posted Apr 8, 2026

By Fuzzy Tiger

TL;DR: Last month I wrote about three paths to local AI (BitNet, quantization, engine optimization). Gemma 4 just added a fourth: architecture efficiency. Here’s how all four fit together — and what the optimal combination looks like.

A month ago, I wrote about three paths to local AI: BitNet’s 1-bit training, 4-bit post-training quantization, and hardware-specific engine optimization. The thesis was that these three approaches weren’t competing — they were complementary layers.

Then Google shipped Gemma 4, and I realized I was missing a path.

The Missing Path: Architecture Efficiency

Gemma 4 doesn’t shrink weights (that’s quantization). It doesn’t train with 1-bit weights (that’s BitNet). It doesn’t optimize the inference engine (that’s MetalRT). Instead, it redesigns the transformer itself to do more with less.

Three techniques, stacked:

Per-Layer Embeddings (PLE). A traditional transformer uses a single embedding table, applied once at the input. Gemma 4 gives each decoder layer its own. The embeddings are static lookup tables: read the row for the current token, inject it into the decoder layer, done. Because they’re read-only, they can live on flash storage instead of eating VRAM. Result: the E2B variant has 5.1B total parameters but only 2.3B resident in memory (~2.54 GB).
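To make the flash-storage idea concrete, here’s a minimal sketch of what a per-layer lookup could look like when the tables sit in a file that is memory-mapped read-only. This is not Gemma 4’s actual implementation: the sizes, the file layout, and the names (write_tables, lookup, ple_tables.bin) are all made up for illustration.

```python
import mmap
import os
import struct
import tempfile

NUM_LAYERS = 4     # hypothetical sizes, picked to keep the demo tiny
VOCAB_SIZE = 8
PLE_DIM = 3
FLOAT_BYTES = 4    # float32


def write_tables(path):
    """Write one embedding table per layer, back to back, as float32 rows."""
    with open(path, "wb") as f:
        for layer in range(NUM_LAYERS):
            for token in range(VOCAB_SIZE):
                row = [layer + token / 10.0 + d / 100.0 for d in range(PLE_DIM)]
                f.write(struct.pack(f"<{PLE_DIM}f", *row))


def lookup(mapped, layer, token):
    """Read the single row for (layer, token) straight from the mapped file."""
    offset = (layer * VOCAB_SIZE + token) * PLE_DIM * FLOAT_BYTES
    return struct.unpack_from(f"<{PLE_DIM}f", mapped, offset)


path = os.path.join(tempfile.mkdtemp(), "ple_tables.bin")
write_tables(path)

with open(path, "rb") as f:
    # Read-only mapping: data is pulled from storage only when it is touched.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    token_id = 5
    per_layer_vectors = [lookup(mapped, layer, token_id) for layer in range(NUM_LAYERS)]
    mapped.close()

total_bytes = NUM_LAYERS * VOCAB_SIZE * PLE_DIM * FLOAT_BYTES   # size of all tables
touched_bytes = NUM_LAYERS * PLE_DIM * FLOAT_BYTES              # rows read for one token
print(f"tables on disk: {total_bytes} bytes, rows read for one token: {touched_bytes} bytes")
```

Only the rows for the tokens you actually see get read; the rest of the file stays on storage. The OS reads whole pages, so real traffic is somewhat higher than the row count suggests.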

Mixture of Experts (MoE). 26B total parameters, but only 4B activate per token. A router picks the right experts for each input, so you get the knowledge capacity of a 26B model at the compute cost of a 4B model. The catch: any token can be routed to any expert, so all 26B parameters still need to be loaded into memory.
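Here’s a toy top-k router to show why compute drops while memory doesn’t. It’s a sketch under my own assumptions (8 experts, 2 chosen per token, plain matrices standing in for experts), not Gemma 4’s routing code.

```python
import math
import random

NUM_EXPERTS = 8   # hypothetical sizes, picked to keep the demo tiny
TOP_K = 2         # experts that actually run per token
DIM = 4           # hidden size

rng = random.Random(0)


def make_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Every expert has to be resident: the router may pick any of them.
router = make_matrix(NUM_EXPERTS, DIM)
experts = [make_matrix(DIM, DIM) for _ in range(NUM_EXPERTS)]


def moe_forward(x):
    """Score all experts, run only the TOP_K best, and mix their outputs."""
    scores = matvec(router, x)
    chosen = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    weights = softmax([scores[i] for i in chosen])
    out = [0.0] * DIM
    for weight, idx in zip(weights, chosen):
        expert_out = matvec(experts[idx], x)
        out = [acc + weight * y for acc, y in zip(out, expert_out)]
    return out, chosen


output, chosen_experts = moe_forward([0.5, -0.2, 0.1, 0.9])

# Expert weights only; the small router matrix is left out of both counts.
resident_params = NUM_EXPERTS * DIM * DIM   # what memory has to hold
active_params = TOP_K * DIM * DIM           # what one token actually multiplies
print(f"experts used: {chosen_experts}, resident: {resident_params}, active: {active_params}")
```

The ratio of active to resident parameters here is 2 in 8. The post’s 4B-of-26B figure is the same idea at scale.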

Hybrid Attention. Most layers use sliding-window attention over a 512-token local window. A few layers get global attention over the full 128K context. Plus a trick where Key = Value in global layers, which halves the KV cache for those layers.
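A rough way to see what the hybrid layout buys is to count cached tensors per position. The sketch below assumes a causal, look-back-only window and a made-up split of 25 local and 5 global layers; neither the split nor the function names come from Gemma 4’s documentation.

```python
def visible_positions(query_pos, window):
    """Positions one query may attend to in a sliding-window layer.

    Assumption: the window is causal (look-back only) and includes the
    query position itself.
    """
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


def kv_cache_entries(context_len, local_layers, global_layers, window, shared_kv):
    """Count cached per-position tensors, ignoring heads and head size."""
    # Local layers never need more than `window` positions, each with K and V.
    local = local_layers * min(context_len, window) * 2
    # Global layers keep every position; K = V means one tensor instead of two.
    per_position = 1 if shared_kv else 2
    return local + global_layers * context_len * per_position


CONTEXT = 128 * 1024     # 128K tokens
WINDOW = 512
LOCAL, GLOBAL = 25, 5    # hypothetical layer split

all_global = kv_cache_entries(CONTEXT, 0, LOCAL + GLOBAL, WINDOW, shared_kv=False)
hybrid = kv_cache_entries(CONTEXT, LOCAL, GLOBAL, WINDOW, shared_kv=False)
hybrid_shared = kv_cache_entries(CONTEXT, LOCAL, GLOBAL, WINDOW, shared_kv=True)

print(f"all global: {all_global:,}")
print(f"hybrid: {hybrid:,}")
print(f"hybrid with K = V: {hybrid_shared:,}")
```

With these made-up numbers, the hybrid layout caches nearly 6x less than all-global attention at full context, and K = V roughly halves it again, because the global layers dominate what is left.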

The Updated Map

Here’s how all four paths compare:

Path	Representative	Solves	Doesn’t Solve
1. Train-time compression	BitNet 100B	Reasoning depth — fit more decoder layers per GB	Speed (5-7 tok/s)
2. Post-training compression	4-bit GGUF/MLX	Run existing models on small devices	Quality loss in edge cases
3. Architecture efficiency	Gemma 4 (PLE + MoE + Hybrid Attention)	Do more per parameter, per byte, per FLOP	Reasoning depth (still 4B-class at its core)
4. Engine optimization	MetalRT (550 tok/s)	Maximum speed on specific hardware	Intelligence ceiling (0.6-4B models)

What Each Path Actually Buys You

The key insight is that each path attacks a different bottleneck:

BitNet pushes on the hardest problem: reasoning depth. A 100B-parameter model thinks deeper than a 4B model, period. At roughly 1.58 bits per weight (ternary weights), BitNet fits 100B parameters into about 20 GB, the weight footprint of a 40B 4-bit model. The cost is speed: 5-7 tok/s isn’t great for real-time conversation.
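The 100B-versus-40B comparison is easy to check by hand. This sketch assumes about 1.58 bits per weight for BitNet (ternary weights, as in BitNet b1.58) and counts weights only, with no activations, KV cache, or packing overhead.

```python
def weight_gb(params_billion, bits_per_weight):
    """Weight storage in GB (10^9 bytes), weights only."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


bitnet_100b = weight_gb(100, 1.58)   # ternary weights, assumed 1.58 bits each
int4_40b = weight_gb(40, 4)          # 4-bit quantized 40B model
int4_100b = weight_gb(100, 4)        # the same 100B model at 4 bits, for contrast

print(f"BitNet 100B: {bitnet_100b:.2f} GB")
print(f"4-bit 40B:   {int4_40b:.2f} GB")
print(f"4-bit 100B:  {int4_100b:.2f} GB")
```

That prints 19.75 GB, 20.00 GB, and 50.00 GB. The first two match, which is where the "100B in the footprint of a 40B 4-bit model" line comes from.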

Quantization is the pragmatist’s choice. Take any model, compress it, ship it. Quality degrades at the margins but works for 80% of use cases. It’s the reason most people run local models today.

Architecture efficiency is what Gemma 4 brought to the table. PLE saves memory (embeddings on flash). MoE saves compute (4B active out of 26B). They’re complementary: PLE saves memory but not compute, MoE saves compute but not memory. Stack them and you cut both. But neither adds reasoning depth. A Gemma 4 model that activates 4B parameters per token is still thinking 4B-deep.

Engine optimization is pure speed. MetalRT hits 550 tok/s on M3 Max with small models. The models aren’t smarter — they’re just faster. Great for voice UIs where latency matters more than intelligence.
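To put those throughput numbers in terms a user would feel, here is the wait for one reply at each speed. The 200-token reply length is my own assumption, and the sketch ignores prompt processing and time to first token.

```python
def seconds_for_reply(reply_tokens, tokens_per_second):
    """Generation time for one reply at a steady decode rate."""
    return reply_tokens / tokens_per_second


REPLY_TOKENS = 200   # assumed reply length

bitnet_slowest = seconds_for_reply(REPLY_TOKENS, 5)    # 5 tok/s end of the BitNet range
bitnet_fastest = seconds_for_reply(REPLY_TOKENS, 7)    # 7 tok/s end of the BitNet range
metalrt = seconds_for_reply(REPLY_TOKENS, 550)         # MetalRT figure quoted above

print(f"BitNet 100B: {bitnet_fastest:.1f} to {bitnet_slowest:.1f} s")
print(f"MetalRT:     {metalrt:.2f} s")
```

Roughly half a minute versus about a third of a second. That gap is why the fast path suits voice UIs and the deep path doesn’t, at least not yet.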

The Ranking

If you sort these by which problem is hardest to solve:

BitNet → Reasoning depth (hardest)
MoE → Knowledge breadth
PLE → Token representation richness
Engine optimization → Speed (easiest, relatively)

The Theoretical Optimum

Stack them all: BitNet’s massive decoder (for depth) + PLE embeddings stored on flash (for memory efficiency) + an engine like MetalRT for hardware-level speed.

Nobody’s built this yet. But the pieces exist independently. The real question is whether the combination compounds or conflicts.

What Changed From March to April

In March, I was asking: can models run locally at all?

In April, Gemma 4 shifted the question to: how do models get to users?

Google’s move with Gemma 4 isn’t just architecture innovation. It’s a distribution play. A small, efficient model with official cross-platform support — iOS, Android, web — shipped through app stores. One-click install. No pip install, no GGUF hunting, no llama.cpp config.

That’s a fundamentally different approach from the BitNet/quantization community, which optimizes for tinkerers running models in terminals.

Distribution might matter more than architecture. The best model that nobody can install loses to a decent model pre-loaded on every phone.

Where This Leaves Us

The four paths aren’t a menu where you pick one. They’re layers that stack. The question for 2026 isn’t “which path wins” — it’s “who combines them first.”

My bet: the first practical combination will be architecture efficiency (Gemma 4-style) + engine optimization (MetalRT-style), because both are ready now. BitNet at scale is still a research project. But when it graduates, the combination of all four will be genuinely transformative.

Until then, most of us will run 4-bit quantized models on whatever hardware we have, and that’s fine. The 80% solution is already here. We’re arguing about the last 20%.

This post is licensed under CC BY 4.0 by the author.