BitNet on Apple Silicon: Metal works, CPU is broken

We tested Microsoft BitNet b1.58 on a base M2. Metal gives 12 t/s. CPU-only produces gibberish. The real value of 1.58-bit is RAM, not speed.

Microsoft promises 1.58-bit LLMs running on CPU, no GPU required. The README claims a 100B-parameter model can run on a single CPU at 5–7 tokens/sec. We tested it on real hardware: a base M2 with 16 GB, running BitNet-b1.58-2B-4T and Falcon3-7B-Instruct-1.58bit.

Result: via Metal, the 2B model runs at 12.13 t/s with coherent output. Add -ngl 0 (CPU-only) with the same model and the same prompt, and you get "no/var receivedSED l mode74ll encouraged bre speaking removed brown flight." Throughput still looks plausible at 8.37 t/s. If you only watch the speed number and don't read the output, you'd think it works.
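The "read the output" step can be turned into a cheap automated check. Below is a minimal sketch, not part of our test setup: it flags text whose share of common English function words is suspiciously low. The word list, the function names, and the 0.15 threshold are all illustrative assumptions, and the heuristic only makes sense for English prose.

```python
import re

# Illustrative, non-exhaustive set of common English function words (an assumption).
FUNCTION_WORDS = {
    "the", "a", "an", "of", "to", "in", "is", "it", "and", "that", "for",
    "on", "with", "as", "are", "was", "be", "this", "by", "or", "at", "from",
}


def function_word_ratio(text: str) -> float:
    """Share of alphabetic words in `text` that are common function words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in FUNCTION_WORDS)
    return hits / len(words)


def looks_coherent(text: str, min_ratio: float = 0.15) -> bool:
    """Crude sanity check; the threshold is an arbitrary assumption."""
    return function_word_ratio(text) >= min_ratio


# The CPU-only sample quoted above contains no function words at all.
cpu_output = "no/var receivedSED l mode74ll encouraged bre speaking removed brown flight"
print(looks_coherent(cpu_output))  # False
```

A check like this catches the CPU-only sample regardless of the t/s figure. Anything it passes still deserves a human read.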

Falcon3-7B-1.58bit via Metal: coherent output, clean instruction-following. But it runs at 1.70 t/s, slower than humans read. That is not viable for interactive chat on M2 today.
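To put 1.70 t/s next to reading speed, here is a rough conversion. It assumes about 0.75 English words per token and a silent reading speed of about 240 words per minute. Both are rule-of-thumb assumptions, not measurements from our runs.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough average for English text
READING_WPM = 240       # assumption: typical adult silent reading speed


def tokens_per_sec_to_wpm(tps: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert generation speed in tokens/sec to approximate words/min."""
    return tps * words_per_token * 60


# Throughput figures are the Metal numbers reported in this post.
for name, tps in [("2B via Metal", 12.13), ("Falcon3-7B via Metal", 1.70)]:
    wpm = tokens_per_sec_to_wpm(tps)
    verdict = "faster" if wpm > READING_WPM else "slower"
    print(f"{name}: ~{wpm:.1f} words/min, {verdict} than the assumed reading speed")
```

Under these assumptions the 7B model lands at roughly a third of reading speed, while the 2B model is comfortably above it.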

Separate issue: the converter in bitnet.cpp doesn't recognize the BitNetForCausalLM architecture. setup_env.py throws NotImplementedError on both 2B-4T and Falcon3. The working path is to download pre-built GGUFs from the companion repos on Hugging Face.

Key takeaway: the value of 1.58-bit on Apple Silicon is RAM, not speed. A 7B model fits in 4 GB instead of the 14 GB it needs at FP16. That means 7B fits in 16 GB of unified memory with room for the OS and apps. When a native 13–30B BitNet checkpoint ships, that's the inflection point for on-device inference.
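The RAM argument is back-of-the-envelope arithmetic. The sketch below computes weight storage only, for a nominal 7e9 parameters. The function name is ours, and 2 bits per weight is an assumed packing for ternary values, not a statement about the exact GGUF layout. The 4 GB figure above is higher than the packed-weights number, presumably because some tensors stay at higher precision and the runtime needs its own buffers; that breakdown is an assumption, not something we measured.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes, ignoring KV cache and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # nominal parameter count for a "7B" model

print(f"FP16:           {weight_gb(N_PARAMS, 16):.2f} GB")    # 14.00 GB
print(f"2-bit packed:   {weight_gb(N_PARAMS, 2):.2f} GB")     # 1.75 GB
print(f"1.58-bit ideal: {weight_gb(N_PARAMS, 1.58):.2f} GB")  # 1.38 GB
```

The same function makes it easy to try a hypothetical 13B or 30B checkpoint against a 16 GB budget.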

We're building Jippy, a personal AI that lives on your phone. The entire edge-inference story depends on quantization paths like BitNet reaching production grade. BitNet isn't ready for ARM CPU yet. Via Metal, it already works.

Full writeup with numbers and reproduction steps on our Substack.