ANEMLL Project Reveals Apple Neural Engine's Mixed Performance for LLM Inference

BigGo Editorial Team

Apple's Neural Engine (ANE) has long been a mysterious component of Apple Silicon chips, with limited documentation and accessibility for developers. The new open-source project ANEMLL (pronounced "animal") aims to change that by providing tools for porting large language models to the ANE, but community testing reveals both advantages and significant limitations.

Performance Tradeoffs: Speed vs. Power Efficiency

Testing by community members shows that while ANE-optimized models run slower than GPU implementations, they offer remarkable power efficiency. One user reported that on an M4 Pro, a Llama 3.2-1B model achieved about 62 tokens per second while drawing only 2.8 watts. GPU implementations can be roughly twice as fast but consume approximately 20 watts, around seven times the power. This efficiency makes the ANE particularly valuable for mobile devices, where battery life is critical.
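
Because the GPU is faster as well as hungrier, the fairer comparison is energy per generated token. A quick back-of-the-envelope calculation from the reported figures (the ~124 tok/s GPU throughput is an assumption derived from the "twice as fast" claim, not a measured number):

```python
# Energy per token (joules) = power (watts) / throughput (tokens/sec).
# ANE figures are from the community benchmark; the GPU throughput of
# ~124 tok/s is assumed from the "twice as fast" claim above.
ane_watts, ane_tps = 2.8, 62
gpu_watts, gpu_tps = 20.0, 124

ane_j_per_tok = ane_watts / ane_tps   # ~0.045 J/token
gpu_j_per_tok = gpu_watts / gpu_tps   # ~0.161 J/token

print(f"ANE: {ane_j_per_tok * 1000:.0f} mJ/token")  # ANE: 45 mJ/token
print(f"GPU: {gpu_j_per_tok * 1000:.0f} mJ/token")  # GPU: 161 mJ/token
```

On those numbers, the ANE spends roughly a third of the energy per token, which is the tradeoff the community quote later in this article describes.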

However, direct comparisons between ANEMLL and other frameworks like MLX show significant performance gaps. A benchmark running DeepSeek R1-8B on an M4 Max showed ANEMLL achieving only 9.3 tokens per second compared to MLX's 31.33 tokens per second for an 8-bit quantized version. This performance difference raises questions about whether the power savings justify the speed reduction for most use cases.

Performance Comparison: ANEMLL vs MLX on M4 Max

Framework      Model            Performance      Memory Usage
ANEMLL         DeepSeek R1-8B   9.3 tok/sec      ~500 MB
MLX (8-bit)    DeepSeek R1-8B   31.33 tok/sec    ~8.5 GB
MLX (bf16)     DeepSeek R1-8B   27.17 tok/sec    ~15.7 GB
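
For context on how such MLX numbers are typically produced: the mlx-lm package reports tokens-per-second when generation is run with verbose output. A minimal sketch follows; the model path is illustrative, since the exact checkpoint used in the benchmark isn't specified.

```python
# pip install mlx-lm  (requires Apple Silicon)
from mlx_lm import load, generate

# Illustrative model path; the benchmark above used an 8-bit
# DeepSeek R1-8B build, whose exact repo name isn't given.
model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Llama-8B-8bit")

# verbose=True prints prompt and generation tokens-per-second stats.
text = generate(model, tokenizer,
                prompt="Explain the Apple Neural Engine.",
                max_tokens=256, verbose=True)
```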

Power Efficiency Comparison

Hardware             Model           Performance   Power Usage
M1 Max (ANE)         Llama 3.2-1B    47 tok/sec    ~1.8 watts
M4 Pro (ANE)         Llama 3.2-1B    62 tok/sec    ~2.8 watts
GPU implementation   Similar models  ~2x faster    ~20 watts

Memory Efficiency and Technical Limitations

One surprising advantage of ANEMLL appears to be memory efficiency. The same benchmark that showed slower performance also revealed dramatically lower memory usage: approximately 500 MB for ANEMLL compared with 8.5 GB for MLX's 8-bit model. This efficiency could make ANE implementations particularly valuable for running models on devices with limited memory, such as iPhones and iPads.

The technical challenges of working with the ANE stem from its hardware constraints. Unlike GPUs, the ANE requires fixed input/output shapes, which makes dynamic operations such as growing attention caches difficult. It also supports only FP16 precision (not BF16), which can lead to activation overflow issues. Developers have had to implement creative workarounds, such as using conv2d operations in place of linear layers and adopting sliding-window approaches for key-value caches.
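
To make those workarounds concrete, here is a minimal PyTorch-style sketch of both techniques. It illustrates the general ideas rather than ANEMLL's actual code, and every name in it is hypothetical.

```python
import torch
import torch.nn as nn

class ANEFriendlyLinear(nn.Module):
    """Linear layer expressed as a 1x1 conv2d, the kind of substitution
    used to map transformer layers onto ANE-preferred operations.
    (Hypothetical name; not ANEMLL's actual implementation.)"""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.conv = nn.Conv2d(in_features, out_features, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, in) -> (batch, in, 1, seq) for conv2d, then back.
        x = x.transpose(1, 2).unsqueeze(2)
        return self.conv(x).squeeze(2).transpose(1, 2)

def slide_kv_cache(cache: torch.Tensor, new_step: torch.Tensor) -> torch.Tensor:
    """Fixed-shape sliding-window cache update: drop the oldest position
    and append the newest, so the tensor shape never changes.
    cache: (batch, window, dim); new_step: (batch, 1, dim)."""
    return torch.cat([cache[:, 1:], new_step], dim=1)
```

The point of both tricks is the same: every tensor keeps a static shape from one decoding step to the next, which is what the ANE's fixed input/output shape requirement demands.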

Apple's Closed Ecosystem Approach

The community discussion reveals frustration with Apple's approach to AI acceleration. Despite Apple's own research papers claiming significant performance improvements for ANE-optimized models, the company has provided limited documentation and tools for developers. Even Apple's own MLX framework doesn't support the ANE, raising questions about the company's strategy.

Some commenters have drawn parallels to Qualcomm's NPU in Snapdragon X laptops, suggesting that hardware manufacturers often oversell the capabilities of their neural processing units for AI workloads. The reality is that these specialized chips excel at specific, limited tasks but may not deliver the promised performance for the large models users actually want to run.

As one community member noted:

The key benefit is significant lower power usage. Benchmarked llama3.2-1B on my machines; M1 Max (47t/s, ~1.8 watts), M4 Pro (62t/s, ~2.8 watts). The GPU is twice as fast (even faster on the Max), but draws much more power (~20 watts) vs the ANE.

The ANEMLL project represents an important step toward making Apple's Neural Engine accessible to developers, but its current performance characteristics suggest it is most valuable for use cases that prioritize power efficiency over raw speed. As Apple evolves its hardware with newer M-series chips, the balance of capabilities among the ANE, CPU, and GPU may shift, potentially making the Neural Engine more competitive for general AI workloads.

Reference: ANEMLL