A software engineer's tale of optimization gone wrong has sparked widespread discussion about the critical importance of realistic benchmark data in performance testing. The developer spent two weeks crafting a hand-optimized assembly implementation that delivered impressive 4x speedups in testing, only to discover it provided zero benefit in production due to fundamentally flawed benchmark assumptions.
The Pitfall of Random Data in Performance Testing
The core issue stemmed from using uniformly random numbers to test a Varint encoding optimization. While this approach seemed logical for comprehensive coverage, it created a completely unrealistic scenario: uniformly random 64-bit numbers are overwhelmingly large, with 50% requiring the maximum 10 bytes to encode and nearly all the rest needing 8-9 bytes. The benchmark was therefore exercising the algorithm's worst-case performance rather than typical usage patterns.
The community has highlighted how this represents a broader challenge in performance optimization work. Creating representative test scenarios and implementing them correctly in microbenchmarks are both extremely difficult tasks, and failures in this area can be hard to detect until significant time has been invested.
Varint (Variable-length integer): A compact encoding method that uses 1-10 bytes to represent integers, with smaller numbers requiring fewer bytes
Varint Encoding Size Distribution for Random 64-bit Numbers:
- 50% require 10 bytes (maximum)
- ~49.6% require 9 bytes
- ~0.38% require 8 bytes
- ~0.003% require 7 bytes
- ~0.00000000000009% require 2 bytes
- ~0.0000000000000006% require 1 byte
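The distribution above follows directly from the encoding: a value needs k bytes when it fits in 7×k payload bits, so each extra byte covers 128 times more values than all shorter encodings combined. A minimal sketch deriving the table analytically (the function names here are illustrative, not from the article):

```python
def varint_size(n: int) -> int:
    """Number of LEB128-style varint bytes needed to encode n (0 <= n < 2**64)."""
    size = 1
    while n >= 0x80:  # 7 payload bits per byte
        n >>= 7
        size += 1
    return size

def size_distribution():
    """Probability that a uniform random 64-bit value needs exactly k bytes."""
    total = 2 ** 64
    dist = {}
    lo = 0
    for k in range(1, 11):
        hi = min(2 ** (7 * k), total)  # count of values needing <= k bytes
        dist[k] = (hi - lo) / total
        lo = hi
    return dist

if __name__ == "__main__":
    for k, p in size_distribution().items():
        print(f"{k:2d} bytes: {p:.2%}")
```

Running this reproduces the table: 10 bytes for exactly half of all values, and vanishingly small probabilities for 1-2 byte encodings.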
Real-World Data Distributions Tell a Different Story
The revelation came when examining actual production data, which revealed that real-world numbers tend to be dramatically smaller than random test data would suggest. Looking around at everyday numbers - page counts, reference numbers, product codes - most fit comfortably within just two bytes. This fundamental difference explains why Varint encoding exists in the first place: to efficiently handle the small numbers that dominate real applications.
This phenomenon connects to broader mathematical principles like Benford's Law, which observes that in many real-world datasets, smaller leading digits appear more frequently than larger ones. The mismatch between mathematical randomness and practical data distributions represents a common trap for developers optimizing algorithms.
Varint Encoding Format:
- Numbers < 128: 0nnnnnnn (1 byte)
- Numbers < 16,384: 1nnnnnnn 0nnnnnnn (2 bytes)
- Numbers < 2,097,152: 1nnnnnnn 1nnnnnnn 0nnnnnnn (3 bytes)
- 32-bit integers: 1-5 bytes required
- 64-bit integers: 1-10 bytes required
- Used in: Google Protobuf, Apache Thrift, WASM, DWARF
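The format above maps to a short encode/decode pair: each byte carries 7 payload bits, with the high bit set on every byte except the last. A minimal sketch of the standard LEB128-style scheme (not the hand-optimized assembly from the article):

```python
def varint_encode(n: int) -> bytes:
    """Encode a non-negative integer, least-significant 7-bit group first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # continuation byte: high bit set
        n >>= 7
    out.append(n)  # final byte: high bit clear
    return bytes(out)

def varint_decode(data: bytes) -> int:
    """Decode a varint from the start of a byte sequence."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):  # high bit clear marks the last byte
            break
    return result
```

For example, 300 encodes to the two bytes `ac 02`, the canonical example from the Protobuf documentation.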
The Challenge of Representative Benchmarking
The discussion has revealed that even experienced developers struggle with creating meaningful performance tests. Some organizations have explored using production data profiles or applying filters to real user data for more accurate benchmarking, though these approaches face their own practical challenges.
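One way to see the trap concretely is to run the same encoder against both input distributions. A hypothetical sketch, assuming a plain Python encoder and synthetic data (the realistic distribution here just mimics "small everyday numbers"; it is not the production profile from the article):

```python
import random
import timeit

def varint_encode(n: int) -> bytes:
    """LEB128-style varint encode: 7 payload bits per byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

random.seed(0)
# Uniform random 64-bit values: almost all need 9-10 bytes.
uniform = [random.getrandbits(64) for _ in range(10_000)]
# Stand-in for real-world data: values that fit in 1-2 bytes.
realistic = [random.randrange(0, 16_384) for _ in range(10_000)]

for name, data in [("uniform 64-bit", uniform), ("realistic small", realistic)]:
    t = timeit.timeit(lambda: [varint_encode(n) for n in data], number=10)
    avg = sum(len(varint_encode(n)) for n in data) / len(data)
    print(f"{name}: {t:.3f}s for 10 passes, avg {avg:.1f} bytes/value")
```

An optimization tuned on the first dataset spends its effort on 9-10 byte encodings that the second dataset, like most production data, almost never produces.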
Identifying a representative usage scenario to optimize towards and then implementing that scenario in a microbenchmark test driver are both massively difficult to get right.
The story serves as a reminder that impressive benchmark numbers mean nothing without realistic test conditions. While the developer's technical achievement in creating a 4x faster algorithm was genuine, the optimization solved a problem that didn't exist in practice. The experience ultimately became valuable as a proof-of-concept for custom JIT optimizations, even though the specific implementation was rolled back.
This case highlights why successful performance optimization requires as much attention to measurement methodology as to the actual code improvements. Without representative data, even the most sophisticated optimizations can become elaborate solutions to non-existent problems.
Reference: That time I wasted weeks hand optimizing assembly because I benchmarked on random data
