Performance Experiments are Hard
Running good experiments is hard. The numerous resources you can find online about the reproducibility crisis in various scientific fields exemplify this. Naturally, running performance experiments in computer science is also hard, and it is all too easy to misinterpret the results or to present data that hides the complete story.
I found it interesting that multiple fields of computer science have papers about benchmarking pitfalls in their domains, and I wanted to start compiling a list for myself. Maybe it's useful for you too.
Systems Programming
- Producing Wrong Data Without Doing Anything Obviously Wrong!
- Systems Benchmarking Crimes
- Can Hardware Performance Counters be Trusted?
- Nondeterminism in Hardware Performance Counters
A note on performance counters: depending on how paranoid you are, you should double-check that the performance counters your tool advertises are actually listed in your processor manual. For example, perf list advertises that I can measure l2_latency.l2_cycles_waiting_on_fills, even though I couldn't find that event anywhere in my Zen3 manual. It seems the PR that added support for that event did so because the measured count was non-zero, even though the event wasn't listed in the manual. The PR was submitted and approved by AMD employees, so I might just be overly cautious, but I personally wouldn't feel comfortable using that event for any rigorous measurement (and if it is a valid event, I'd wonder why the documentation was never revised to include it).
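If you want to automate that sanity check, something like the sketch below is the idea: scrape the dotted vendor PMU event names out of perf list and diff them against a file of event names you transcribed from the manual yourself. This is only a sketch under assumptions, not a polished tool: manual_events.txt is a hypothetical hand-made file, and the parsing assumes the usual indented perf list layout, which varies between perf versions.

```python
#!/usr/bin/env python3
"""Rough cross-check of the vendor PMU events advertised by `perf list`
against a hand-made list of event names copied from the processor manual."""

import re
import subprocess

# Hypothetical file: one event name per line, transcribed by hand from the
# vendor's PMC documentation. perf does not provide this for you.
MANUAL_EVENTS_FILE = "manual_events.txt"


def perf_pmu_events() -> set[str]:
    """Heuristically pull vendor PMU event names (the dotted ones, e.g.
    l2_latency.l2_cycles_waiting_on_fills) out of `perf list` output.
    The exact formatting differs between perf versions, so treat this
    parsing as best-effort."""
    out = subprocess.run(["perf", "list"], capture_output=True, text=True,
                         check=True).stdout
    names = set()
    for line in out.splitlines():
        # Match an indented dotted event name, either alone on its line or
        # followed by a bracketed tag like "[Kernel PMU event]".
        m = re.match(r"\s+([a-z0-9_]+\.[a-z0-9_.]+)\s*(\[|$)", line)
        if m:
            names.add(m.group(1))
    return names


def manual_events(path: str = MANUAL_EVENTS_FILE) -> set[str]:
    """Event names I actually found in the manual, one per line."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


if __name__ == "__main__":
    undocumented = perf_pmu_events() - manual_events()
    for name in sorted(undocumented):
        print(f"advertised by perf but not in my manual notes: {name}")
```

Anything this flags isn't necessarily bogus, but it's a prompt to go read the manual and the kernel's event definitions more carefully before building a measurement on top of it.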
Java
- Statistically rigorous Java performance evaluation
- The DaCapo Benchmarks: Java Benchmarking Development and Analysis
- Don’t Trust Your Profiler: An Empirical Study on the Precision and Accuracy of Java Profilers