Performance Experiments are Hard

Running good experiments is hard, as the extensive writing on the reproducibility crisis across scientific fields makes clear. Performance experiments in computer science are no exception: it is all too easy to misinterpret the results or to present data that hides part of the story.

I find it interesting that several subfields of computer science have papers about benchmarking pitfalls specific to their domain, and I wanted to start compiling a list of them for myself. Maybe it's useful for you too.

Systems Programming

A note on performance counters: depending on how paranoid you are, you may want to double-check that the events your tool advertises are actually listed in your processor manual. For example, perf list tells me I can measure l2_latency.l2_cycles_waiting_on_fills, even though I couldn't find that event anywhere in my Zen 3 manual. It seems the PR that added support for the event did so because the measured count was non-zero, not because the event was documented. It was submitted and approved by AMD employees, so I might just be overly cautious, but I personally wouldn't feel comfortable basing any rigorous measurement on that event (and if it is valid, I'd wonder why the documentation was never revised to include it).
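
If you do decide to trust an event, one way to stay closer to the manual is to program the counter from its raw event code rather than a tool's symbolic name. Below is a minimal sketch using the Linux perf_event_open syscall; the raw config value is a placeholder, and the exact bit layout of the encoding (event select, umask, and so on) should be taken from your processor manual and /sys/bus/event_source/devices/cpu*/format, not from this example.

    /* Sketch: count a raw PMU event by the code from the processor manual,
     * instead of relying on a symbolic event name. Placeholder values only. */
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        /* Placeholder: substitute the raw encoding for your CPU, built from
         * the event select and umask documented in the manual. */
        attr.config = 0x0000;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        /* Count the event for this process, on any CPU it runs on. */
        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* Workload under measurement (placeholder busy loop). */
        volatile unsigned long sink = 0;
        for (unsigned long i = 0; i < 100000000UL; i++) sink += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        if (read(fd, &count, sizeof(count)) != sizeof(count)) {
            perror("read");
            return 1;
        }
        printf("raw event count: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

Measuring from the raw encoding doesn't make an undocumented event any more trustworthy, but it does force you to go through the manual once, which is exactly the cross-check that symbolic names let you skip.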

Java

Networking