Does the Coq project have noise measurements for the current benchmark setup? Maybe it would make sense to fit statistical models to repeated benchmarks of master vs. master to see what the noise distribution looks like. If this data does not exist, would it be possible to maybe run one benchmark of master against master (i.e. one where we expect zero difference) every night for a month or so? That should give us plenty of data and it might help with determining if small differences are actually statistically relevant.
We do occasionally bench master against master. Here is an example: https://github.com/coq/coq/pull/15735#issuecomment-1068122486. AFAIK remembering the most recent case is typically the best noise indication we have. I think that it would be a good project to do some statistical analysis of master on master benches, but I don't personally know how that can be done.
If I were to run the bench again when the workers are busy or free the noise might change significantly. But it seems to be a good rule of thumb to take +-1% on most projects.
The bench log I linked btw has the line by line timings in it (that was the point of the PR). You can see how noisy some lines in the CI can be (scroll up). https://gitlab.com/coq/coq/-/jobs/2202936719
old new diff pdiff ln file
│ 34.6800 33.7690 -0.9110 -2.63% 10 coq-fourcolor/theories/job323to383.v.html │
│ 53.9250 55.1930 1.2680 2.35% 587 coq-unimath/UniMath/SubstitutionSystems/SignatureExamples.v.html │
I think it would be entirely possible to run the bench on a weekly basis (maybe a few more times if nobody is using it) to collect some data. How useful that data will actually be is a question I don't know the answer to.
The per line summary looks super useful!
Last updated: Dec 05 2023 at 11:01 UTC