Recently, the test-suite:base CI job seems to time out fairly regularly. This is strange, given that the time limit is set to 3h, which looks like a lot to me: on my not especially fast machine, the test-suite takes about 10 minutes with -j4. Does anybody know why we see this?
I did see the async job get stuck as well. I could not figure out which file it was. Maybe it is the same issue, just more frequent on async.
Async always times out, because it really is slower in practice.
I did raise the timeout by 1h and it produced the very same log. Yes, it is slower, but not that much.
Maybe we should have more info in the test-suite log, e.g. the time spent in each test.
The real problem is -j2: that way it is hard to find which test is stuck.
It might also make sense to keep an eye on memory consumption: CI runner speed can vary by an order of magnitude if memory demands are high. A job might run just fine one day and swap heavily another day, when memory pressure from other jobs sharing the same machine is high. It would also make sense to run jobs with high memory consumption sequentially
(or in parallel with jobs with low memory consumption).
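Something like this would already make swapping visible in the CI log (just a sketch; the `make -C test-suite -j2` invocation and the 30s interval are guesses, adjust to the actual job script):

```sh
# Rough sketch: log memory and swap usage in the background while the test
# suite runs, so heavy swapping on a shared runner shows up in the CI log.
# The make invocation below is an assumption about how the job drives the
# suite; adjust it to the actual CI script.
( while true; do date -u +%T; free -m | sed -n '2,3p'; sleep 30; done ) &
MONITOR_PID=$!
make -C test-suite -j2
kill "$MONITOR_PID"
```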
Swapping would explain a lot.
Aren't all jobs getting stuck on SchemeEquality.v right now? https://gitlab.com/coq/coq/-/jobs/2964549274
It is the first time I see one of these stuck logs ending in a TEST line, so maybe we now have a place to start debugging the issue.
a normal test suite run should be like https://gitlab.com/coq/coq/-/jobs/2930430420 (40min)
sometimes it takes much longer but still succeeds https://gitlab.com/coq/coq/-/jobs/2954469239 (157min)
and sometimes it times out (180min)
I don't have any clue why
Aren't we getting close to a memory limit? All pending jobs I observed were spending a long, long time on SchemeEquality.v.
@Jason Gross has some code to gather data on past CI runs, we could use this to check if there is a recent trend toward longer test-suite builds...
There must be something wrong with this particular file
There is a large case taken from Elpi's test suite...
I recently stumbled upon a blowup in scheme creation due to some naive algorithmics; maybe it's related. I even have a patch somewhere on my drive.
Running it locally, the Scheme Equality for large takes single-digit seconds and doesn't seem to spike memory. That doesn't seem like enough to break CI.
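For reference, here is roughly how to reproduce that measurement locally (just a sketch; the path to SchemeEquality.v is a guess, adjust it to wherever the file lives in the test suite):

```sh
# Sketch: run the suspect file under GNU time -v to get wall-clock time,
# peak memory (maximum resident set size) and major/minor page faults.
# The file path is an assumption; point it at the actual SchemeEquality.v.
# It may also need the test suite's usual -Q/-R options to compile standalone.
/usr/bin/time -v coqc test-suite/success/SchemeEquality.v
```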
maybe something in native compile?
Aren't tests run with -j4? The latest theory en vogue was that it would barely reach the low memory limit of the worker and start swapping.
-j2
The data collection scripts are available in https://github.com/JasonGross/coq-bug-minimizer-paper/tree/main/presentation but are not particularly well-organized. I can give more instructions on running them if you want to collect the relevant data
With TIMED=1 (and maybe tweaking the exact TIMECMD) you already get memory use and user/real/system timing; one could also ask for page faults since you suspect swapping...
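Something like this, assuming the test-suite makefile honors the same TIMED/TIMECMD hooks as the main build (the exact format string is just an example):

```sh
# Sketch: use GNU time as TIMECMD so each test also reports peak memory and
# page faults. %e/%U/%S are real/user/sys seconds, %M is max RSS in KB,
# %F/%R are major/minor page faults.
make -C test-suite -j2 TIMED=1 \
  TIMECMD='/usr/bin/time -f "%C  %es real  %Us user  %Ss sys  %M KB maxrss  %F/%R faults"'
```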