Stream: Coq devs & plugin devs

Topic: Test-suite timeout in the CI


view this post on Zulip Pierre-Marie Pédrot (Aug 18 2022 at 11:10):

Recently, it seems that the test-suite:base CI job tends to time out fairly regularly. This is weird, given that the time limit is set to 3h, which looks like a lot to me. On my not especially fast machine, the test-suite takes about 10 minutes with -j4. Does anybody know why we see this?

view this post on Zulip Enrico Tassi (Aug 18 2022 at 12:26):

I did see the async job get stuck as well. I could not figure out which file it was. Maybe it is the same issue, just more frequent on async.

view this post on Zulip Pierre-Marie Pédrot (Aug 18 2022 at 12:28):

Async always times out, because it's really slower in practice.

view this post on Zulip Enrico Tassi (Aug 18 2022 at 12:34):

I did raise the timeout by 1h and it gave the very same log. Yes it is slower, but not that much.

view this post on Zulip Pierre-Marie Pédrot (Aug 18 2022 at 13:05):

Maybe we should have more info in the test-suite log, e.g. the time spent in each test.

view this post on Zulip Enrico Tassi (Aug 18 2022 at 14:46):

The real problem is -j2: with two tests running in parallel, it is hard to find which test is stuck.
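A minimal sketch of the kind of per-test logging discussed above, not the test-suite's actual Makefile machinery. The function name `run_timed` and the `START`/`DONE` line format are hypothetical. The idea is to print the test name before launching it and again with the elapsed time afterwards, so that a stuck test can be identified even when the output of two parallel jobs is interleaved.

```python
import subprocess
import sys
import time


def run_timed(name, cmd, timeout_s=None, log=sys.stdout):
    """Run one test command and log its name before and after, with elapsed time.

    `name` and the START/DONE format are illustrative, not the Coq test-suite's.
    """
    # Logged before the test starts: a START without a matching DONE is a stuck test.
    print(f"START {name}", file=log, flush=True)
    t0 = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is left running.
        status = "timeout"
    elapsed = time.monotonic() - t0
    print(f"DONE  {name} {status} {elapsed:.1f}s", file=log, flush=True)
    return status, elapsed
```

With a per-test timeout much shorter than the 3h job limit, a hanging file would show up as a `timeout` line instead of silently consuming the whole job.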

view this post on Zulip Michael Soegtrop (Aug 19 2022 at 07:44):

It might also make sense to keep an eye on memory consumption. CI runner speed can vary by an order of magnitude if memory demands are high: a job might run just fine one day and swap heavily another day, when memory pressure is high from other jobs sharing the same machine. It would also make sense to run jobs with high memory consumption sequentially.

view this post on Zulip Michael Soegtrop (Aug 19 2022 at 07:45):

(or in parallel with jobs with low memory consumption).
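One way to check the memory-pressure hypothesis would be to log the runner's available memory and swap usage at intervals during the job. The sketch below assumes a Linux runner exposing `/proc/meminfo`; the function names are hypothetical, and the parser takes the file contents as a string so it can be tried on any input.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_snapshot(text):
    """Return available memory and swap currently in use, both in kB."""
    info = parse_meminfo(text)
    return {
        "available_kb": info.get("MemAvailable"),
        "swap_used_kb": info["SwapTotal"] - info["SwapFree"],
    }
```

On a runner this could be called as `memory_snapshot(open("/proc/meminfo").read())` before and after each test; a growing `swap_used_kb` during slow runs would support the swapping theory.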

view this post on Zulip Pierre-Marie Pédrot (Aug 19 2022 at 11:31):

Swapping would explain a lot.

view this post on Zulip Maxime Dénès (Sep 01 2022 at 10:12):

Aren't all jobs getting stuck on SchemeEquality.v right now? https://gitlab.com/coq/coq/-/jobs/2964549274

view this post on Zulip Enrico Tassi (Sep 01 2022 at 10:15):

It is the first time I have seen one of these stuck logs end in a TEST line, so maybe we now have a place to start debugging the issue.

view this post on Zulip Gaëtan Gilbert (Sep 01 2022 at 11:29):

a normal test suite run should be like https://gitlab.com/coq/coq/-/jobs/2930430420 (40min)
sometimes it takes much longer but still succeeds https://gitlab.com/coq/coq/-/jobs/2954469239 (157min)
and sometimes it times out (180min)
I don't have any clue why

view this post on Zulip Maxime Dénès (Sep 01 2022 at 12:00):

Aren't we getting close to a memory limit? All pending jobs I observed were spending a long long time on SchemeEquality.v.

view this post on Zulip Théo Zimmermann (Sep 01 2022 at 12:05):

@Jason Gross has some code to gather data on past CI runs, we could use this to check if there is a recent trend toward longer test-suite builds...

view this post on Zulip Maxime Dénès (Sep 01 2022 at 12:14):

There must be something wrong with this particular file

view this post on Zulip Enrico Tassi (Sep 01 2022 at 12:17):

There is a large case taken from Elpi's test suite...

view this post on Zulip Pierre-Marie Pédrot (Sep 01 2022 at 12:22):

I recently stumbled upon a blowup in scheme creation due to a naive algorithm; maybe it's related. I even have a patch somewhere on my drive.

view this post on Zulip Gaëtan Gilbert (Sep 01 2022 at 12:45):

Running it locally, the "Scheme Equality for large." command takes single-digit seconds and doesn't seem to spike memory; that doesn't seem like enough to break CI.
maybe something in native compile?

view this post on Zulip Pierre-Marie Pédrot (Sep 01 2022 at 12:51):

Aren't tests run with -j4? The latest theory in vogue was that it would just reach the worker's low memory limit and start swapping.

view this post on Zulip Gaëtan Gilbert (Sep 01 2022 at 12:52):

-j2

view this post on Zulip Jason Gross (Sep 01 2022 at 13:45):

The data collection scripts are available in https://github.com/JasonGross/coq-bug-minimizer-paper/tree/main/presentation but are not particularly well-organized. I can give more instructions on running them if you want to collect the relevant data

view this post on Zulip Paolo Giarrusso (Sep 01 2022 at 18:43):

With TIMED=1 (and maybe tweaking the exact TIMECMD) you already get memory use and user/real/system timing; one could also ask for page faults since you suspect swapping...
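For illustration, here is a stand-alone sketch of collecting the same kind of figures (real/user/system time, peak memory, page faults) for a single command. It is not how `TIMED=1` or `TIMECMD` work internally; it uses Python's `os.wait4`, which returns the resource usage of one specific child process, and assumes a Unix system with Python 3.9 or later. The function name `measure` is hypothetical.

```python
import os
import sys
import time


def measure(cmd):
    """Run cmd (a list of strings) and return its timing, peak memory and page faults."""
    t0 = time.monotonic()
    pid = os.posix_spawnp(cmd[0], cmd, os.environ)
    # wait4 reports the rusage of this child only, unlike RUSAGE_CHILDREN totals.
    _, status, ru = os.wait4(pid, 0)
    real = time.monotonic() - t0
    return {
        "exit": os.waitstatus_to_exitcode(status),
        "real_s": real,
        "user_s": ru.ru_utime,
        "sys_s": ru.ru_stime,
        "max_rss": ru.ru_maxrss,        # kilobytes on Linux, bytes on macOS
        "major_faults": ru.ru_majflt,   # faults that required I/O, e.g. swapped-out pages
        "minor_faults": ru.ru_minflt,   # faults served without I/O
    }
```

A test whose real time is far larger than user plus system time, together with a high major fault count, would be a strong hint that the runner is swapping.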


Last updated: Feb 01 2023 at 16:03 UTC