Async failure · Coq devs & plugin devs

Stream: Coq devs & plugin devs

Topic: Async failure

Théo Zimmermann (Sep 17 2020 at 08:14):

Hello @Enrico Tassi, PR coq/coq#13039 had a failure of the async job with:

Makefile.build:866: recipe for target 'theories/ZArith/Znumtheory.vo' failed
make[1]: *** [theories/ZArith/Znumtheory.vo] Segmentation fault (core dumped)

(see https://gitlab.com/coq/coq/-/jobs/741853901)

Are you interested in being notified about those failures?

Enrico Tassi (Sep 17 2020 at 08:16):

Yes thanks

Enrico Tassi (Sep 17 2020 at 08:39):

I can't find the value of NJOBS in general. I see bedrocks gets a 1, but what is the default?
My problem is that I cannot reproduce it locally, and I start to think that these core dumps are actually about an out of memory. shared runners have 4G ram and async-proofs on potentially starts 2 coq processes per file with the same memory footprint.
How many processe are we running?

Théo Zimmermann (Sep 17 2020 at 09:14):

If you go to the top of the log, you get a lot of info.

Théo Zimmermann (Sep 17 2020 at 09:14):

In particular, NJOBS=2

Théo Zimmermann (Sep 17 2020 at 09:15):

cpu cores : 1

Paolo Giarrusso (Sep 17 2020 at 09:30):

FWIW, the out-of-memory killer shouldn’t use SIGSEGV — but some code segfaults when allocation fails is very plausible (such bugs are really easy in systems code)

Paolo Giarrusso (Sep 17 2020 at 09:30):

(I remember a few in the GHC runtime system)

Paolo Giarrusso (Sep 17 2020 at 09:31):

but the end result is probably the same — if this were a Coq bug, I don’t expect coq should fix it (if it were in OCaml, OTOH)

Enrico Tassi (Sep 17 2020 at 11:02):

As usual re-running the job on pyrolise that has 32G memory made it pass; having 4 coq processes exhaust 4G is plausible.

I'd go for NJOBS=1 (which will also make things more reproducible) but then I guess I should also increase the timeout a little.
And add a call to dmesg on failure if possible (is there a kwd for that?). Are you OK with this plan @Théo Zimmermann ?

Enrico Tassi (Sep 17 2020 at 11:03):

Maybe I should just add dmesg to validate the thesis.

Théo Zimmermann (Sep 17 2020 at 11:04):

Yes, that sounds good.

Théo Zimmermann (Sep 17 2020 at 11:04):

I believe that after_script is run in all cases, failure or not.

Théo Zimmermann (Sep 17 2020 at 11:04):

Let me check.

Théo Zimmermann (Sep 17 2020 at 11:04):

Yes: https://docs.gitlab.com/ee/ci/yaml/#before_script-and-after_script

Enrico Tassi (Sep 17 2020 at 11:13):

I've opened a PR

Théo Zimmermann (Dec 10 2022 at 21:23):

My current backport batch keeps failing in the test-suite:base+anync job with:

==========> TESTING bugs/bug_16803.v <==========
Error: The command has not failed!
0m0.000000s 0m0.000000s
0m2.670000s 0m1.190000s
==========> FAILURE <==========
    bugs/bug_16803.v...Error! (bug seems to be opened, please check)
FAILURES
    bugs/bug_16803.v...Error! (bug seems to be opened, please check)

Any idea what could be going on? The diff is here: https://github.com/coq/coq/compare/v8.17...Zimmi48:coq:staging-v8.17
I don't think that anything in it could be responsible for this issue.

Gaëtan Gilbert (Dec 10 2022 at 21:41):

https://github.com/coq/coq/pull/16952 ?

Gaëtan Gilbert (Dec 10 2022 at 21:41):

yeah the test was added in https://github.com/coq/coq/pull/16810 so 16952 needs backporting

Théo Zimmermann (Dec 11 2022 at 18:03):

Thanks!

Last updated: Apr 20 2024 at 00:02 UTC