is it a thing that ci-bedrock2 sometimes fails cause coqc gets killed?
is there a heat wave where the servers are located or what?
https://github.com/coq/coq/pull/14690/checks?check_run_id=3124940152
this is happening on several unrelated PRs
maybe related to bedrock signal 9 from outer space?
It must be Bezos sending it
It seems that there is a memory problem with these devs. I don't know if we introduced a Coq patch that increased the memory consumption or if there is something about these devs that was changed recently, but it should be fixed quickly.
cc @Jason Gross since two of the failing devs are engine-bench and fiat-crypto
the failures are either signal 9 (probably the kernel firing lasers after OOM) or error 137 (which seems to be Docker's way of signaling OOM)
No, 137 means that Docker itself (rather than the containerized program) received signal 9. This might still be due to the OOM-killer, but I do not understand why Docker itself would use tons of memory.
engine-bench has had literally no recent changes, I believe. I suspect fiat-crypto has not had any relevant recent changes. What files OOM? Can you look at max memory differences between 8.13 and master or something and then bisect?
137 errors are errors at the level of GitLab CI (my hypothesis is a free runner being preempted for something else). In principle, those are automatically restarted by coqbot. If that's not the case, please open an issue with an example.
ci-engine_bench seems to reliably throw 137 errors, see e.g. https://gitlab.com/coq/coq/-/jobs/1443507444
constr_eq.vo (real: 104.20, user: 13.12, sys: 1.74, mem: 3018040 ko); 3GB of memory usage? No wonder it fails.
I thought we had a few files in the ci that reached about 4GB of memory and that it was considered normal
That would be surprising. Gitlab's shared runners are Google's N1 runners, which have 3GB of free RAM.
side question: why is a job called *bench in the CI (and not only in the bench)?
I guess that's because you want it to be checked for compatibility so that it works when you want to run the bench...
Pierre-Marie Pédrot said:
ci-engine_bench seems to reliably throw 137 errors, see e.g. https://gitlab.com/coq/coq/-/jobs/1443507444
Ah OK, then that's an error thrown directly when executing coqc
and this is not what I had in mind.
Enrico Tassi said:
side question: why is a job called *bench in the CI (and not only in the bench)?
what do you mean?
I mean that unless the job also tests features of Coq which are not tested by other ci jobs, it should not be there in the first place. If it tests the performance of Coq, it should just be run in the bench.
@Théo Zimmermann if you want the job not to bitrot, you can then provide a quick target which just checks the ML code (if any) compiles.
also tests features of Coq which are not tested by other ci jobs
That's hard to predict though...
I agree that in general this is hard. But foobar_bench seems to be a job designed to test the perf of something, not its functional behavior. So I doubt it tests a dark corner of the system. Maybe the name is misleading, I don't know.
For sure, if you need to test a functionality, it is very likely you can come up with an example which consumes less than 3G of ram ;-)
I personally don't have time (nor proper infrastructure) to easily bisect the memory increase, it'd be nice if somebody could do that.
it's a recent thing, like ~3 or 4 days ago.
(my laptop is already heavily suffering in web browsing mode due to ~40°C outside heat, I am afraid that fiat-crypto is beyond its physical capacities)
engine-bench-lite hasn't changed max memory stats in our bench logs for a while
so why is it failing with SIGKILL now?
Guillaume Melquiond said:
constr_eq.vo (real: 104.20, user: 13.12, sys: 1.74, mem: 3018040 ko); 3GB of memory usage? No wonder it fails.
Maybe one should call optimize_heap in ci-engine_bench, in constr_eq.v, between the individual check X000%Z. calls. This reduces memory consumption on my machine by half.
It also makes the timing more reliable; I'm surprised it is not there already.
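For concreteness, a minimal, self-contained sketch of what such a change could look like; the step_* definitions below are only lightweight stand-ins I made up for the heavy check N%Z. steps that actually live in constr_eq.v, which are not reproduced here:
```coq
(* Hedged sketch, not the actual engine-bench code: the real constr_eq.v
   repeats heavy check N%Z. steps; the Definitions below are only lightweight
   stand-ins.  The point is where Optimize Heap. (Coq's vernacular hook into
   OCaml's heap compaction, i.e. Gc.compact) would be inserted. *)
Require Import ZArith.
Open Scope Z_scope.

Definition step_1000 := Eval vm_compute in Z.pow 2 20. (* stand-in for check 1000%Z. *)
Optimize Heap. (* compact and release memory before the next step runs *)

Definition step_2000 := Eval vm_compute in Z.pow 2 21. (* stand-in for check 2000%Z. *)
Optimize Heap.
```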
It would be even better to split the file then, and let the OS "manage" the heap.
I'm not familiar with the project (I also do not know how the results are evaluated wrt. refactoring / additional commands). I'm just annoyed with the current state of the CI. @Enrico Tassi would you care to create a small optimize_heap PR?
We could also remove that job from the CI (it is much quicker) and ask its author to fix it
@Pierre-Marie Pédrot I did look at the PRs which were merged in the last 3 days and nothing suspicious showed up
It's a bit weird that the memory is not already reclaimed between tactic invocations in the constr_eq file.
I don't see what is keeping it alive a priori.
Pierre-Marie Pédrot said:
I don't see what is keeping it alive a priori.
Maybe it is worth invoking Gc.minor (or even Gc.major) at some key points?
We tried once to call Gc.major at every line; it was terrible performance-wise.
Pierre-Marie Pédrot said:
I don't see what is keeping it alive a priori.
It's not alive, it's just that OCaml's GC compacts the heap only when the pressure is high. And that is expensive. So, in benchmarks, you should call it by hand at points you are not measuring; otherwise, you may count as part of one command the time it takes to compact the heap after the execution of another command, which skews your measurement.
And if you run with -j 2, then one job which is using 3GB may be paired with another one which suddenly uses 1GB; the GC policy is per process, so maybe neither compacts and the kernel decides to kill one.
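As a sketch of the measurement-hygiene point above (again with lightweight stand-in workloads I made up, not the real benchmark), the compaction is triggered outside the timed command, so Time only measures the step itself:
```coq
(* Hedged sketch with stand-in workloads: compact the heap *before* each timed
   step, so the Time of one step does not include GC/compaction work left over
   from the previous one. *)
Require Import ZArith.
Open Scope Z_scope.

Optimize Heap. (* untimed: pay the compaction cost outside the measurement *)
Time Definition s1 := Eval vm_compute in Z.pow 2 22.

Optimize Heap.
Time Definition s2 := Eval vm_compute in Z.pow 2 23.
```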
I'll open a PR to turn the job off
We can also just add optimize_heap calls in the test; I have written the patch already.
https://github.com/mit-plv/engine-bench/pull/2
it's still weird that other devs are also failing with memory errors, something must have happened
we can't just disable all those devs in the ci...
Command terminated by signal 9
.../bedrock2/.../FlatToRiscvFunctions.vo (real: 743.93, user: 730.77, sys: 2.26, mem: 3226360 ko)
BTW, a while ago I'm pretty sure these shared runners had 4GB of RAM. Am I dreaming?
No, as I wrote earlier, Gitlab's shared runners are Google's N1 runners, so they have 3GB of free RAM. (More precisely, they have 6.5GB shared between several runners, at least two.)
For a quick fix, a few optimize_heap strategically placed in FlatToRiscvFunctions.v should reduce memory consumption below 3GB.
Pierre-Marie Pédrot said:
Peak memory usage (possibly the last check 9000%Z.) is still too much for the CI despite intermediate heap optimization.
The category theory CI job is failing everywhere, as well as the windows64 one
@John Wiegley
ah but I think this is on our CI side
https://github.com/coq/coq/runs/3496327745
weird interaction with equations
anybody understand what is going on?
https://github.com/coq/coq/pull/14839
I wasn't aware that there was also an issue on our side, but it seems that this is not the only issue.
Cf. https://coq.zulipchat.com/#narrow/stream/237655-Miscellaneous/topic/category-theory.20library.20in.20Coq's.20CI
and: https://github.com/jwiegley/category-theory/commit/7010b266cf3b1b1f7970a0aba8521b7716dee363
and: https://github.com/Zimmi48/category-theory/runs/3493747463?check_suite_focus=true
Looks like test-suite:async is reliably failing on master, due to test-suite/bugs/closed/bug_14734.v
It was introduced in #14758, ping @Gaëtan Gilbert
https://github.com/coq/coq/pull/14947 and https://github.com/coq/coq/issues/14948
quickchick and coqhammer are failing on the CI
coqhammer is just waiting for overlay
QuickChick was mentioned as failing because of cppo being the wrong version
VSCoq?
Since when is VsCoq tested in our CI?
well, it's good if it's tested in some CI; due to low activity recently, I think it's not getting much testing overall (and it still uses Travis in the repo!)
it's the document manager branch, not the one people use
And this branch recently received new commits by @Maxime Dénès I think, so this may be related.
just a missing overlay
Théo Zimmermann said:
Since when is VsCoq tested in our CI?
Actually you merged its support, cf. https://github.com/coq/coq/pull/14334 ; but indeed, currently the vscoq branch doesn't meet the criteria we have in our CI. I tried to build it to test it and indeed, it doesn't work.
I did, indeed :face_palm:
As I wrote on a recent PR, @Maxime Dénès has a document-manager-new branch which should work and should be merged into the one we test when he has the time.
As of today none of them work, so that's an issue for the CI
Given the policy, if a PR author cannot build an overlay, he may need to temporarily drop that experimental branch from the CI
it does compile up to the overlay that's waiting
The vscode part doesn't compile, so it is impossible to use / test
This is probably a problem with the node dependencies / vscode version and what you have installed. Clearly a bug to be fixed, but "impossible" is a bit misleading, since I'm pretty sure it works on @Maxime Dénès's laptop (I saw it with my own eyes).
Indeed I guess there is a way to make it work, but as of today I follow the instructions and it doesn't work, thus it is not possible for me to test
Why do nix builds fail on main repo? https://github.com/coq/coq/runs/4016008438
We've been in the red for a few weeks now
https://gitlab.com/coq/coq/-/jobs/1718558727
See https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs.20.26.20plugin.20devs/topic/pkg.3Anix.3Adeploy.20unhappy. The reason for this failure is not clear. But what is clear is that these jobs only run on the master branch, so this is why it does not affect the CI of PRs.