Stream: Coq devs & plugin devs

Topic: CI failing


view this post on Zulip Pierre-Marie Pédrot (Jul 21 2021 at 14:37):

is there a heat wave where the servers are located or what?
https://github.com/coq/coq/pull/14690/checks?check_run_id=3124940152

view this post on Zulip Pierre-Marie Pédrot (Jul 21 2021 at 14:37):

this is happening on several unrelated PRs

view this post on Zulip Pierre-Marie Pédrot (Jul 21 2021 at 14:38):

maybe related to bedrock signal 9 from outer space?

view this post on Zulip Enrico Tassi (Jul 21 2021 at 15:26):

It must be Bezos sending it

view this post on Zulip Pierre-Marie Pédrot (Jul 21 2021 at 19:01):

It seems that there is a memory problem with these devs. I don't know if we introduced a Coq patch that increased the memory consumption or if there is something about these devs that was changed recently, but it should be fixed quickly.

view this post on Zulip Pierre-Marie Pédrot (Jul 21 2021 at 19:03):

cc @Jason Gross since two of the failing devs are engine-bench and fiat-crypto

view this post on Zulip Pierre-Marie Pédrot (Jul 21 2021 at 19:04):

the failures are either signal 9 (probably the kernel firing lasers after OOM) or error 137 (seems to be docker way to signal OOM)

view this post on Zulip Guillaume Melquiond (Jul 21 2021 at 19:09):

No, 137 means that Docker itself (rather than the containerized program) received signal 9. This might still be due to the OOM-killer, but I do not understand why Docker itself would use tons of memory.

view this post on Zulip Jason Gross (Jul 22 2021 at 00:54):

engine-bench has had literally no recent changes, I believe. I suspect fiat-crypto has not had any relevant recent changes. What files OOM? Can you look at max memory differences between 8.13 and master or something and then bisect?

view this post on Zulip Théo Zimmermann (Jul 22 2021 at 08:13):

137 errors are errors at the level of GitLab CI (my hypothesis is a free runner being preempted for something else). In principle, those are automatically restarted by coqbot. If that's not the case, please open an issue with an example.

view this post on Zulip Pierre-Marie Pédrot (Jul 22 2021 at 09:27):

ci-engine_bench seems to reliably throw 137 errors, see e.g. https://gitlab.com/coq/coq/-/jobs/1443507444

view this post on Zulip Guillaume Melquiond (Jul 22 2021 at 09:34):

constr_eq.vo (real: 104.20, user: 13.12, sys: 1.74, mem: 3018040 ko); 3GB of memory usage? No wonder it fails.

view this post on Zulip Pierre-Marie Pédrot (Jul 22 2021 at 09:37):

I thought we had a few files in the ci that reached about 4GB of memory and that it was considered normal

view this post on Zulip Guillaume Melquiond (Jul 22 2021 at 09:38):

That would be surprising. Gitlab's shared runners are Google's N1 runners, which have 3GB of free RAM.

view this post on Zulip Enrico Tassi (Jul 22 2021 at 09:48):

side question, why a job called *bench is in CI (and not only in bench)?

view this post on Zulip Théo Zimmermann (Jul 22 2021 at 09:51):

I guess that's because you want it to be checked for compatibility so that it works when you want to run the bench...

view this post on Zulip Théo Zimmermann (Jul 22 2021 at 09:53):

Pierre-Marie Pédrot said:

ci-engine_bench seems to reliably throw 137 errors, see e.g. https://gitlab.com/coq/coq/-/jobs/1443507444

Ah OK, then that's an error thrown directly when executing coqc and this is not what I had in mind.

view this post on Zulip Gaëtan Gilbert (Jul 22 2021 at 10:03):

Enrico Tassi said:

side question, why a job called *bench is in CI (and not only in bench)?

what do you mean?

view this post on Zulip Enrico Tassi (Jul 22 2021 at 10:33):

I mean that unless the job also tests features of Coq which are not tested by other ci jobs, then it should not be there in the first place. If it tests the performance of Coq it should be just run in the bench.
@Théo Zimmermann if you want the job not to bitrot, you can then provide a quick target which just checks the ML code (if any) compiles.

view this post on Zulip Théo Zimmermann (Jul 22 2021 at 10:34):

also tests features of Coq which are not tested by other ci jobs

That's hard to predict though...

view this post on Zulip Enrico Tassi (Jul 22 2021 at 12:09):

I agree that in general this is hard. But foobar_bench seems a job designed to test the perf of something, not its functional behavior. So I doubt it tests a dark corner of the system. Maybe the name is misleading, I don't know.

view this post on Zulip Enrico Tassi (Jul 22 2021 at 12:10):

For sure, if you need to test a functionality, it is very likely you can come up with an example which consumes less than 3G of ram ;-)

view this post on Zulip Pierre-Marie Pédrot (Jul 22 2021 at 14:51):

I personally don't have time (nor proper infrastructure) to easily bisect the memory increase, it'd be nice if somebody could do that.

view this post on Zulip Pierre-Marie Pédrot (Jul 22 2021 at 14:51):

it's a recent thing, like ~3 or 4 days ago.

view this post on Zulip Pierre-Marie Pédrot (Jul 22 2021 at 14:52):

(my laptop is already heavily suffering in web browsing mode due to ~40°C outside heat, I am afraid that fiat-crypto is beyond its physical capacities)

view this post on Zulip Gaëtan Gilbert (Jul 22 2021 at 14:56):

engine-bench-lite hasn't changed max memory stats in our bench logs for a while

view this post on Zulip Pierre-Marie Pédrot (Jul 22 2021 at 15:01):

so why is it failing with SIGKILL now?

view this post on Zulip Andrej Dudenhefner (Jul 23 2021 at 09:34):

Guillaume Melquiond said:

constr_eq.vo (real: 104.20, user: 13.12, sys: 1.74, mem: 3018040 ko); 3GB of memory usage? No wonder it fails.

Maybe one should call optimize_heap in ci-engine_bench in constr_eq.v between individual check X000%Z. calls. This reduces memory consumption on my machine by half.

view this post on Zulip Enrico Tassi (Jul 23 2021 at 09:36):

It also makes the timing more reliable, I'm surprised it is not there already.

view this post on Zulip Enrico Tassi (Jul 23 2021 at 09:38):

It would be even better to split the file then, and let the OS "manage" the heap.

view this post on Zulip Andrej Dudenhefner (Jul 23 2021 at 09:42):

I'm not familiar with the project (I also do not know how the results are evaluated wrt. refactoring / additional commands). I'm just annoyed with the current state of ci. @Enrico Tassi would you care to create a small optimize_heap PR?

view this post on Zulip Enrico Tassi (Jul 23 2021 at 09:50):

We could also remove that job from CI, it is much quicker, and ask his author to fix it

view this post on Zulip Enrico Tassi (Jul 23 2021 at 09:51):

@Pierre-Marie Pédrot I did looked at the PR which were merged in the last 3 days and nothing suspicious showed up

view this post on Zulip Pierre-Marie Pédrot (Jul 23 2021 at 10:17):

It's a bit weird that the memory is not already reclaimed between tactic invocations in the constr_eq file.

view this post on Zulip Pierre-Marie Pédrot (Jul 23 2021 at 10:18):

I don't see what is keeping it alive a priori.

view this post on Zulip Andrej Dudenhefner (Jul 23 2021 at 10:21):

Pierre-Marie Pédrot said:

I don't see what is keeping it alive a priori.

Maybe it is worth to invoke Gc.minor (or even Gc.major) at some key points?

view this post on Zulip Pierre-Marie Pédrot (Jul 23 2021 at 10:27):

We tried once to call Gc.major at every line, it was terrible performancewise.

view this post on Zulip Enrico Tassi (Jul 23 2021 at 11:25):

Pierre-Marie Pédrot said:

I don't see what is keeping it alive a priori.

It's not alive, it's just that OCaml's Gc compacts the heap only when the pressure is high. And that is expensive. So, in benchmarks, you should call it by hand when you don't measure, otherwise you may count as part of a command the time it takes to compact the heap after the execution of another commands, which is fuzzying your measurement.

view this post on Zulip Enrico Tassi (Jul 23 2021 at 11:27):

And if you -j 2 then one job which is using 3G may be paired with another one which suddenly uses 1G and the Gc policy is per process, so maybe no one compacts and the kernel decides to kill one.

view this post on Zulip Enrico Tassi (Jul 23 2021 at 11:30):

I'll open a PR to turn the job off

view this post on Zulip Pierre-Marie Pédrot (Jul 23 2021 at 11:44):

We can also just add optimize heaps in the test, I have written the patch already.

view this post on Zulip Pierre-Marie Pédrot (Jul 23 2021 at 11:46):

https://github.com/mit-plv/engine-bench/pull/2

view this post on Zulip Pierre-Marie Pédrot (Jul 23 2021 at 11:47):

it's still weird that other devs are also failing with memory errors, something must have happened

view this post on Zulip Pierre-Marie Pédrot (Jul 23 2021 at 11:48):

we can't just disable all those devs in the ci...

view this post on Zulip Enrico Tassi (Jul 23 2021 at 13:32):

Command terminated by signal 9
.../bedrock2/.../FlatToRiscvFunctions.vo (real: 743.93, user: 730.77, sys: 2.26, mem: 3226360 ko)

BTW, a while ago I'm pretty sure these shared runner had 4G or RAM. Am I dreaming?

view this post on Zulip Guillaume Melquiond (Jul 23 2021 at 13:37):

No, as I wrote earlier, Gitlab's shared runners are Google's N1 runners, so they have 3GB of free RAM. (More precisely, they have 6.5GB shared between several runners, at least two.)

view this post on Zulip Andrej Dudenhefner (Jul 23 2021 at 13:53):

For a quick fix, a few optimize_heap strategically placed in FlatToRiscvFunctions.vshould reduce memory consumption below 3GB.

view this post on Zulip Andrej Dudenhefner (Jul 23 2021 at 17:07):

Pierre-Marie Pédrot said:

https://github.com/mit-plv/engine-bench/pull/2

Peak memory usage (possibly the last check 9000%Z.) is still too much for ci despite intermediate heap optimization.

view this post on Zulip Ali Caglayan (Sep 02 2021 at 14:10):

The category theory CI job is failing everywhere as well as the windows64

view this post on Zulip Ali Caglayan (Sep 02 2021 at 14:11):

@John Wiegley

view this post on Zulip Ali Caglayan (Sep 02 2021 at 14:12):

ah but I think this is on our CI side

view this post on Zulip Ali Caglayan (Sep 02 2021 at 14:12):

https://github.com/coq/coq/runs/3496327745

view this post on Zulip Ali Caglayan (Sep 02 2021 at 14:12):

weird interaction with equations

view this post on Zulip Ali Caglayan (Sep 02 2021 at 14:12):

anybody understand what is going on?

view this post on Zulip Gaëtan Gilbert (Sep 02 2021 at 14:28):

https://github.com/coq/coq/pull/14839

view this post on Zulip Théo Zimmermann (Sep 02 2021 at 14:52):

I wasn't aware that there was also an issue on our side, but it seems that this is not the only issue.
Cf. https://coq.zulipchat.com/#narrow/stream/237655-Miscellaneous/topic/category-theory.20library.20in.20Coq's.20CI
and: https://github.com/jwiegley/category-theory/commit/7010b266cf3b1b1f7970a0aba8521b7716dee363
and: https://github.com/Zimmi48/category-theory/runs/3493747463?check_suite_focus=true

view this post on Zulip Pierre-Marie Pédrot (Sep 28 2021 at 08:05):

Looks like test-suite:async is reliably failing on master, due to test-suite/bugs/closed/bug_14734.v

view this post on Zulip Pierre-Marie Pédrot (Sep 28 2021 at 08:06):

It was introduced in #14758, ping @Gaëtan Gilbert

view this post on Zulip Gaëtan Gilbert (Sep 28 2021 at 09:10):

https://github.com/coq/coq/pull/14947 and https://github.com/coq/coq/issues/14948

view this post on Zulip Ali Caglayan (Sep 29 2021 at 16:55):

quickchick and coqhammer are failing on the CI

view this post on Zulip Gaëtan Gilbert (Sep 29 2021 at 16:56):

coqhammer is just waiting for overlay

view this post on Zulip Emilio Jesús Gallego Arias (Sep 29 2021 at 17:11):

Quickchick was mentioned about cppo being the wrong version

view this post on Zulip Ali Caglayan (Oct 13 2021 at 07:46):

VSCoq?

view this post on Zulip Théo Zimmermann (Oct 13 2021 at 08:02):

Since when is VsCoq tested in our CI?

view this post on Zulip Karl Palmskog (Oct 13 2021 at 08:04):

well, it's good if it's tested in some CI, due to low activity recently I think it's not getting so much testing overall (and it still uses Travis in the repo!)

view this post on Zulip Gaëtan Gilbert (Oct 13 2021 at 08:16):

it's the document manager branch, not the one people use

view this post on Zulip Théo Zimmermann (Oct 13 2021 at 08:33):

And this branch recently received new commits by @Maxime Dénès I think, so this may be related.

view this post on Zulip Gaëtan Gilbert (Oct 13 2021 at 08:45):

just a missing overlay

view this post on Zulip Emilio Jesús Gallego Arias (Oct 13 2021 at 11:58):

Théo Zimmermann said:

Since when is VsCoq tested in our CI?

Actually you merged its support c.f. https://github.com/coq/coq/pull/14334 ; but indeed, currently the vscoq branch doesn't meet the criteria we have in our CI, I tried to build it to test it and indeed, it doesn't work.

view this post on Zulip Théo Zimmermann (Oct 13 2021 at 12:18):

I did, indeed :face_palm:

view this post on Zulip Enrico Tassi (Oct 13 2021 at 12:42):

As I wrote on a recent PR @Maxime Dénès has a document-manager-new branch which should work and should be merged in the one we test when he has the time.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 13 2021 at 14:04):

As of today none of them work, so that's an issue for the CI

view this post on Zulip Emilio Jesús Gallego Arias (Oct 13 2021 at 14:05):

Given policy if a PR author cannot build an overlay, he may need to drop temporarily that experimental branch from the CI

view this post on Zulip Gaëtan Gilbert (Oct 13 2021 at 14:12):

it does compile up to the overlay that's waiting

view this post on Zulip Emilio Jesús Gallego Arias (Oct 13 2021 at 14:16):

The vscode part doesn't compile, so it is impossible to use / test

view this post on Zulip Enrico Tassi (Oct 13 2021 at 14:20):

This is probably a problem with the node dependencies / vscode version and what you have installed. Clearly a bug to be fixed, but "impossible" is a bit misleading, since I'm pretty sure it works on @Maxime Dénès laptop (I saw it with my eyes).

view this post on Zulip Emilio Jesús Gallego Arias (Oct 13 2021 at 14:25):

Indeed I guess there is a way to make it work, but as of today I follow the instructions and it doesn't work, thus it is not possible for me to test


Last updated: Oct 21 2021 at 21:03 UTC