is it a thing that ci-bedrock2 sometimes fails cause coqc gets killed?
is there a heat wave where the servers are located or what?
https://github.com/coq/coq/pull/14690/checks?check_run_id=3124940152
this is happening on several unrelated PRs
maybe related to bedrock signal 9 from outer space?
It must be Bezos sending it
It seems that there is a memory problem with these devs. I don't know if we introduced a Coq patch that increased the memory consumption or if there is something about these devs that was changed recently, but it should be fixed quickly.
cc @Jason Gross since two of the failing devs are engine-bench and fiat-crypto
the failures are either signal 9 (probably the kernel firing lasers after OOM) or error 137 (which seems to be Docker's way of signaling OOM)
No, 137 means that Docker itself (rather than the containerized program) received signal 9. This might still be due to the OOM-killer, but I do not understand why Docker itself would use tons of memory.
engine-bench has had literally no recent changes, I believe. I suspect fiat-crypto has not had any relevant recent changes. What files OOM? Can you look at max memory differences between 8.13 and master or something and then bisect?
137 errors are errors at the level of GitLab CI (my hypothesis is a free runner being preempted for something else). In principle, those are automatically restarted by coqbot. If that's not the case, please open an issue with an example.
ci-engine_bench seems to reliably throw 137 errors, see e.g. https://gitlab.com/coq/coq/-/jobs/1443507444
constr_eq.vo (real: 104.20, user: 13.12, sys: 1.74, mem: 3018040 ko); 3GB of memory usage? No wonder it fails.
I thought we had a few files in the ci that reached about 4GB of memory and that it was considered normal
That would be surprising. Gitlab's shared runners are Google's N1 runners, which have 3GB of free RAM.
side question: why is a job called *bench in the CI (and not only in the bench)?
I guess that's because you want it to be checked for compatibility so that it works when you want to run the bench...
Pierre-Marie Pédrot said:
ci-engine_bench seems to reliably throw 137 errors, see e.g. https://gitlab.com/coq/coq/-/jobs/1443507444
Ah OK, then that's an error thrown directly when executing coqc
and this is not what I had in mind.
Enrico Tassi said:
side question: why is a job called *bench in the CI (and not only in the bench)?
what do you mean?
I mean that unless the job also tests features of Coq which are not tested by other ci jobs, it should not be there in the first place. If it tests the performance of Coq, it should just be run in the bench.
@Théo Zimmermann if you want the job not to bitrot, you can then provide a quick target which just checks the ML code (if any) compiles.
also tests features of Coq which are not tested by other ci jobs
That's hard to predict though...
I agree that in general this is hard. But foobar_bench seems to be a job designed to test the perf of something, not its functional behavior. So I doubt it tests a dark corner of the system. Maybe the name is misleading, I don't know.
For sure, if you need to test a functionality, it is very likely you can come up with an example which consumes less than 3G of ram ;-)
I personally don't have time (nor proper infrastructure) to easily bisect the memory increase, it'd be nice if somebody could do that.
it's a recent thing, like ~3 or 4 days ago.
(my laptop is already heavily suffering in web browsing mode due to ~40°C outside heat, I am afraid that fiat-crypto is beyond its physical capacities)
engine-bench-lite hasn't changed max memory stats in our bench logs for a while
so why is it failing with SIGKILL now?
Guillaume Melquiond said:
constr_eq.vo (real: 104.20, user: 13.12, sys: 1.74, mem: 3018040 ko); 3GB of memory usage? No wonder it fails.
Maybe one should call optimize_heap in ci-engine_bench, in constr_eq.v, between the individual check X000%Z. calls. This reduces memory consumption on my machine by half.
It also makes the timing more reliable; I'm surprised it is not there already.
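For concreteness, a minimal, self-contained sketch of what such a change could look like; the step_* definitions below are only lightweight stand-ins I made up for the heavy check N%Z. steps that actually live in constr_eq.v, which are not reproduced here:
```coq
(* Hedged sketch, not the actual engine-bench code: the real constr_eq.v
   repeats heavy check N%Z. steps; the Definitions below are only lightweight
   stand-ins.  The point is where Optimize Heap. (Coq's vernacular hook into
   OCaml's heap compaction, i.e. Gc.compact) would be inserted. *)
Require Import ZArith.
Open Scope Z_scope.

Definition step_1000 := Eval vm_compute in Z.pow 2 20. (* stand-in for check 1000%Z. *)
Optimize Heap. (* compact and release memory before the next step runs *)

Definition step_2000 := Eval vm_compute in Z.pow 2 21. (* stand-in for check 2000%Z. *)
Optimize Heap.
```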
It would be even better to split the file then, and let the OS "manage" the heap.
I'm not familiar with the project (I also do not know how the results are evaluated wrt. refactoring / additional commands). I'm just annoyed with the current state of the CI. @Enrico Tassi would you care to create a small optimize_heap PR?
We could also remove that job from the CI (it is much quicker) and ask its author to fix it
@Pierre-Marie Pédrot I did look at the PRs which were merged in the last 3 days and nothing suspicious showed up
It's a bit weird that the memory is not already reclaimed between tactic invocations in the constr_eq file.
I don't see what is keeping it alive a priori.
Pierre-Marie Pédrot said:
I don't see what is keeping it alive a priori.
Maybe it is worth invoking Gc.minor (or even Gc.major) at some key points?
We tried once to call Gc.major at every line; it was terrible performance-wise.
Pierre-Marie Pédrot said:
I don't see what is keeping it alive a priori.
It's not alive, it's just that OCaml's GC compacts the heap only when the pressure is high. And that is expensive. So, in benchmarks, you should call it by hand at points you are not measuring; otherwise, you may count as part of one command the time it takes to compact the heap after the execution of another command, which skews your measurement.
And if you run with -j 2, then one job which is using 3GB may be paired with another one which suddenly uses 1GB; the GC policy is per process, so maybe neither compacts and the kernel decides to kill one.
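As a sketch of the measurement-hygiene point above (again with lightweight stand-in workloads I made up, not the real benchmark), the compaction is triggered outside the timed command, so Time only measures the step itself:
```coq
(* Hedged sketch with stand-in workloads: compact the heap *before* each timed
   step, so the Time of one step does not include GC/compaction work left over
   from the previous one. *)
Require Import ZArith.
Open Scope Z_scope.

Optimize Heap. (* untimed: pay the compaction cost outside the measurement *)
Time Definition s1 := Eval vm_compute in Z.pow 2 22.

Optimize Heap.
Time Definition s2 := Eval vm_compute in Z.pow 2 23.
```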
I'll open a PR to turn the job off
We can also just add optimize_heap calls in the test; I have written the patch already.
https://github.com/mit-plv/engine-bench/pull/2
it's still weird that other devs are also failing with memory errors, something must have happened
we can't just disable all those devs in the ci...
Command terminated by signal 9
.../bedrock2/.../FlatToRiscvFunctions.vo (real: 743.93, user: 730.77, sys: 2.26, mem: 3226360 ko)
BTW, a while ago I'm pretty sure these shared runners had 4GB of RAM. Am I dreaming?
No, as I wrote earlier, Gitlab's shared runners are Google's N1 runners, so they have 3GB of free RAM. (More precisely, they have 6.5GB shared between several runners, at least two.)
For a quick fix, a few optimize_heap strategically placed in FlatToRiscvFunctions.v should reduce memory consumption below 3GB.
Pierre-Marie Pédrot said:
Peak memory usage (possibly the last check 9000%Z.) is still too much for the CI despite intermediate heap optimization.
The category theory CI job is failing everywhere, as well as the windows64 one
@John Wiegley
ah but I think this is on our CI side
https://github.com/coq/coq/runs/3496327745
weird interaction with equations
anybody understand what is going on?
https://github.com/coq/coq/pull/14839
I wasn't aware that there was also an issue on our side, but it seems that this is not the only issue.
Cf. https://coq.zulipchat.com/#narrow/stream/237655-Miscellaneous/topic/category-theory.20library.20in.20Coq's.20CI
and: https://github.com/jwiegley/category-theory/commit/7010b266cf3b1b1f7970a0aba8521b7716dee363
and: https://github.com/Zimmi48/category-theory/runs/3493747463?check_suite_focus=true
Looks like test-suite:async is reliably failing on master, due to test-suite/bugs/closed/bug_14734.v
It was introduced in #14758, ping @Gaëtan Gilbert
https://github.com/coq/coq/pull/14947 and https://github.com/coq/coq/issues/14948
quickchick and coqhammer are failing on the CI
coqhammer is just waiting for overlay
QuickChick was mentioned as failing because of cppo being the wrong version
VSCoq?
Since when is VsCoq tested in our CI?
well, it's good if it's tested in some CI; due to low activity recently, I think it's not getting much testing overall (and it still uses Travis in the repo!)
it's the document manager branch, not the one people use
And this branch recently received new commits by @Maxime Dénès I think, so this may be related.
just a missing overlay
Théo Zimmermann said:
Since when is VsCoq tested in our CI?
Actually you merged its support, cf. https://github.com/coq/coq/pull/14334 ; but indeed, currently the vscoq branch doesn't meet the criteria we have in our CI. I tried to build it to test it and indeed, it doesn't work.
I did, indeed :face_palm:
As I wrote on a recent PR, @Maxime Dénès has a document-manager-new branch which should work and should be merged into the one we test when he has the time.
As of today none of them work, so that's an issue for the CI
Given the policy, if a PR author cannot build an overlay, he may need to temporarily drop that experimental branch from the CI
it does compile up to the overlay that's waiting
The vscode part doesn't compile, so it is impossible to use / test
This is probably a problem with the node dependencies / vscode version and what you have installed. Clearly a bug to be fixed, but "impossible" is a bit misleading, since I'm pretty sure it works on @Maxime Dénès's laptop (I saw it with my own eyes).
Indeed I guess there is a way to make it work, but as of today I follow the instructions and it doesn't work, thus it is not possible for me to test
Why do nix builds fail on main repo? https://github.com/coq/coq/runs/4016008438
We've been in the red for a few weeks now
https://gitlab.com/coq/coq/-/jobs/1718558727
See https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs.20.26.20plugin.20devs/topic/pkg.3Anix.3Adeploy.20unhappy. The reason for this failure is not clear. But what is clear is that these jobs only run on the master branch, so this is why it does not affect the CI of PRs.