The test-suite is failing frequently (about 10% of the time) on Mac OS. The failure I'm seeing in at least one of these cases is
File "./modules/PO.v", line 51, characters 0-12:
Error:
/Users/runner/runners/2.166.4/work/1/s/test-suite/modules/Nat.vo: premature end of file. Try to rebuild it.
0m0.001s 0m0.001s
0m0.084s 0m0.050s
==========> FAILURE <==========
modules/PO.v...Error! (should be accepted)
FAILURES
modules/PO.v...Error! (should be accepted)
What's up with this?
https://github.com/coq/coq/pull/12129/checks?check_run_id=659402354
I have this from time to time too (as well as for another modules test) but did not investigate. If you tell me it is MacOS X specific, then I could pay more attention to it when it arrives.
Any hints on what to think about this refman CI failure:
Latexmk: calling xindy( -L english -C utf8 -M sphinx.xdy -I xelatex -o CoqRefMan.ind CoqRefMan.idx )
Latexmk: applying rule 'makeindex CoqRefMan.idx'...
Warning: reserving address range 0x7ff2855cd000...0x8020a1572fff that contains memory mappings. clisp might crash later!
Is the runner pyrolyse? See coq/coq#12280.
@Hugo Herbelin I'm not sure if it's Mac OS specific, but this is the first I've seen it, and it was on Mac OS.
I noticed (while working on coq/coq#11992) that there's a crash in fiat_parsers CI: "File "interp/constrextern.ml", line 844, characters 9-15: Assertion failed.", even when CI is passing (don't ask me, no idea :) — e.g. grep for Anomaly in https://travis-ci.org/github/mit-plv/fiat/jobs/683994964 (web ui) / https://api.travis-ci.org/v3/job/683994964/log.txt (raw text). Is that a known Coq bug? /cc @Jason Gross
Very strange. This is because I generally assume that if make fails, then make || make will also fail, but here it does not. Let me fix that on the CI.
In any case, this seems to be a new bug in Coq, where somehow the anomaly occurs the first time you coqc the file, but not the second time
Hm, this is actually pretty tricky to fix on my end, due to how my automation is set up. I'm also not sure that it's the job of my own CI to catch heisenanomalies in Coq... (Note that the build succeeds on Coq's CI)
Is there a runner shortage right now? I have a PR that doesn't start any CI run after quite a while, except for Windows tests.
@Pierre-Marie Pédrot You mean this is not the usual GitLab CI pipeline exceeded error?
Oh indeed, I have the same problem on my PR.
We have only 22 concurrently running jobs right now, two of them being Windows builds and most of the rest being on pyrolyse or roquableu...
When there is a GitLab shared runner shortage, we only get 1 concurrent shared runner for our project.
The wrathful gods of software engineering have decided against the 8.12 branch
Usually this kind of shortage is solved at night though.
It's never been like in the Travis times when we could have a queue of 72 hours.
Yeah we have quite a shortage; we should definitely budget some newer server for our CI; a dual Threadripper 3950x with 1 TB of RAM could do wonders for our builds
Also fiat-crypto is quite out of hand, taking almost 3 hours to complete.
and all the async / quick jobs in general being very slow and still unreliable
Before budgeting new servers, I think we should work at reducing the size of our CI. In particular, it is not too late to come back to coq/coq#6724. If we implement this, then we can adopt a bors-like merging workflow, and if we do this, then we can set a lighter, partly on-demand CI for PRs.
Not sure that will work so well for us, as compared to a simpler solution with what we have [for example coqbot running the full CI only when the PR is tagged as ready to merge, etc...]
#6724 is not actually blocking having a bors-like setup, you can have it with the current setup
but in both cases, if the merge fails, you are quite in trouble.
We often want to trigger a CI before PRs are finalized, so better just do something on-demand than rely on labels.
a repush won't run the tests you care about, your local test may miss something, and another merge attempt is equally expensive
and if you go with a lockfile, you are gonna have to worry about consistency on top of that, and this is not a trivial overhead
but indeed we could be lighter right now, I wonder by how much tho
as the amount of stuff you care about really varies among PRs, we could use the newer path-based selection rules in CI
as proposed for the docs
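(As a rough sketch of what that could look like: GitLab also has declarative rules:changes for this, but the same effect can be approximated at the top of a heavy job script. The BASE_SHA variable and the make target below are assumptions for illustration, not our actual config.)
# hypothetical guard for a heavy library job; BASE_SHA is assumed to hold the PR's merge base
if git diff --name-only "$BASE_SHA"...HEAD | grep -qv -e '^doc/' -e '^man/'; then
  make ci-fiat_crypto   # something outside doc/ and man/ changed: run the heavy job
else
  echo "doc-only change, skipping heavy job"
fi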
I don't see how you can have a bors-like workflow with backward-incompatible overlays in the current setup. Please explain.
same way that you would do when implementing the lockfile
anyways you still need a super beefy server
even more beefy than now
Why?
at least today, we don't suffer the bottleneck when merging
imagine days like today, or in CUDW
with dozens of merge queued
Sure, but bors is supposed to optimize the merging of many PRs.
also quite a few failing
yeah but in our case it'd need tremendous CI power
I don't think so.
well my numbers are otherwise
how much CPU time do we need to merge 20 PRs?
Where do you get your numbers from?
From current CI
that's easy to calculate
how much CPU time do we need to merge 20 PRs?
Same as for 1 PR.
Same a for 1 PR?
how come?
Yes, because if you have several PRs that are merged in a close window, you run just one test.
That's how bors works.
That's also how RMs manage the backporting queue generally.
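(As a rough sketch of the batching idea, not what bors literally runs, and with made-up branch names:)
# merge everything queued in the window into one staging branch
git checkout -B staging master
git merge --no-ff pr-1 pr-2 pr-3
# run the full CI once on staging; if it is green, fast-forward master:
git checkout master && git merge --ff-only staging
# all the queued PRs got merged for the price of a single CI run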
what happens if the first PR on the chain makes the merge fail?
Well, in this case, you have to determine which PR is the culprit and restart the merge of the others.
So there's a need for some not-so-simple logic.
bors does bisecting indeed, but that may need tons of power on busy days
or have devs waiting for a long time IMO
Still doing numbers for a busy day with 20 PRs a day + 10% prob of failing + 10% prob of consistency issues and the default bisect strategy
you need a worker able to swallow a lot
Bisecting still optimizes the use of the power, but smart human beings can optimize further by looking into the error and guessing where it comes from.
10% prob of consistency issues
That's the reason why I'm saying a bors-like workflow is not possible in the current setup for backward-incompatible overlays.
10% prob of failing
And this can be made lower by relying on humans guessing which PRs need additional CI testing before attempting a merge.
consistency issues do arise in the lockfile setup
in fact they are more noticeable
the more the delta for a distributed setup, the weaker the consistency
Do you mean when lockfiles are updated?
I'm not a big fan of hybrid CI setups
WDYM by this? Having a different set of tests on branches and PRs?
Because that's the only way to reduce the size of the CI, I'm afraid.
Théo Zimmermann said:
Do you mean when lockfiles are updated?
yes, you need to update them often, otherwise you are gonna go stale soon, and the longer you go with an unnoticed consistency break, the longer it usually takes to repair [that's a hypothesis tho]
I share this intuition.
I prefer to add more power to the CI rather than reduce the size, but indeed some optimizations could be done
I have some environmental concerns...
the current system is painful, but at least it has pretty good consistency properties; the lockfiles + bors is fine tho; coqbot could take care of bumping etc...
I share this intuition.
But my feeling is that bumping once a week is a good compromise
coqbot could take care of bumping etc...
Right, that was my plan
environmental concerns are indeed a problem, also hard to measure accurately; human time also has an impact
Actually the kind of server I proposed is much much more efficient in terms of energy than what we are using now
So maybe we should address both.
it has by far the best performance/Watt ratio in the whole server marketplace
I didn't know that this could be optimized. Well, then maybe we should go ahead and buy it. I have some budget available that I could spend on this. Maybe not during the sanitary crisis though.
it is true that we should try not to leave the server too idle
but even now, thanks to SSDs + newer CPUs, idle servers don't use a lot of energy
Théo Zimmermann said:
I didn't know that this could be optimized. Well, then maybe we should go ahead and buy it. I have some budget available that I could spend on this. Maybe not during the sanitary crisis though.
Oh sure, the increase in energy efficiency due to newer process nodes + better power design is dramatic; I wouldn't be surprised if for example a top of the line ryzen outperforms pyrolyse by something like 10x in terms of energy per computing unit.
what to remove from the CI is not so clear tho; plugins you mostly need, contribs the same; you are left with jobs regarding multiple OCaml versions and test suites
and validate
that's some saving, but not enough
on the other hand indeed we could just not run the contribs
just the cheap set, and delay that to merge
indeed these failures are rare
Plugins yes, you need to test always, but they build fast. Libraries (what you call contribs I guess), you could test less often: only when there seems to be a risk of incompatibility. Then, the idea is that every time you get a failure that you didn't catch, that's a test-suite bug.
top of the line epyc sorry
unfortunately the kind of server I'm talking about is very efficient, but the price tag is well above 20K O_O
well there are cheaper options
but I was talking about 128 cores
256 threads
quite a beefy machine
huhu gitlab just crashed
ERROR: Preparation failed: Error: No such image: registry.gitlab.com/coq/coq:bionic_coq-V2020-05-19-V33 (docker.go:203:0s)
What's up with this?
where are you seeing this?
Gaëtan Gilbert said:
where are you seeing this?
I was seeing this on the CI for https://github.com/coq/coq/pull/11029 before I rebased it. Now that I've rebased, I'm seeing https://github.com/coq/coq/issues/12400 on fiat-crypto and also Fatal error: ref_table overflow on engine-bench, which I'm about to report
Reported as https://github.com/coq/coq/issues/12401. But what does "ref_table overflow" even mean?
Gaëtan Gilbert said:
where are you seeing this?
I just hit this again, and reported it as https://github.com/coq/coq/issues/12402 and https://gitlab.com/coq/coq/-/jobs/565521645
Jason Gross said:
Reported as https://github.com/coq/coq/issues/12401. But what does "ref_table overflow" even mean?
https://github.com/ocaml/ocaml/blob/b2f467abc0d2caaf40b6446a567f968ce935247f/runtime/minor_gc.c#L621
Quickchick looks broken, what's going on?
Isn't it coq/coq#12441?
I didn't see that somehow, thanks.
Why is MetaCoq broken on the CI? I don't see any commit that could have affected it upstream?
Maybe a change in Equations? There was https://github.com/mattam82/Coq-Equations/compare/74f048116242d7fb13cbc554522f02fe270720d..723f17f3f19302a2d6fb99cfb4983c9b465a9618 between https://gitlab.com/coq/coq/-/jobs/593210622 and https://gitlab.com/coq/coq/-/jobs/593573820 (last successful pipeline and first failing pipeline)
Summoning @Matthieu Sozeau
@Enrico Tassi I'm curious to know why you triggered a WINDOWS_ONLY / enabled_all_addons job on master. FTR there are a few known issues: dynlink errors when building with OCaml >= 4.08, some add-ons that are not in CI are not likely to build on master...
When did I?
I don't recall doing this, I can't find a reason. Is my account being hacked?
Is the "cron" job I started a month ago still on?
You make me worry
That might be the cron job indeed!
It might be better to deactivate it for the time being since we know it won't build.
OK, I trashed it.
Oh right, I forgot MetaCoq was following equations master. I'll fix that ASAP
Just saw this failed job where git fetch --unshallow failed, likely because some branch was removed between the time the repo was cloned and the time when the command was run. Is there any way of making git fetch --unshallow more robust, e.g. by unshallowing only the current branch?
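(Something like the following, I suppose, though I haven't tested it; CI_COMMIT_REF_NAME is GitLab's predefined variable for the branch being built:)
# deepen only the ref the job is building, instead of every branch of the remote
git fetch --unshallow origin "$CI_COMMIT_REF_NAME"
# or, if the server allows fetching by SHA, target just the commit under test:
# git fetch --unshallow origin "$CI_COMMIT_SHA"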
https://github.com/coq/coq/pull/9594#discussion_r261697603
Why not override GIT_DEPTH?
That's what I wanted to do at first, but nowhere in the documentation does it say what value to put if you want no limit (I suppose empty would work but this is unspecified).
let's just use empty GIT_DEPTH
It still isn't documented what it does when you set it to an empty value or zero (whether it is blindly substituted or not), so we should test what it does first.
@Jason Gross It seems that coq-tools are consistently failing, and the first failure in master happened when merging a PR which had a fully green CI, so I suspect the problem is coming from a change on your side.
https://github.com/JasonGross/coq-tools/commits/master
Oops, sorry, I forgot to test thoroughly before pushing, should be fixed now
Something has changed (not on the Coq side) which has broken Windows CI both on Azure (opam-based) and on GitLab CI (manual build of OCaml). See https://dev.azure.com/coq/coq/_build/results?buildId=11965&view=logs&jobId=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=f33089cd-d3fe-578c-2c5e-4ee560e10a0d and https://coq.gitlab.io/-/coq/-/jobs/748531263/artifacts/artifacts/buildlogs/ocaml-4.08.1-make-world.opt_err.txt.
(I've verified that the first failure on master happened when merging a PR whose CI was successful.)
@Théo Zimmermann : I think Cygwin recently updated the C compiler to a new major version - this could easily have all sorts of interesting effects. I think I can pin it to the older version, but this will not work forever since Cygwin typically keeps only the most recent two or three versions, although for gcc there might be more.
@Théo Zimmermann : actually it was the MinGW headers. I meant this announcement: (https://cygwin.com/pipermail/cygwin/2020-September/246341.html)
Interesting thanks! That sounds like a very plausible explanation. Not sure how I should understand your second message though (headers sound like a much less major change).
I just remembered that I read something about MinGW changes around the compiler but didn't remember the details. The mingw-gcc package doesn't seem to have recent changes (https://cygwin.com/packages/summary/mingw64-x86_64-gcc-core.html). Still, this static stuff failing might depend on some C-library-defined macros, so one can't exclude that it was this. The headers did change recently to a new major version, see (https://cygwin.com/packages/summary/mingw64-x86_64-headers.html). If this coincides time-wise, I would say this is a likely root cause.
Continued here: https://github.com/coq/coq/issues/13065
Is it normal for color to be broken on master?
:(
dpdgraph seems broken also, I hope it's not due to the PR I've submitted yesterday
(it was working properly locally, and I don't understand the logs enough to be certain of the root of the failure)
I confirm I broke dpdgraph alone :boom:
... but I swear I am not responsible for color TTBMK
for some reason now https://gitlab.mpi-sws.org/iris/stdpp/-/archive/.tar.gz is a 404 error, breaking the iris CI
ah, this is because the git ref is not correctly set
Théo Zimmermann said:
Yes, because if you have several PRs that are merged in a close window, you run just one test.
Hi, I've just seen this thread (just skimmed, sorry) and I wanted to ask if you were aware of the following GitLab CI feature?
https://docs.gitlab.com/ee/ci/merge_request_pipelines/pipelines_for_merged_results/merge_trains/
− anyway I guess this is probably not useful, as it specifically deals with GitLab MRs, not GitHub PRs… and IIUC, bors provides a similar feature? (I'm not a bors user), although maybe not using the very same algo…
Looks like a very old quote! Yes, there are many implementations of "merge-trains" "bors-like" features, but what is blocking us from using this system is that our CI is too unreliable, especially with the way we deal with backward-incompatible overlays.
So Iris seems reliably broken on the recent CI pipelines
see e.g. https://gitlab.com/coq/coq/-/jobs/779420849 or https://gitlab.com/coq/coq/-/jobs/778337747
Stdpp seems to have changed its build system according to one commit message, which might explain that
Indeed, so now we have to look for coq-iris.opam instead of opam.
I think @Ralf Jung changed something very recently.
Yep, the opam file got moved, so I wrote a fix
Oh so sorry, I meant to update the Coq side of it but then forgot :(
I can submit a PR if that helps
Nah, thanks, it should be solved by https://github.com/coq/coq/pull/13158 hopefully
ah but iris-examples will also change to match the new style, so I prepared https://github.com/coq/coq/pull/13159
feel free to just take that patch into your PR if that is easier
Right, so we have to be synchronized with the upstream merge
When are you planning to perform it?
I already did
so my PR should pass CI now
Indeed, I was just reading https://gitlab.com/coq/coq/-/jobs/779609904 :face_palm:
it got cancelled due to some limit or so it seems
Right, I'll close my PR then.
Uh, CI is not starting on https://github.com/coq/coq/pull/13168. Is that normal?
... despite two force pushes.
Coqbot is dead, long live coqbot?
Answering myself: GitHub is having problems with webhooks, it seems, according to https://www.githubstatus.com/
Travis is dead as well, this is a systemic issue
my latest commit to a coq-community project is not even reflected via pull request web interface
@Guillaume Melquiond Something surprising happened. CompCert started failing suddenly in Coq's CI in both the CompCert job and the VST job (which includes a bundled version of CompCert). This seems unrelated to any change in Coq since it failed suddenly in all builds with no related change. However, since CompCert and its VST copy are not simultaneously updated, I would have expected to see a change in the master branch of Flocq that would explain the failure, but the last commit shown at https://gitlab.inria.fr/flocq/flocq/-/tree/master is from 2 months ago.
See https://gitlab.com/coq/coq/-/jobs/936644456 for the failing log.
That is expected. CompCert is not compatible with Flocq's master. (Actually, not much will be compatible with Flocq's master, since it paves the way to Flocq 4.) CompCert's developers have been aware of it since early September: https://github.com/AbsInt/CompCert/pull/368. But there is nothing they can do about it for now. Anyway, this does not matter much for CompCert, since it vendors Flocq.
Thanks for your quick answer @Guillaume Melquiond, unfortunately I didn't notice until now.
But why did it start failing just now despite the last commit on Flocq's master being two months old?
And in Coq's CI we were not using the vendored version to avoid building Flocq three times. Do you have a release branch for Flocq 3 that you maintain and that we could use in Coq's CI instead of master?
I suggest you create a coq-master branch that we track in CI and in which you merge master manually, from time to time, when it fits (e.g. when CompCert catches up). I do it this way for elpi (I develop against Coq stable for example).
That is a conjunction of two things. First, Coq's CI stopped using CompCert's vendored Flocq only recently (in preparation of the 8.13 release). Second, I had temporarily stashed the main commits out of Flocq's master to prepare Flocq 3.4 (in preparation of the 8.13 release). Thus, the issue with Coq's CI went unnoticed because both things happened around the same time.
Just to be clear, it might take several years before CompCert catches up to Flocq's master. (First, Flocq 4 needs to be released. Second, CompCert needs to decide to drop Flocq 3 in favor of Flocq 4.) So, a coq-master branch does not seem like a good idea (or I misunderstood it). But you can decide to track the flocq-3 branch instead of master. That said, this is only delaying the issue. Because at some point, CompCert's master and CoqInterval's master will disagree (even if only temporarily) on the major version of Flocq they are compatible with.
CompCert's master and CoqInterval's master will disagree (even if only temporarily) on the major version of Flocq they are compatible with.
This is a serious issue for people working on verification of floating point C code. In the Coq Platform I think we have to pick the latest combination of versions which share the same version of Flocq until this is solved. I thought about whether it makes sense to have a separate package for flocq-4 during the transition period, but after thinking it through I came to the conclusion that this is better solved with separate opam switches for Flocq 3 based and Flocq 4 based tools.
I will switch Coq Platform master to track flocq-3 and see what happens.
it might take several years
We can maybe add 2 jobs for flocq, flocq-3 used by compcert, and flocq-4 (to be used at some point), with the plan to remove flocq-3 eventually.
@Enrico Tassi : as I said I also thought about this, but rejected it because it would require a lot of additional -3 and -4 packages, say for CoqInterval. My conclusion is that it would be better to have a separate switch - say an option in the Coq Platform like Flocq3 with full set of packages or Flocq4 with reduced set of packages. But it is more a 70%/30% pro/con thing.
But I guess it would make sense to have an opam package flocq.dev.3 or flocq.3.dev. Which one?
I was talking about Coq's CI, not the platform.
It is customary to have .$branch.dev so I'd pick the latter (in Coq it's v8.X.dev IIRC)
Coq's CI (for better or worse) is not based on opam, so a job is a bunch of commands to fetch and build a package. I'd pick only one version of each dependency under CI, moving them on top of flocq-4 once all are ready.
Let's test only flocq-3 for now. Testing two branches would be a bit inconvenient for overlay creation.
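(Concretely that should just mean pointing the Flocq entry of Coq's CI overlay description, dev/ci/ci-basic-overlay.sh, at the flocq-3 branch; a sketch from memory, the exact syntax of that file may differ:)
# track Flocq's flocq-3 branch instead of master
project flocq "https://gitlab.inria.fr/flocq/flocq" "flocq-3"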
@Guillaume Melquiond If you could give us a heads-up when you switch CoqInterval master to requiring Flocq 4, then we could introduce the two jobs at that point, or switch CoqInterval to a stable branch if you maintain one at that point.
elpi has been broken for 2 days, what is going on?
https://gitlab.com/coq/coq/-/pipelines?page=1&scope=finished&ref=master
@Enrico Tassi mentioned he was going to fix this today in another thread
It should be fixed now. The job reported by @Gaëtan Gilbert https://github.com/LPCIC/coq-elpi/issues/354 was successfully re-run
Current issue with itauto is waiting for this overlay to be merged: https://gitlab.inria.fr/fbesson/itauto/-/merge_requests/5
@Maxime Dénès Looks like jasmin is failing on CI, but I cannot understand why. Upstream on GitLab there are no failures: https://gitlab.com/jasmin-lang/jasmin
Thanks, I'll have a look
failing because of the ident/name change in https://github.com/coq/coq/pull/15754
Ah that's a shame, I should have restarted the CI before merging
I'll submit a patch to Jasmin
Here is the patch https://github.com/jasmin-lang/jasmin/pull/98
Fiat crypto is broken but that has been discussed. Unimath seems to be running for over 3 hrs.
it seems to be hovering around the limit, in some runs it passes and in some others it fails
Unimath dies because of an OOM rather than the 3h limit, it seems.
Command terminated by signal 9
UniMath/Bicategories/PseudoFunctors/Preservation/BiadjunctionPreserveInserters.vo (real: 1134.77, user: 440.04, sys: 8.37, mem: 2937224 ko)
build/CoqMakefile.make:790: recipe for target 'UniMath/Bicategories/PseudoFunctors/Preservation/BiadjunctionPreserveInserters.vo' failed
In the CI script we already unplug one suspiciously similar looking file from the compilation process
# DisplayedInserter consumes too much memory for the shared workers
sed -i.bak 's|DisplayedBicats/Examples/DisplayedInserter.v||' UniMath/Bicategories/.package/files
maybe we should do the same (or ask the unimath people to try to keep memory consumption at bay)
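(Concretely, that would be mirroring the line above; the relative path is taken from the failing log and assumed to follow the same convention as the DisplayedInserter.v entry:)
# hypothetical: unplug the file that now OOMs, same trick as above
sed -i.bak 's|PseudoFunctors/Preservation/BiadjunctionPreserveInserters.v||' UniMath/Bicategories/.package/files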
I don't see anything suspicious in that file, it's just that this is a computation-heavy dev that consumes ~4GiB of memory on my machine
is there an easy way to bump up the available amount of memory for the CI runners?
or should we just deactivate this file as we did for the other in the script (this seems to be dead code btw)
Shall we deactivate some files then?
I think we should ask the upstream to provide a target called, say, coq-ci, which tests relevant parts and runs in 1h of time.
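(Then the unimath job on our side would boil down to something like this; the coq-ci target name is the hypothetical one suggested above, and the clone URL is just the upstream GitHub repository:)
# shallow clone and build only the curated CI subset
git clone --depth 1 https://github.com/UniMath/UniMath.git
make -C UniMath coq-ci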
I've been staring at CI for like forever recently, we can't have a stack of jobs taking hours
I mean, it should be part of the CI requirements
From a CI point of view, unimath is both a very interesting dev and a terrible one. It's interesting because it stresses Coq a lot, and it's terrible precisely because of that.
I suppose unimath could be split into many smaller opam packages under a virtual package. This would allow us to test it much faster.
@Jason Gross fiat-crypto is now broken upstream, it seems (e.g. https://gitlab.com/coq/coq/-/jobs/2382855355)
Looks like there was a recent direct push to master https://github.com/mit-plv/fiat-crypto
Splitting into parallel packages is something orthogonal to our CI. We should describe requirements, not how to satisfy them, IMO.
About full unimath, I guess you want the bench to fully run it. But if your PR is not about performance (e.g. cleaning up an API) you simply gain nothing in running it fully. And you may still run a bench, but once.
Now VST is broken https://gitlab.com/coq/coq/-/jobs/2383881008
Enrico Tassi said:
Splitting into parallel packages is something orthogonal to our CI. We should describe requirements, not how to satisfy them, IMO.
About full unimath, I guess you want the bench to fully run it. But if your PR is not about performance (e.g. cleaning up an API) you simply gain nothing in running it fully. And you may still run a bench, but once.
I suppose we could update the policy to add a time limit. However jobs like fiat-crypto may be long but are useful to test in the CI since they have a lot of coverage. Do we play favorites with big jobs?
It is also possible to split it into smaller packages, which we could ask the fiat-crypto developers to help us set up.
bedrock, vst, fiat-crypto-legacy take 1:30h to finish, unimath and fiat-crypto take 2:30h
We could ask that CI projects keep their time under 2hrs and suggest splitting into smaller projects to allow for this.
But I think we need to be in agreement how long it should take for the CI to run. For me 3hrs per job is kind of OK, but I would be interested to hear other peoples opinion.
3h per job means that you can push/fix a PR once a day, at most. The first time you discover the failure, you burn 2h of CPU on your computer, you fix it, you push the overlay. Ditto when you "just rebase" a PR, or just fix the doc. This means anyone taking care of your PR has to act the next day if they wait for CI to be green (imagine you push at 3PM). IMO, it is totally out of control.
and now VST is broken as well: https://gitlab.com/coq/coq/-/jobs/2383649606
The first time you discover the failure, then you burn 2h CPU on your computer, you fix you push the overlay
@Enrico Tassi this was true before the introduction of the minimizer, but not anymore.
with the minimizer you burn 24h cpu on microsoft computers?
Right but once you have the minimized file, making a patch is way easier than running 2hrs every time.
^this
the minimizer is nice, but it does not cover the case of an overlay, only that of a bug in your new code.
If I read the VST error correctly, we now need to install Flocq3 in the CI.
Is there a way the minimizer could be made to help cover overlays? For example, the minimizer does have a mode where it doesn't admit lemmas. Or is the hard part of overlays that the first failure might not be the only one?
compcert now broken, I guess due to https://github.com/AbsInt/CompCert/pull/368 (flocq4 stuff)
why does everybody always synchronize right before the branch to break everything in the CI?
it's called "deadline approaching"
so what do we do with unimath?