The test-suite is failing frequently (about 10% of the time) on Mac OS. The failure I'm seeing in at least one of these cases is
File "./modules/PO.v", line 51, characters 0-12:
Error:
/Users/runner/runners/2.166.4/work/1/s/test-suite/modules/Nat.vo: premature end of file. Try to rebuild it.
0m0.001s 0m0.001s
0m0.084s 0m0.050s
==========> FAILURE <==========
modules/PO.v...Error! (should be accepted)
FAILURES
modules/PO.v...Error! (should be accepted)
What's up with this?
https://github.com/coq/coq/pull/12129/checks?check_run_id=659402354
I have this from time to time too (as well as for another modules test) but did not investigate. If you tell me it is MacOS X specific, then I could pay more attention to it when it arrives.
Any hints on what to think about this refman CI failure:
Latexmk: calling xindy( -L english -C utf8 -M sphinx.xdy -I xelatex -o CoqRefMan.ind CoqRefMan.idx )
Latexmk: applying rule 'makeindex CoqRefMan.idx'...
Warning: reserving address range 0x7ff2855cd000...0x8020a1572fff that contains memory mappings. clisp might crash later!
Is the runner pyrolyse? See coq/coq#12280.
@Hugo Herbelin I'm not sure if it's Mac OS specific, but this is the first I've seen it, and it was on Mac OS.
I noticed (while working on coq/coq#11992) that there's a crash in fiat_parsers CI: "File "interp/constrextern.ml", line 844, characters 9-15: Assertion failed.", even when CI is passing (don't ask me, no idea :) — e.g. grep for Anomaly in https://travis-ci.org/github/mit-plv/fiat/jobs/683994964 (web ui) / https://api.travis-ci.org/v3/job/683994964/log.txt (raw text). Is that a known Coq bug? /cc @Jason Gross
Very strange. This is because I generally assume that if make fails, then make || make will also fail, but here it does not. Let me fix that on the CI.
In any case, this seems to be a new bug in Coq, where somehow the anomaly occurs the first time you coqc the file, but not the second time
Hm, this is actually pretty tricky to fix on my end, due to how my automation is set up. I'm also not sure that it's the job of my own CI to catch heisenanomalies in Coq... (Note that the build succeeds on Coq's CI)
Is there a runner shortage right now? I have a PR that doesn't start any CI run after quite a while, except for Windows tests.
@Pierre-Marie Pédrot You mean this is not the usual GitLab CI pipeline exceeded error?
Oh indeed, I have the same problem on my PR.
We have only 22 concurrently running jobs right now, two of them being Windows builds and most of the rest being on pyrolyse or roquableu...
When there is a GitLab shared runner shortage, we only get 1 concurrent shared runner for our project.
The wrathful gods of software engineering have decided against the 8.12 branch
Usually this kind of shortage is solved at night though.
It's never been like in the Travis times when we could have a queue of 72 hours.
Yeah we have quite a shortage; we should definitely budget some newer server for our CI; a dual Threadripper 3950x with 1 TB of RAM could do wonders for our builds
Also fiat-crypto is quite out of hand, taking almost 3 hours to complete.
and all the async / quick jobs in general being very slow and still unreliable
Before budgeting new servers, I think we should work at reducing the size of our CI. In particular, it is not too late to come back to coq/coq#6724. If we implement this, then we can adopt a bors-like merging workflow, and if we do this, then we can set a lighter, partly on-demand CI for PRs.
Not sure that will work so well for us, as compared to a simpler solution with what we have [for example coqbot running the full CI only when the PR is tagged as ready to merge, etc...]
#6724 is not actually blocking having a bors-like setup, you can have it with the current setup
but in both cases, if the merge fails, you are quite in trouble.
We often want to trigger a CI before PRs are finalized, so better just do something on-demand than rely on labels.
a repush won't run the tests you care about, your local test may miss something, and another merge attempt is equally expensive
and if you go with a lockfile, you are gonna have to worry about consistency on top of that, and this is not a trivial overhead
but indeed we could be lighter right now, I wonder by how much tho
as the amount of stuff you care about really varies among PRs, we could use the newer path-based selection rules in CI
as proposed for the docs
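(As a rough sketch of what that could look like: GitLab also has declarative rules:changes for this, but the same effect can be approximated at the top of a heavy job script. The BASE_SHA variable and the make target below are assumptions for illustration, not our actual config.)
# hypothetical guard for a heavy library job; BASE_SHA is assumed to hold the PR's merge base
if git diff --name-only "$BASE_SHA"...HEAD | grep -qv -e '^doc/' -e '^man/'; then
  make ci-fiat_crypto   # something outside doc/ and man/ changed: run the heavy job
else
  echo "doc-only change, skipping heavy job"
fi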
I don't see how you can have a bors-like workflow with backward-incompatible overlays in the current setup. Please explain.
same way that you would do when implementing the lockfile
anyways you still need a super beefy server
even more beefy than now
Why?
at least today, we don't suffer the bottleneck when merging
imagine days like today, or in CUDW
with dozens of merge queued
Sure, but bors is supposed to optimize the merging of many PRs.
also quite a few failing
yeah but in our case it'd need tremendous CI power
I don't think so.
well my numbers are otherwise
how much CPU time do we need to merge 20 PRs?
Where do you get your numbers from?
From current CI
that's easy to calculate
how much CPU time do we need to merge 20 PRs?
Same as for 1 PR.
Same a for 1 PR?
how come?
Yes, because if you have several PRs that are merged in a close window, you run just one test.
That's how bors works.
That's also how RMs manage the backporting queue generally.
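(As a rough sketch of the batching idea, not what bors literally runs, and with made-up branch names:)
# merge everything queued in the window into one staging branch
git checkout -B staging master
git merge --no-ff pr-1 pr-2 pr-3
# run the full CI once on staging; if it is green, fast-forward master:
git checkout master && git merge --ff-only staging
# all the queued PRs got merged for the price of a single CI run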
what happens if the first PR on the chain makes the merge fail?
Well, in this case, you have to determine which PR is the culprit and restart the merge of the others.
So there's a need for some not-so-simple logic.
bors does bisecting indeed, but that may need tons of power on busy days
or have devs waiting for a long time IMO
Still doing numbers for a busy day with 20 PRs a day + 10% prob of failing + 10% prob of consistency issues and the default bisect strategy
you need a worker able to swallow a lot
Bisecting still optimizes the use of the power, but smart human beings can optimize further by looking into the error and guessing where it comes from.
10% prob of consistency issues
That's the reason why I'm saying a bors-like workflow is not possible in the current setup for backward-incompatible overlays.
10% prob of failing
And this can be made lower by relying on humans guessing which PRs need additional CI testing before attempting a merge.
consistency issues do arise in the lockfile setup
in fact they are more noticeable
the more the delta for a distributed setup, the weaker the consistency
Do you mean when lockfiles are updated?
I'm not a big fan of hybrid CI setups
WDYM by this? Having a different set of tests on branches and PRs?
Because that's the only way to reduce the size of the CI, I'm afraid.
Théo Zimmermann said:
Do you mean when lockfiles are updated?
yes, you need to update them often, otherwise you are gonna go stale soon, and the longer you go with an unnoticed consistency break, the longer it usually takes to repair [that's a hypothesis tho]
I share this intuition.
I prefer to add more power to the CI rather than reduce the size, but indeed some optimizations could be done
I have some environmental concerns...
the current system is painful, but at least it has pretty good consistency properties; the lockfiles + bors is fine tho; coqbot could take care of bumping etc...
I share this intuition.
But my feeling is that bumping once a week is a good compromise
coqbot could take care of bumping etc...
Right, that was my plan
environmental concerns are indeed a problem, also hard to measure accurately; human time also has an impact
Actually the kind of server I proposed is much much more efficient in terms of energy than what we are using now
So maybe we should address both.
it has by far the best performance/Watt ratio in the whole server marketplace
I didn't know that this could be optimized. Well, then maybe we should go ahead and buy it. I have some budget available that I could spend on this. Maybe not during the sanitary crisis though.
it is true that we should try not to leave the server too idle
but even now, thanks to SSDs + newer CPUs, idle servers don't use a lot of energy
Théo Zimmermann said:
I didn't know that this could be optimized. Well, then maybe we should go ahead and buy it. I have some budget available that I could spend on this. Maybe not during the sanitary crisis though.
Oh sure, the increase in energy efficiency due to newer process nodes + better power design is dramatic; I wouldn't be surprised if for example a top of the line ryzen outperforms pyrolyse by something like 10x in terms of energy per computing unit.
what to remove from the CI is not so clear tho; plugins you mostly need, contribs the same; you are left with jobs regarding multiple OCaml versions and test suites
and validate
that's some saving, but not enough
on the other hand indeed we could just not run the contribs
just the cheap set, and delay that to merge
indeed these failures are rare
Plugins yes, you need to test always, but they build fast. Libraries (what you call contribs I guess), you could test less often: only when there seems to be a risk of incompatibility. Then, the idea is that every time you get a failure that you didn't catch, that's a test-suite bug.
top of the line epyc sorry
unfortunately the kind of server I'm talking about is very efficient, but the price tag is well above 20K O_O
well there are cheaper options
but I was talking about 128 cores
256 threads
quite a beefy machine
huhu gitlab just crashed
ERROR: Preparation failed: Error: No such image: registry.gitlab.com/coq/coq:bionic_coq-V2020-05-19-V33 (docker.go:203:0s)
What's up with this?
where are you seeing this?
Gaëtan Gilbert said:
where are you seeing this?
I was seeing this on the CI for https://github.com/coq/coq/pull/11029 before I rebased it. Now that I've rebased, I'm seeing https://github.com/coq/coq/issues/12400 on fiat-crypto and also Fatal error: ref_table overflow on engine-bench, which I'm about to report
Reported as https://github.com/coq/coq/issues/12401. But what does "ref_table overflow" even mean?
Gaëtan Gilbert said:
where are you seeing this?
I just hit this again, and reported it as https://github.com/coq/coq/issues/12402 and https://gitlab.com/coq/coq/-/jobs/565521645
Jason Gross said:
Reported as https://github.com/coq/coq/issues/12401. But what does "ref_table overflow" even mean?
https://github.com/ocaml/ocaml/blob/b2f467abc0d2caaf40b6446a567f968ce935247f/runtime/minor_gc.c#L621
Quickchick looks broken, what's going on?
Isn't it coq/coq#12441?
I didn't see that somehow, thanks.
Why is MetaCoq broken on the CI? I don't see any commit that could have affected it upstream?
Maybe a change in Equations? There was https://github.com/mattam82/Coq-Equations/compare/74f048116242d7fb13cbc554522f02fe270720d..723f17f3f19302a2d6fb99cfb4983c9b465a9618 between https://gitlab.com/coq/coq/-/jobs/593210622 and https://gitlab.com/coq/coq/-/jobs/593573820 (last successful pipeline and first failing pipeline)
Summoning @Matthieu Sozeau
@Enrico Tassi I'm curious to know why you triggered a WINDOWS_ONLY / enabled_all_addons job on master. FTR there are a few known issues: dynlink errors when building with OCaml >= 4.08, some add-ons that are not in CI are not likely to build on master...
When did I?
I don't recall doing this, I can't find a reason. Is my account being hacked?
Is the "cron" job I started a month ago still on?
You make me worry
That might be the cron job indeed!
It might be better to deactivate it for the time being since we know it won't build.
OK, I trashed it.
Oh right, I forgot MetaCoq was following equations master. I'll fix that ASAP
Just saw this failed job where git fetch --unshallow failed, likely because some branch was removed between the time the repo was cloned and the time when the command was run. Is there any way of making git fetch --unshallow more robust, e.g. by unshallowing only the current branch?
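(Something like the following, I suppose, though I haven't tested it; CI_COMMIT_REF_NAME is GitLab's predefined variable for the branch being built:)
# deepen only the ref the job is building, instead of every branch of the remote
git fetch --unshallow origin "$CI_COMMIT_REF_NAME"
# or, if the server allows fetching by SHA, target just the commit under test:
# git fetch --unshallow origin "$CI_COMMIT_SHA"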
https://github.com/coq/coq/pull/9594#discussion_r261697603
Why not override GIT_DEPTH?
That's what I wanted to do at first, but nowhere in the documentation does it say what value to put if you want no limit (I suppose empty would work but this is unspecified).
let's just use empty GIT_DEPTH
It still isn't documented what it does when you set it to an empty value or zero (whether it is blindly substituted or not), so we should test what it does first.
@Jason Gross It seems that coq-tools are consistently failing, and the first failure in master happened when merging a PR which had a fully green CI, so I suspect the problem is coming from a change on your side.
https://github.com/JasonGross/coq-tools/commits/master
Oops, sorry, I forgot to test thoroughly before pushing, should be fixed now
Something has changed (not on the Coq side) which has broken Windows CI both on Azure (opam-based) and on GitLab CI (manual build of OCaml). See https://dev.azure.com/coq/coq/_build/results?buildId=11965&view=logs&jobId=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=f33089cd-d3fe-578c-2c5e-4ee560e10a0d and https://coq.gitlab.io/-/coq/-/jobs/748531263/artifacts/artifacts/buildlogs/ocaml-4.08.1-make-world.opt_err.txt.
(I've verified that the first failure on master happened when merging a PR whose CI was successful.)
@Théo Zimmermann : I think Cygwin recently updated the C compiler to a new major version - this could easily have all sorts of interesting effects. I think I can pin it to the older version, but this will not work forever since Cygwin typically keeps only the most recent two or three versions, although for gcc there might be more.
@Théo Zimmermann : actually it was the MinGW headers. I meant this announcement: (https://cygwin.com/pipermail/cygwin/2020-September/246341.html)
Interesting thanks! That sounds like a very plausible explanation. Not sure how I should understand your second message though (headers sound like a much less major change).
I just remembered that I read something about MinGW changes around the compiler but didn't remember the details. The mingw-gcc package doesn't seem to have recent changes (https://cygwin.com/packages/summary/mingw64-x86_64-gcc-core.html). Still, this static stuff failing might depend on some C-library-defined macros, so one can't exclude that it was this. The headers did change recently to a new major version, see (https://cygwin.com/packages/summary/mingw64-x86_64-headers.html). If this coincides time-wise, I would say this is a likely root cause.
Continued here: https://github.com/coq/coq/issues/13065
Is it normal for color to be broken on master?
:(
dpdgraph seems broken also, I hope it's not due to the PR I've submitted yesterday
(it was working properly locally, and I don't understand the logs enough to be certain of the root of the failure)
I confirm I broke dpdgraph alone :boom:
... but I swear I am not responsible for color TTBMK
for some reason now https://gitlab.mpi-sws.org/iris/stdpp/-/archive/.tar.gz is a 404 error, breaking the iris CI
ah, this is because the git ref is not correctly set
Théo Zimmermann said:
Yes, because if you have several PRs that are merged in a close window, you run just one test.
Hi, I've just seen this thread (just skimmed, sorry) and I wanted to ask if you were aware of the following GitLab CI feature?
https://docs.gitlab.com/ee/ci/merge_request_pipelines/pipelines_for_merged_results/merge_trains/
− anyway I guess this is probably not useful, as it specifically deals with GitLab MRs, not GitHub PRs… and IIUC, bors provides a similar feature? (I'm not a bors user), although maybe not using the very same algo…
Looks like a very old quote! Yes, there are many implementations of "merge-trains" "bors-like" features, but what is blocking us from using this system is that our CI is too unreliable, especially with the way we deal with backward-incompatible overlays.
So Iris seems reliably broken on the recent CI pipelines
see e.g. https://gitlab.com/coq/coq/-/jobs/779420849 or https://gitlab.com/coq/coq/-/jobs/778337747
Stdpp seems to have changed its build system according to one commit message, which might explain that
Indeed, so now we have to look for coq-iris.opam instead of opam.
I think @Ralf Jung changed something very recently.
Yep, the opam file got moved, so I wrote a fix
Oh so sorry, I meant to update the Coq side of it but then forgot :(
I can submit a PR if that helps
Nah, thanks, it should be solved by https://github.com/coq/coq/pull/13158 hopefully
ah but iris-examples will also change to match the new style, so I prepared https://github.com/coq/coq/pull/13159
feel free to just take that patch into your PR if that is easier
Right, so we have to be synchronized with the upstream merge
When are you planning to perform it?
I already did
so my PR should pass CI now
Indeed, I was just reading https://gitlab.com/coq/coq/-/jobs/779609904 :face_palm:
it got cancelled due to some limit or so it seems
Right, I'll close my PR then.
Uh, CI is not starting on https://github.com/coq/coq/pull/13168. Is that normal?
... despite two force pushes.
Coqbot is dead, long live coqbot?
Answering myself: GitHub is having problems with webhooks, it seems, according to https://www.githubstatus.com/
Travis is dead as well, this is a systemic issue
my latest commit to a coq-community project is not even reflected via pull request web interface
@Guillaume Melquiond Something surprising happened. CompCert started failing suddenly in Coq's CI in both the CompCert job and the VST job (which includes a bundled version of CompCert). This seems unrelated to any change in Coq since it failed suddenly in all builds with no related change. However, since CompCert and its VST copy are not simultaneously updated, I would have expected to see a change in the master branch of Flocq that would explain the failure, but the last commit shown at https://gitlab.inria.fr/flocq/flocq/-/tree/master is from 2 months ago.
See https://gitlab.com/coq/coq/-/jobs/936644456 for the failing log.
That is expected. CompCert is not compatible with Flocq's master. (Actually, not much will be compatible with Flocq's master, since it paves the way to Flocq 4.) CompCert's developers have been aware of it since early September: https://github.com/AbsInt/CompCert/pull/368. But there is nothing they can do about it for now. Anyway, this does not matter much for CompCert, since it vendors Flocq.
Thanks for your quick answer @Guillaume Melquiond, unfortunately I didn't notice until now.
But why did it start failing just now despite the last commit on Flocq's master being two months old?
And in Coq's CI we were not using the vendored version to avoid building Flocq three times. Do you have a release branch for Flocq 3 that you maintain and that we could use in Coq's CI instead of master?
I suggest you create a coq-master branch that we track in CI and in which you merge master manually, from time to time, when it fits (e.g. when CompCert catches up). I do it this way for elpi (I develop against Coq stable for example).
That is a conjunction of two things. First, Coq's CI stopped using CompCert's vendored Flocq only recently (in preparation of the 8.13 release). Second, I had temporarily stashed the main commits out of Flocq's master to prepare Flocq 3.4 (in preparation of the 8.13 release). Thus, the issue with Coq's CI went unnoticed because both things happened around the same time.
Just to be clear, it might take several years before CompCert catches up to Flocq's master. (First, Flocq 4 needs to be released. Second, CompCert needs to decide to drop Flocq 3 in favor of Flocq 4.) So, a coq-master branch does not seem like a good idea (or I misunderstood it). But you can decide to track the flocq-3 branch instead of master. That said, this is only delaying the issue. Because at some point, CompCert's master and CoqInterval's master will disagree (even if only temporarily) on the major version of Flocq they are compatible with.
CompCert's master and CoqInterval's master will disagree (even if only temporarily) on the major version of Flocq they are compatible with.
This is a serious issue for people working on verification of floating point C code. In the Coq Platform I think we have to pick the latest combination of versions which share the same version of Flocq until this is solved. I thought about whether it makes sense to have a separate package for flocq-4 during the transition period, but after thinking it through I came to the conclusion that this is better solved with separate opam switches for Flocq 3 based and Flocq 4 based tools.
I will switch Coq Platform master to track flocq-3 and see what happens.
it might take several years
We can maybe add 2 jobs for flocq, flocq-3 used by compcert, and flocq-4 (to be used at some point), with the plan to remove flocq-3 eventually.
@Enrico Tassi : as I said I also thought about this, but rejected it because it would require a lot of additional -3 and -4 packages, say for CoqInterval. My conclusion is that it would be better to have a separate switch - say an option in the Coq Platform like Flocq3 with full set of packages or Flocq4 with reduced set of packages. But it is more a 70%/30% pro/con thing.
But I guess it would make sense to have an opam package flocq.dev.3 or flocq.3.dev. Which one?
I was talking about Coq's CI, not the platform.
It is customary to have .$branch.dev so I'd pick the latter (in Coq it's v8.X.dev IIRC)
Coq's CI (for better or worse) is not based on opam, so a job is a bunch of commands to fetch and build a package. I'd pick only one version of each dependency under CI, moving them on top of flocq-4 once all are ready.
Let's test only flocq-3 for now. Testing two branches would be a bit inconvenient for overlay creation.
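(Concretely that should just mean pointing the Flocq entry of Coq's CI overlay description, dev/ci/ci-basic-overlay.sh, at the flocq-3 branch; a sketch from memory, the exact syntax of that file may differ:)
# track Flocq's flocq-3 branch instead of master
project flocq "https://gitlab.inria.fr/flocq/flocq" "flocq-3"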
@Guillaume Melquiond If you could give us a heads-up when you switch CoqInterval master to requiring Flocq 4, then we could introduce the two jobs at that point, or switch CoqInterval to a stable branch if you maintain one at that point.
elpi has been broken for 2 days, what is going on?
https://gitlab.com/coq/coq/-/pipelines?page=1&scope=finished&ref=master
@Enrico Tassi mentioned he was going to fix this today in another thread
It should be fixed now. The job reported by @Gaëtan Gilbert https://github.com/LPCIC/coq-elpi/issues/354 was successfully re-run
Current issue with itauto is waiting for this overlay to be merged: https://gitlab.inria.fr/fbesson/itauto/-/merge_requests/5
@Maxime Dénès Looks like jasmin is failing on CI, but I cannot understand why. Upstream on GitLab there are no failures: https://gitlab.com/jasmin-lang/jasmin
Thanks, I'll have a look
failing because of the ident/name change in https://github.com/coq/coq/pull/15754
Ah that's a shame, I should have restarted the CI before merging
I'll submit a patch to Jasmin
Here is the patch https://github.com/jasmin-lang/jasmin/pull/98
Fiat crypto is broken but that has been discussed. Unimath seems to be running for over 3 hrs.
it seems to be hovering around the limit, in some runs it passes and in some others it fails
Unimath dies because of an OOM rather than the 3h limit, it seems.
Command terminated by signal 9
UniMath/Bicategories/PseudoFunctors/Preservation/BiadjunctionPreserveInserters.vo (real: 1134.77, user: 440.04, sys: 8.37, mem: 2937224 ko)
build/CoqMakefile.make:790: recipe for target 'UniMath/Bicategories/PseudoFunctors/Preservation/BiadjunctionPreserveInserters.vo' failed
In the CI script we already unplug one suspiciously similar looking file from the compilation process
# DisplayedInserter consumes too much memory for the shared workers
sed -i.bak 's|DisplayedBicats/Examples/DisplayedInserter.v||' UniMath/Bicategories/.package/files
maybe we should do the same (or ask the unimath people to try to keep memory consumption at bay)
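(Concretely, that would be mirroring the line above; the relative path is taken from the failing log and assumed to follow the same convention as the DisplayedInserter.v entry:)
# hypothetical: unplug the file that now OOMs, same trick as above
sed -i.bak 's|PseudoFunctors/Preservation/BiadjunctionPreserveInserters.v||' UniMath/Bicategories/.package/files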
I don't see anything suspicious in that file, it's just that this is a computation-heavy dev that consumes ~4GiB of memory on my machine
is there an easy way to bump up the available amount of memory for the CI runners?
or should we just deactivate this file as we did for the other in the script (this seems to be dead code btw)
Shall we deactivate some files then?
I think we should ask the upstream to provide a target called, say, coq-ci, which tests relevant parts and runs in 1h of time.
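(Then the unimath job on our side would boil down to something like this; the coq-ci target name is the hypothetical one suggested above, and the clone URL is just the upstream GitHub repository:)
# shallow clone and build only the curated CI subset
git clone --depth 1 https://github.com/UniMath/UniMath.git
make -C UniMath coq-ci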
I've been staring at CI for like forever recently, we can't have a stack of jobs taking hours
I mean, it should be part of the CI requirements
From a CI point of view, unimath is both a very interesting dev and a terrible one. It's interesting because it stresses Coq a lot, and it's terrible precisely because of that.
I suppose unimath could be split into many smaller opam packages under a virtual package. This would allow us to test it much faster.
@Jason Gross fiat-crypto is now broken upstream, it seems (e.g. https://gitlab.com/coq/coq/-/jobs/2382855355)
Looks like there was a recent direct push to master https://github.com/mit-plv/fiat-crypto
Splitting into parallel packages is something orthogonal to our CI. We should describe requirements, not how to satisfy them, IMO.
About full unimath, I guess you want the bench to fully run it. But if your PR is not about performance (e.g. cleaning up an API) you simply gain nothing in running it fully. And you may still run a bench, but once.
Now VST is broken https://gitlab.com/coq/coq/-/jobs/2383881008
Enrico Tassi said:
Splitting into parallel packages is something orthogonal to our CI. We should describe requirements, not how to satisfy them, IMO.
About full unimath, I guess you want the bench to fully run it. But if your PR is not about performance (e.g. cleaning up an API) you simply gain nothing in running it fully. And you may still run a bench, but once.
I suppose we could update the policy to add a time limit. However jobs like fiat-crypto may be long but are useful to test in the CI since they have a lot of coverage. Do we play favorites with big jobs?
It is also possible to split it into smaller packages, which we could ask the fiat-crypto developers to help us set up.
bedrock, vst, fiat-crypto-legacy take 1:30h to finish, unimath and fiat-crypto take 2:30h
We could ask that CI projects keep their time under 2hrs and suggest splitting into smaller projects to allow for this.
But I think we need to be in agreement how long it should take for the CI to run. For me 3hrs per job is kind of OK, but I would be interested to hear other peoples opinion.
3h per job means that you can push/fix a PR once a day, at most. The first time you discover the failure, you burn 2h of CPU on your computer, you fix it, you push the overlay. Ditto when you "just rebase" a PR, or just fix the doc. This means anyone taking care of your PR has to act the next day if they wait for CI to be green (imagine you push at 3PM). IMO, it is totally out of control.
and now VST is broken as well: https://gitlab.com/coq/coq/-/jobs/2383649606
The first time you discover the failure, then you burn 2h CPU on your computer, you fix you push the overlay
@Enrico Tassi this was true before the introduction of the minimizer, but not anymore.
with the minimizer you burn 24h cpu on microsoft computers?
Right but once you have the minimized file, making a patch is way easier than running 2hrs every time.
^this
the minimizer is nice, but it does not cover the case of an overlay, only that of a bug in your new code.
If I read the VST error correctly, we now need to install Flocq3 in the CI.
Is there a way the minimizer could be made to help cover overlays? For example, the minimizer does have a mode where it doesn't admit lemmas. Or is the hard part of overlays that the first failure might not be the only one?
compcert now broken, I guess due to https://github.com/AbsInt/CompCert/pull/368 (flocq4 stuff)
why does everybody always synchronize right before the branch to break everything in the CI?
it's called "deadline approaching"
so what do we do with unimath?