Stream: Coq devs & plugin devs

Topic: Timeouts in the CI


view this post on Zulip Pierre-Marie Pédrot (Sep 04 2023 at 06:29):

Since the switch of the CI to the INRIA GitLab, I'm seeing quite a few timeouts, especially for metacoq but also the sempiternal suspects like fiat-crypto. The underlying runners are the same, right? If so what could have caused this issue and how can we fix it?

view this post on Zulip Maxime Dénès (Sep 04 2023 at 06:48):

Just raise the timeouts, no?

view this post on Zulip Pierre-Marie Pédrot (Sep 04 2023 at 07:42):

Apart from being a bit ad-hoc, it's still not great to have such a humongous CI. But well, I get the idea.

view this post on Zulip Pierre Roux (Sep 04 2023 at 08:14):

What about the following (I unfortunately don't have neither time nor skills to implement it):

view this post on Zulip Théo Zimmermann (Sep 04 2023 at 08:33):

We cannot reasonably exclude metacoq from the CI that runs on every PR, but maybe we could do it if it was split into smaller targets.

view this post on Zulip Théo Zimmermann (Sep 04 2023 at 08:34):

In any case, I think that for now, the solution is indeed to just raise a few timeouts.

view this post on Zulip Pierre-Marie Pédrot (Sep 04 2023 at 09:03):

There you go: https://github.com/coq/coq/pull/18009

view this post on Zulip Maxime Dénès (Sep 04 2023 at 11:13):

Pierre-Marie Pédrot said:

Apart from being a bit ad-hoc, it's still not great to have such a humongous CI. But well, I get the idea.

I fully agree. It would be better to rely on a test suite derived from specs for the majority of the coverage, and have only a few selected downstream projects in the CI, IMHO.

view this post on Zulip Karl Palmskog (Sep 04 2023 at 11:15):

aren't there also some "long running" projects that can be greatly speeded up by adding more cores to runners? for example, both Verdi Raft and Fourcolor get a lot of speedup from say 4-8 cores.

view this post on Zulip Maxime Dénès (Sep 04 2023 at 11:16):

Cores aren't free, unfortunately ;)

view this post on Zulip Maxime Dénès (Sep 04 2023 at 11:17):

But yes, for now all jobs are running on a "medium" runner flavor. You can switch a few jobs to "large" if it helps.

view this post on Zulip Karl Palmskog (Sep 04 2023 at 11:17):

I'd recommend it for fourcolor at least ("large")

view this post on Zulip Pierre Roux (Sep 04 2023 at 11:31):

I wouldn't, the speed up in parallelizing fourcolor is mostly doing the same thing over and over. It's necessary and worth it when you want the complete proof but in a CI if the first job passes, it's very unlikely any of the other 600 fail so that's just wasted CPU time. I'd rather do a "CI/lite" target only compiling one job.

view this post on Zulip Karl Palmskog (Sep 04 2023 at 11:32):

OK, fine by me. At least Verdi Raft could use "large" - due to its typeclass-based proof structure all parallel proofs will be used and will be quite different

view this post on Zulip Théo Zimmermann (Sep 04 2023 at 12:11):

If we're OK with having a broken master CI from time to time, it would be easy to define a set of jobs that are not required to run on PRs before they are merged (they could still be requested on a case-by-case need with e.g., @coqbot run extra CI) and that would only run on master. Obviously, the solution to do the same without risking breaking master is to have a merge queue, but this is difficult to implement because of the moving overlays. But in my opinion, we could already save a lot of resources by implementing some caching mechanism, be it with Dune or with Nix (e.g., for when you push a changelog to a PR, or when you merge a PR and it re-runs exactly the same CI jobs in the same state as when you did that in the PR branch before).

view this post on Zulip Jason Gross (Sep 05 2023 at 22:49):

Is there a way to detect when the CI is being run in a merge queue and have it run overlays that were introduced since master?

view this post on Zulip Gaëtan Gilbert (Sep 06 2023 at 09:02):

we don't use merge queue

view this post on Zulip Théo Zimmermann (Sep 06 2023 at 09:23):

My opinions about what we need to enable a merge queue haven't changed much since I wrote my comments in https://github.com/coq/coq/issues/6724.

view this post on Zulip Pierre-Marie Pédrot (Sep 07 2023 at 09:57):

Even the base test-suite is failing with timeouts these days: https://gitlab.inria.fr/coq/coq/-/jobs/3396742 This seems pretty new, is there something wrong going on with the new setup?

view this post on Zulip Pierre-Marie Pédrot (Sep 07 2023 at 09:59):

(A quick glance at the gitlab CI seems to indicate it was in the ballpark of 30 minutes before, so that's almost a x2 increase)

view this post on Zulip Pierre-Marie Pédrot (Sep 07 2023 at 10:03):

We're also having reliability issues with the workers, e.g. https://gitlab.inria.fr/coq/coq/-/jobs/3396633 which happened several times in the last few days.

view this post on Zulip Théo Zimmermann (Sep 07 2023 at 10:26):

FWIW, if there are random worker failures that are not solved rapidly, we can also use our existing coqbot infrastructure for auto-restarting spurious failures to work around them.


Last updated: Dec 05 2023 at 11:01 UTC