Since the switch of the CI to the INRIA GitLab, I'm seeing quite a few timeouts, especially for metacoq but also the sempiternal suspects like fiat-crypto. The underlying runners are the same, right? If so what could have caused this issue and how can we fix it?
Just raise the timeouts, no?
Apart from being a bit ad-hoc, it's still not great to have such a humongous CI. But well, I get the idea.
What about the following (I unfortunately have neither the time nor the skills to implement it):
extensive_ci
that could be run only once per week/month (it could even ping someone responsible in case of failure, or run a bisect to point that person to the first failing commit).

We cannot reasonably exclude metacoq from the CI that runs on every PR, but maybe we could if it were split into smaller targets.
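For what it's worth, GitLab CI already supports this kind of job natively via scheduled pipelines. A rough sketch of what an `extensive_ci` job could look like (the `.ci-template` anchor and the 6h timeout are assumptions; the schedule itself would be configured separately under CI/CD > Schedules):

```yaml
# Hypothetical job that only runs in pipelines triggered by a schedule
# (e.g. a weekly schedule configured in the GitLab UI), not on PR pipelines.
extensive_ci:
  extends: .ci-template        # assumed existing job template in the repo
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  timeout: 6h                  # generous timeout, since this job is off the PR critical path
```

`CI_PIPELINE_SOURCE` is a predefined GitLab CI variable, so no extra plumbing is needed to distinguish scheduled runs from PR runs.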
In any case, I think that for now, the solution is indeed to just raise a few timeouts.
There you go: https://github.com/coq/coq/pull/18009
Pierre-Marie Pédrot said:
Apart from being a bit ad-hoc, it's still not great to have such a humongous CI. But well, I get the idea.
I fully agree. It would be better to rely on a test suite derived from specs for the majority of the coverage, and have only a few selected downstream projects in the CI, IMHO.
Aren't there also some "long-running" projects that can be greatly sped up by adding more cores to the runners? For example, both Verdi Raft and Fourcolor get a lot of speedup from, say, 4-8 cores.
Cores aren't free, unfortunately ;)
But yes, for now all jobs are running on a "medium" runner flavor. You can switch a few jobs to "large" if it helps.
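Switching a job's runner flavor is typically just a matter of its runner tags in `.gitlab-ci.yml`. A sketch, assuming the flavors are exposed as tags named `medium` and `large` (the actual tag names depend on how the Inria runners are registered):

```yaml
# Hypothetical override: route one job to the bigger runner flavor.
verdi-raft:
  extends: .ci-template   # assumed existing job template
  tags:
    - large               # assumed runner tag; "medium" would be the default flavor
```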
I'd recommend it for fourcolor at least ("large")
I wouldn't: the speedup from parallelizing fourcolor mostly comes from doing the same thing over and over. That's necessary and worth it when you want the complete proof, but in CI, if the first job passes, it's very unlikely that any of the other 600 fail, so it's just wasted CPU time. I'd rather have a "CI/lite" target compiling only one job.
OK, fine by me. At least Verdi Raft could use "large": due to its typeclass-based proof structure, all the parallel proofs will actually be used and will be quite different from one another.
If we're OK with having a broken master CI from time to time, it would be easy to define a set of jobs that are not required to pass on PRs before they are merged (they could still be requested on a case-by-case basis with e.g. `@coqbot run extra CI`) and that would only run on master. Obviously, the way to do the same without risking breaking master is a merge queue, but that is difficult to implement because of the moving overlays. But in my opinion, we could already save a lot of resources by implementing some caching mechanism, be it with Dune or with Nix (e.g., for when you push a changelog to a PR, or when you merge a PR and it re-runs exactly the same CI jobs in the same state as on the PR branch).
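At the GitLab level, the simplest form of this could lean on the built-in `cache:` keyword. A minimal sketch, assuming a `_build/` directory and keying the cache on the commit SHA so that re-running the exact same job (e.g. after a changelog push retriggers the pipeline) can reuse the previous build:

```yaml
# Sketch: reuse build state across identical re-runs of the same commit.
# Paths and key scheme are assumptions; a real setup would need to ensure
# the build tool (Dune) treats the cached tree as up to date.
build:
  cache:
    key: "$CI_COMMIT_SHA"     # same commit => same cache entry
    paths:
      - _build/
```

This wouldn't help across different commits the way a content-addressed Nix cache would, but it covers the "exactly the same job re-run" case mentioned above.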
Is there a way to detect when the CI is being run in a merge queue and have it run overlays that were introduced since master?
we don't use merge queue
My opinions about what we need to enable a merge queue haven't changed much since I wrote my comments in https://github.com/coq/coq/issues/6724.
Even the base test-suite is failing with timeouts these days: https://gitlab.inria.fr/coq/coq/-/jobs/3396742 This seems pretty new; is something going wrong with the new setup?
(A quick glance at the gitlab CI seems to indicate it was in the ballpark of 30 minutes before, so that's almost a x2 increase)
We're also having reliability issues with the workers, e.g. https://gitlab.inria.fr/coq/coq/-/jobs/3396633 which happened several times in the last few days.
FWIW, if there are random worker failures that are not solved rapidly, we can also use our existing coqbot infrastructure to auto-restart jobs that fail spuriously, as a workaround.
Last updated: Dec 05 2023 at 11:01 UTC