Hi all,
Since Wednesday, we (@Maxime Dénès and I) have switched the GitLab CI of the Coq repository to use Inria GitLab, which allows us to have better managed and more reliable custom runners. There may be some rough edges in the beginning, so please report them here if that's the case.
@Erik Martin-Dorel I guess we should duplicate the webhook for Docker-Coq that was present on the Coq GitLab.com repo, correct?
What about the currently pending pull requests? Will they 1. fail, 2. use the old CI, or 3. use the new CI?
Actually, one of those instantly restarted when I took a look at it. Not sure if it is a pure coincidence, of if just looking at them is enough.
It's not magic, I've just restarted two runs to double-check...
Guillaume Melquiond said:
What about the currently pending pull requests? Will they 1. fail, 2. use the old CI, or 3. use the new CI?
Any pipeline that was started before the switch will run on the old CI and results will still be reported. Any pipeline started after the switch (including using coqbot commands) should normally run on the new CI, including on old PRs.
Actually, it seems to be running on the new CI, unless Pierre-Marie did something special when restarting it.
What did you do to restart the pipelines @Pierre-Marie Pédrot?
(deleted)
I just clicked on "retry failed jobs" in the interface
Hum, OK. For this feature, I would have expected it to re-start the jobs on the old CI, but maybe I'm unclear on how this particular feature works...
these pipelines had already run on the new CI, they failed because of some weird transient error
Ah OK, that explains it.
BTW, who should we ask to get rights to manually restart jobs?
What do you need for that? A gitlab account + some membership level in the Coq organization?
Yes, but note that you don't need that to restart failed jobs, as this is already doable from the GitHub interface.
@Pierre Roux Do you have a specific need?
Indeed, that's probably enough for most cases. No specific need but sometimes you just want to restart a specific job (for instance after having updated an overlay).
Indeed. We should probably extend the coqbot command language to give every regular Coq contributor the opportunity to perform these operations without needing an Inria GitLab account.
@Théo Zimmermann : would it help Coq Platform CI to do a similar move?
For sure, we could use it for the Coq Platform CI as well, but this means rewriting the GitHub workflows into GitLab CI pipelines.
Note that we can also do a partial move (e.g., start by only moving Linux CI, especially as I don't know whether having Windows / Mac runners would be easy on Inria's GitLab).
Théo Zimmermann said:
Erik Martin-Dorel I guess we should duplicate the webhook for Docker-Coq that was present on the Coq GitLab.com repo, correct?
I don't think so, the webhook you may be speaking about was useful to trigger a manual CI pipeline in an external repo (i.e., so that each push in coq.master triggers a new pipeline in docker-coq, which is not a mirror of coq, but a sibling repo)
So, this is not needed.
Pierre-Marie Pédrot said:
these pipelines had already run on the new CI, they failed because of some weird transient error
@Théo Zimmermann This makes me think that after this migration, this feature wish may be important to be addressed:
https://github.com/coq/bot/issues/289
just to summarize the context: I noticed in gitlab.inria.fr/math-comp/docker-mathcomp that when there are pipeline with many jobs (here, 49) and each job requires a docker pull
from Docker Hub, the probability of a spurious network failure is high, and having coqbot retry only 3 times automatically and that's all, can be a major usability issue.
Erik Martin-Dorel said:
Théo Zimmermann said:
Erik Martin-Dorel I guess we should duplicate the webhook for Docker-Coq that was present on the Coq GitLab.com repo, correct?
I don't think so, the webhook you may be speaking about was useful to trigger a manual CI pipeline in an external repo (i.e., so that each push in coq.master triggers a new pipeline in docker-coq, which is not a mirror of coq, but a sibling repo)
So, this is not needed.
Yes, but we have stopped the mirroring of Coq to GitLab.com, so the Docker-Coq repository will not receive any new webhooks when the Coq repo is updated.
Ah yes! You're right then, the migration of the webhook is needed as well.
Let me know if I have to check the pipelines logs tonight, after you setup this one.
Erik Martin-Dorel said:
https://github.com/coq/bot/issues/289
just to summarize the context: I noticed in gitlab.inria.fr/math-comp/docker-mathcomp that when there are pipeline with many jobs (here, 49) and each job requires a
docker pull
from Docker Hub, the probability of a spurious network failure is high, and having coqbot retry only 3 times automatically and that's all, can be a major usability issue.
By curiosity, I went and looked at the issue. It is great that you have exponential backoff, but you are missing an important piece. You need to make the sleep length random. For example, if your current backoff is 25 seconds, you should not sleep for 25 seconds; you should sleep between 25 and 25*2=50 seconds. Otherwise, your 49 failing jobs will all wake up at the exact same time, and thus they will fail again.
Thanks for the comment. I think it deserves to be copied over there ;-)
@Théo Zimmermann : thanks - I will think about it, but I think I prefer a solution which supports all platforms.
Another solution is of course to plug in custom GitHub Actions runners to overcome the 6h limit.
Yes. I had a discussion with Romain Tetely (he works with Enrico) yesterday and he want to look into this (using INRA HW).
Hi folks, could you maybe update the rest of the CI team on the status and plan for this migration?
In particular I have a few comments:
in the previous attempt to migrate to inria.gitlab.fr we found problems with the limit in artifact size, was this solved?
I'm not sure what was the limit previously, but it doesn't seem to have hit us this time.
should we take advantage of the extra tags availabe in Inria CI ? I was able to use those to great benefit actually
That's irrelevant to us AFAICT because we are not using the shared runners but another infrastructure that allows us to get pre-configured custom runners. I hope that @Maxime Dénès will be able to do a write-up with links to relevant Inria doc, so that it is easy for others to know what's going on and how to maintain it. AFAIU, the shared runners were too few for the large usage we would have made of it.
what other steps you had to do to adapt our current setup?
Apart from the preliminary tests of Maxime and his interaction with the Inria SED team behind the custom runners infrastructure, not much, because coqbot had already been adapted previously (with Erik) to support pushing to alternative GitLab instances.
is documentation updated? Does the new setup require extra documentation? I see that we our CI docs still mention gitlab.com everywhere
It is not. It has been on my to-do to update it for a few weeks already.
I see, I was not aware of the custom runner setup, thanks for the update!
Let me know if you need help with the docs update, we could maybe take this chance to rework a bit the CI docs together like for an hour if you want, and do the rests of TODOs async
What are the specs of the custom runners? The large shared runners were pretty nice, so maybe we could selectively use those, but we need to document what Inria SED said about it I guess.
From what I've understood, Inria SED said to not use the shared runners for Coq, because the load would be too big. I don't know the spec of the custom runners. I hope they are documented somewhere or will be soon. Maybe we can tag @Thierry Martinez, who probably knows everything about this.
The documentation for the custom runner is here: https://gitlab.inria.fr/inria-ci/custom-runner/ . It's still in beta: you are welcome to fill new issues (using the same link) if there are unexpected failures. The Coq CI already helped a lot to improve the implementation of the runners.
Thanks @Thierry Martinez , oh so the custom runners we are using now, are really managed too, right?
The setup looks pretty nice!
Thanks to this doc, I realize that there is macOS and Windows support too, so this means that we could migrate our GitHub workflows to use this if we want to. And it could help for the Coq Platform as well.
Yes and we get the large runners too, that can really reduce the critical path
for example if we implement https://github.com/coq/coq/issues/18050 , a large runner can make the compilation of all plugins pretty quick (one min or so I'd predict)
I wonder what I should do with Coq Platform CI. Should I migrate it to gitlab, split it into many small projects ...?
Unless you use some advanced GitHub Actions that would be difficult to replace in a GitLab workflow, migrating to GitLab (Inria) could indeed be a reasonable option. Something to discuss with Romain, I suppose!
I have a meeting with @Romain Tetley tomorrow and will address this. We need to come back to build full/extended installers in CI.
BTW looks like we're already using about as much artifact space on inria as on gitlab.com
https://gitlab.inria.fr/coq/coq/-/usage_quotas#storage-quota-tab
https://gitlab.com/groups/coq/-/usage_quotas#storage-quota-tab
That's huge and worrying!
I guess artifact expiry didn't quick in yet? I wonder what the "stable" size is. 1 TiB of data is indeed a lot, on the other hand seems about right for our load, if that's properly cleaned up.
If that's a problem we could optimize quite a few things I think.
we had https://docs.gitlab.com/ee/administration/settings/continuous_integration.html#keep-the-latest-artifacts-for-all-jobs-in-the-latest-successful-pipelines checked (I'm unchecking now, it was already unchecked on .com)
Nice @Gaëtan Gilbert , maybe we should start collecting these kind of tweaks in a document for the doc update?
I tend to forget all these details very quickly
Tho that setting should not have a large impact, right? The doc is a bit ambigous.
with it checked we keep artifacts for all the old open PRs
impact unclear tbh
Oh yes, that is because we push the branches to gitlab
I was thinking on the case of the main repos where we only have the official branches
that's for sure quite a bit of impact, yes
Added some notes to the wiki https://github.com/coq/coq/wiki/State-of-the-continuous-integration-infrastructure#migration-to-gitlab-ci
So we can get back to them when updating the docs
I don't know if it is already known, gitlab.inria.fr will be unavailable for maintenance from Monday 16 Oct 1pm to, at latest, Tuesday 17 oct 9am (Paris time).
Do we want to switch back temporarily to GitLab.com during these planned maintenance downtimes?
If it's only half a day we can probably live with it (I mean, with fiat-crypto, that's just one or two CI cycles).
Note that if GitLab Inria is down, any PR updated during the meantime won't have any GitLab CI running at all. It will not just be delayed like with email.
I guess the downtime is small enough so we can retrigger the workflows manually.
Yep. Using @coqbot run full / light ci
is enough to retrigger missing pipelines
IIUC, we have now moved away from GitLab.com's CI so I don't need to renew the OSS program application, right? @Théo Zimmermann, @Maxime Dénès ?
We are still using it for the Opam archive.
Ah, ok
And I think that since the migration is very recent, it's safer to keep the option to go back to GitLab.com, at least for the coming year.
In one year, we should have gained enough experience, and possibly we will have migrated the opam archive to GitLab Inria as well. We'll see.
hmm, one drawback of Inria GitLab is that non-Inria people will all need to maintain sponsored accounts, right? adds overhead to co-maintainerships
Yes if they need to interact with GitLab directly, but I think we can minimize the number of interactions needed. Currently, it is already possible to restart failed jobs from the GitHub interface. We should add support to allow running other common operations from GitHub.
I unregistered the runner by mistake from GitLab: I registered it again, but as far as I know, I can't preserve it enable on the projects it was previously enabled, sorry for this. Can anyone with admins right on gitlab.inria.fr/coq/coq enable inria-ci/custom-runner again in https://gitlab.inria.fr/coq/coq/-/settings/ci_cd? (If it is not already the case, I have to add the person to gitlab.inria.fr/inria-ci/custom-runner first.) Thanks, and sorry again. Hopefully, such inconveniences will not happen when the custom runner will be a shared runner in GitLab sense (which should happen soon now).
I can do it if you add me
ping @Thierry Martinez
maybe @Maxime Dénès can act too?
CI is dead until we add the runners back
I can't add somebody as I'm not owner in that org. I'll ask someone who is.
Oh, in fact I can. @Gaëtan Gilbert should be ok now.
I have pushed the button
Inria GitLab (https://gitlab.inria.fr/) seems to be down in general, is this planned?
Iirc, this was planned, but it was announced to be for just one hour.
Actually, it was two hours: "interruption de service entre 17h et 19h".
getting more of these python errors https://gitlab.inria.fr/coq/coq/-/jobs/3466937
getting more of these python errors
Can it be related to an empty list of vm?
Last updated: Nov 29 2023 at 22:01 UTC