It's becoming urgent: I just got (and probably so did you) a GitLab email alarmingly stating
coq namespace has 30% or less Shared Runner Pipeline minutes remaining. Once it runs out, no new jobs or pipelines in its projects will run.
I think they changed how they count CI usage
counted usage:
usage.png
actual time spent:
time.png
it used to be time = usage * 25
but this month it's time = usage * 2
so now we are running out
see also https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html#cost-factor
0.5 for public projects in the GitLab for Open Source program.
1 for other public projects, after October 1, 2022 (previously 0.04).
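In other words (a rough reading of the cost-factor doc, not an official formula), actual runner time relates to counted usage via the cost factor, which explains the ×25 vs ×2 above:

$$
\text{actual time} = \frac{\text{counted usage}}{\text{cost factor}}, \qquad \frac{1}{0.04} = 25, \qquad \frac{1}{0.5} = 2
$$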
we could try to have core devs open forks and run their pipelines there (forks get a 125× factor on 2k free minutes instead of our 2× factor on 50k free minutes)
the permission juggling could get complex though
we could buy more (we seem to be using about $50/day), though I don't know where the money would come from
we could try moving to gitlab.inria.fr
we could try getting our own machines or vms somewhere
anyway at the current rate CI will break in a couple days
cc @Maxime Dénès (you probably need to be aware, especially if the solution ends up being buying runners somewhere)
I feel stabbed in the back by GitLab because they did not announce this change of factors for public projects widely enough for us to notice it (they could have sent an email before it was almost too late!).
Maybe it's time to consider implementing https://github.com/coq/bot/issues/159 urgently.
I propose to urgently disable most CI jobs by default (depending on the value of a variable).
Then we can add the necessary support to coqbot to handle this variable automatically.
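For concreteness, a minimal sketch of the kind of variable-based gating proposed here; the variable name FULL_CI matches what is used later in this thread, but the rule syntax, job name, and command are illustrative, not the actual patch:

```yaml
# Hypothetical sketch: heavy jobs only run when FULL_CI is set to "true",
# e.g. by coqbot or by manually starting a pipeline with that variable.
.full-ci-only:
  rules:
    - if: '$FULL_CI == "true"'
      when: on_success
    - when: never

library:ci-fiat_crypto:        # illustrative job name
  extends: .full-ci-only
  script:
    - make ci-fiat_crypto      # placeholder command
```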
Is migrating some of the jobs to GitHub Actions a potential solution for the short term?
Not unless we get some GitHub sponsorship IMO; otherwise our CI will quickly be throttled by the 4-parallel-job limit.
According to the new cost factor, we are allowed 100,000 shared runner minutes. Assuming that we do not change the way opam-coq-archive works and based on the September stats (26,000 minutes for opam-coq-archive, 782,000 minutes for Coq), we should divide our CI consumption by more than a factor 10 on the Coq repository to stay in the new limits.
Unfortunately, even if we could achieve this instantly, it would have to be even more (like a factor 30) for the month of October, since we've already spent close to 70% of the available minutes in the first three days.
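A rough back-of-the-envelope for those two factors (assuming September-level usage and that opam-coq-archive keeps its ~26,000 minutes):

$$
\frac{782{,}000}{100{,}000 - 26{,}000} \approx 10.6,
\qquad
\frac{782{,}000 \times \tfrac{27}{30}}{0.3 \times 100{,}000} \approx 23,
$$

and the October figure climbs further toward the "factor 30" once opam-coq-archive's share of the remaining budget is reserved.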
:bomb:
So this is no-CI October then?
I actually think it's a realistic target to achieve if we cut on the long-running jobs most of the time.
That's why I was saying:
I propose to urgently disable most CI jobs by default (depending on the value of a variable).
Urgently, as in now?
If we consumed 70% in 3 days, that's about 1% per hour.
Yes, as in today.
Who's doing that? I'm personally not CI-aware enough to write a patch, and this is a red alert :rotating_light:
I can
please!
I propose to merge your PR as soon as its CI finishes.
(Unironically, that is.)
Do we still use Inria VMs for our CI?
If not I can look into adding some runners there this week
we don't
the only remaining VM we have is coq-osx1, is it safe to delete?
it eats up a part of our memory quota
Let's discuss this in a call, but I think we can:
- use gitlab.inria.fr (it worked fine, see my branch and the experiments), only problem is that we need to ask for a bit larger artifact size limit
- deploy our runners
@Théo Zimmermann , instead of disabling jobs, why don't we tag them so they are only picked by our runners?
Maxime Dénès said:
the only remaining VM we have is coq-osx1, is it safe to delete?
It is not. AFAICT, it is still used to store the macOS signing key (Michael downloads it on his computer to sign packages, then deletes it).
There are some other options to further optimize this, as discussed in https://github.com/coq/coq/issues/16201
Ok, noted. We still have enough quota to create some additional runners, though.
Emilio Jesús Gallego Arias said:
Théo Zimmermann , instead of disabling jobs, why don't we tag them so they are only picked by our runners?
If we don't disable most CI jobs now, we will have lost all of our remaining CI minutes. But of course, I'm in favor of deploying our own runners to pick up some of the load as soon as we can realistically do that.
Maxime Dénès said:
If not I can look into adding some runners there this week
It is not worth doing that with the VM infra; it is better to directly use the runners provided by gitlab.inria.fr
Emilio Jesús Gallego Arias said:
Let's discuss this in a call, but I think we can:
- use gitlab.inria.fr (it worked fine, see my branch and the experiments), only problem is that we need to ask for a bit larger artifact size limit
- deploy our runners
do you have a pointer to the branch?
Théo Zimmermann said:
If we don't disable most CI jobs now, we will have lost all of our remaining CI minutes. But of course, I'm in favor of deploying our own runners to pick up some of the load as soon as we can realistically do that.
Well, either by disabling most of CI or by running out of minutes, we still lose CI so we are in shit.
Maxime Dénès said:
Emilio Jesús Gallego Arias said:
Let's discuss this in a call, but I think we can:
- use gitlab.inria.fr (it worked fine, see my branch and the experiments), only problem is that we need to ask for a bit larger artifact size limit
- deploy our runners
do you have a pointer to the branch?
Yup, one sec
I think that the fastest is to migrate to gitlab.inria.fr
provided Inria allows us enough quota
Emilio Jesús Gallego Arias said:
It is not worth doing that with the VM infra; it is better to directly use the runners provided by gitlab.inria.fr
No, that is not possible. The shared runners available in the Inria Gitlab are just a handful. We will bring the whole infrastructure to its knees if we dump the whole Coq CI on them.
https://gitlab.inria.fr/egallego/coq/-/pipelines/545950
That's the latest run I think
Guillaume Melquiond said:
Emilio Jesús Gallego Arias said:
It is not worth doing that with the VM infra; it is better to directly use the runners provided by gitlab.inria.fr
No, that is not possible. The shared runners available in the Inria Gitlab are just a handful. We will bring the whole infrastructure to its knees if we dump the whole Coq CI on them.
@Guillaume Melquiond do you have a pointer to the resource allocation? I have a good idea of the compute power we need; it is actually not that much.
The current state of the Inria runners is: "For now there are 8 slots for the small runner, 4 for the medium and 11 for the large."
But are these real or virtual slots?
I mean per-project
Emilio Jesús Gallego Arias said:
404
Permission error...
No, that is for the whole instance. No more than 11 jobs can run in parallel globally.
I store private branches there so I can't make ejgallego/coq public, let me see
I think that the GitLab CI fallout is anyway a reminder that we have been burning lots of energy without paying enough attention and I honestly believe that the only solution is to be more careful about how we use CI.
https://github.com/ejgallego/coq/tree/inria-ci that's the last version I tried
Note that what I proposed is not to lose our CI but to not run the full version by default at every push on any PR.
Yes
@Gaëtan Gilbert you can push that branch to your own fork on inria CI and it should produce a CI
I can't open my own repos right now
Théo Zimmermann said:
I think that the GitLab CI fallout is anyway a reminder that we have been burning lots of energy without paying enough attention and I honestly believe that the only solution is to be more careful about how we use CI.
Yes, there is a lot we can optimize actually
We can't really blame gitlab for restricting this
Free CI can't be unlimited, the economics just can't work
Note that even with the tweak to the gitlab files, all PRs currently open and not rebased on master will trigger a full build.
We can't really blame gitlab for restricting this
No, we can't. The only thing I blame them for is not communicating more clearly a bit ahead of time.
Pierre-Marie Pédrot said:
Note that even with the tweak to the gitlab files, all PRs currently open and not rebased on master will trigger a full build.
Hum, no. coqbot does a merge with master and pushes that commit to the GitLab mirror.
Ah, that's good news in an ocean of trouble then.
I'm surprised to hear that our load (which is not so big IMHO) would bring the Inria CI infra to its knees tho
I revised my estimates, but a single reasonably modern server should handle our load just fine
Maybe it's time to whine to the inventeurs du monde numérique?
Do we not have the budget to buy more minutes?
Also moving to GitHub actions could be possible since there is an "enterprise plan" allowing 180 concurrent workers.
https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration
However we would lose the ability to inspect artifacts
Pierre-Marie Pédrot said:
Maybe it's time to whine to the inventeurs du monde numérique?
For sure: if not for what Guillaume said, that would be a very good solution, as gitlab.inria.fr would be a drop-in replacement for us
so we would just need to push my branch to Coq, and update the bot to link to gitlab.inria.fr
but indeed we need to whine^W discuss with the people in charge
I understand this is way down the priority ladder, but this affects the Coq opam archive CI too, right?
@Gaëtan Gilbert do you mind pushing the inria-ci branch to your own gitlab.inria.fr fork so we can get an Inria run?
what's the point?
I think everyone believes that it works
Not @Guillaume Melquiond it seems. I'd be curious to see if the prophecy is true. 2022/10/03, aka "The day the Inria Gitlab stood still".
Karl Palmskog said:
I understand this is way down the priority ladder, but this affects the Coq opam archive CI too, right?
Yes, but the opam-coq-archive CI is not so big. If we were to remove or sufficiently scale down the Coq CI, then opam-coq-archive could continue its business-as-usual.
Gaëtan Gilbert said:
what's the point?
I think everyone believes that it works
I think everybody believes that it works at a small scale. The question is just whether it will handle the full scale.
But it is easy enough to test actually.
We can just create an Inria GitLab mirror for Coq now and double-push every PR to both the current repo and the new one (and set FULL_CI=true over there).
Then we can see if we bring Inria's infrastructure down to its knees or not.
The Inria admins will surely appreciate.
Gaëtan Gilbert said:
what's the point?
I think everyone believes that it works
No, it doesn't fully work, we hit some limits on artifact size
A single push will work modulo the artifact size problem, but of course we should talk to the responsible person before pushing dozens of jobs
The question is: what is the least amount of work? Is talking with the Inria admins about the CI the quickest path?
I don't know
I think exploring multiple routes in parallel is probably the safest path.
also more costly tho
@Théo Zimmermann Don't you also want to disable the test-suite+async job? This one never terminates on time and it takes incompressible hours for no good reason.
Have we done some math to check whether this disable-CI work is actually going to pay off?
No time for maths, this is the STOP blinking red button.
I am afraid that in the end it is just wasted work if, even after the savings, it doesn't work out for us and we are left without CI
Unfortunately, I don't have precise enough data to do this math, but from some plots by @Jason Gross that I have seen, I think that this should work.
This is the STOP blinking button, but imagine we are not pressing the right STOP button but the AUTO-DESTRUCT button
Emilio Jesús Gallego Arias said:
I am afraid that in the end it is just wasted work if, even after the savings, it doesn't work out for us and we are left without CI
This was some work that I wanted to do anyway. I just didn't want to do it now.
so we rather read the labels on the buttons carefully before pushing
Recall that this issue https://github.com/coq/bot/issues/159 was opened in 2021 after a discussion that happened in 2020.
Any button is good to press when you're bound to fall into the void in a split second.
Merge first, talk later.
Théo Zimmermann said:
Emilio Jesús Gallego Arias said:
I am afraid that in the end it is just wasted work if, even after the savings, it doesn't work out for us and we are left without CI
This was some work that I wanted to do anyway. I just didn't want to do it now.
Yes, trying to figure out what to build is a very interesting topic; we have been discussing it in Coq-Universe too.
Pierre-Marie Pédrot said:
Any button is good to press when you're bound to fall into the void in a split second.
Yes, but are we bound to fall into the void in a split second? I don't see such urgency here.
Do we all agree that we don't want to drop CI for merges to main, right?
If so, let's keep calm and try to assess what our best path is.
Using FULL_CI + the bot could be a good strategy, or maybe it doesn't work because we will still run out of minutes
gitlab.inria.fr seems way more promising but we need to understand if Inria can provide us the right capacity
Deploying our own runners + tagging jobs could be a good solution, but I'm not sure if that's possible / what the status of the servers is
Acting in panic rarely leads to a good outcome
Emilio Jesús Gallego Arias said:
Using FULL_CI + the bot could be a good strategy, or maybe it doesn't work because we will still run out of minutes
I don't know if it will be a good long-term strategy or not but it buys us some time to consider which strategy is best.
Honestly I don't see what could go wrong with deactivating everything. Thanks to the magic of VCS, worst case we can revert or tweak the result.
But really, the thing we don't have right now is time. If 1% goes off every hour we're really screwed.
It's not really 1% every hour BTW: if someone opens a PR now, it probably will be more than 1% spent.
in that case, maybe consider disallowing PRs via GitHub for a while rather than have people open them by "mistake"
BTW, I just computed how much it would cost paying for the minutes we need without doing anything. That would be about €3500/month. Technically, we could pay for one month until we figure this out. That's still a lot of money to throw away I think.
Karl Palmskog said:
in that case, maybe consider disallowing PRs via GitHub for a while rather than have people open them by "mistake"
No need now: https://github.com/coq/coq/pull/16600 was merged.
I found this announcement from last year: https://about.gitlab.com/blog/2021/11/11/public-project-minute-limits/
Did something change on top of this?
I thought Coq would be exempt from the minute counting.
Pierre-Marie Pédrot said:
Not Guillaume Melquiond it seems. I'd be curious to see if the prophecy is true. 2022/10/03, aka "The day the Inria Gitlab stood still".
Let us be clear. There is currently a single 64-core server that is used for the whole Inria CI. It won't be long before we are forbidden to use it, if we are constantly hammering it with jobs from Coq's CI.
@Ali Caglayan See https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/Too.20many.20pipelines/near/302079565
This clearly changed on Oct. 1st
Although we were not aware of this change having been announced.
Can we contact GitLab since this is somewhat absurd?
It could even be that an engineer accidentally omitted the 5.
They have updated the doc, so it looks like a change on purpose.
But sure, if there were no announcements, we should probably contact them to say that we need time to adjust.
I personally haven't researched if there were any announcement or twitter discussion about this.
We would only max out a 64-core server on special weeks, for the rest that'd be fine
Pierre-Marie Pédrot said:
Honestly I don't see what could go wrong with deactivating everything. Thanks to the magic of VCS, worst case we can revert or tweak the result.
Not sure I understand what you propose here. Deactivating everything means we don't have CI, so who's gonna do the testing / revert when a PR lands and breaks everything?
If I understand correctly, as the PR has been merged, should we block PR merges too?
There is a "talk to an expert" button on https://about.gitlab.com/sales/. Can somebody ask them what is going on?
something like this: (I won't send it)
Hi, you seem to have changed how the minutes for the runners are being calculated. The doc also reflects this change.
I find this somewhat absurd as there was no announcement to the said change and it is heavily affecting our development CI pipeline.
Since the Coq organization is paying for the runner services, it seems somewhat predatory that the calculation of minutes can be changed like this.
Please clarify the situation.
- Coq Development team
we're not paying afaik
Like @Guillaume Melquiond, I'm not confident at all that Inria's Gitlab CI will handle the load, especially if several large projects migrate simultaneously
What I would do, is in parallel:
I can initiate both.
I think we can sequence the operations:
I agree it's better to not rush 2 too much. Doing a bit of temporary work for 1 does not seem problematic to me.
Yes, we need to address it. Still, TV news are not mentioning that we have CI issues, so I guess we can take the time to do things in order :)
Let's create a channel for communicating the actions that we take on this CI issue. We will use this one to discuss the situation and what should be done.
Was there some communication done to coq devs about the process they should currently follow to merge PRs?
Should I refrain from pushing to PRs for the time being?
@Gaëtan Gilbert Should @Ali Caglayan rebase his PRs to get only the reduced CI or not?
The PR I am thinking about already has a lot of the jobs removed, but I am unsure how severe the situation is today. I don't want to use up resources that could be better used for others.
Maxime Dénès said:
Gaëtan Gilbert Should Ali Caglayan rebase his PRs to get only the reduced CI or not?
no need to rebase, coqbot auto merges
Ali Caglayan said:
Should I refrain from pushing to PRs for the time being?
push on, it will help tell us if we reduced enough
Looks reduced https://gitlab.com/coq/coq/-/pipelines/657296145
I came a bit late to this thread. I guess Coq Platform is not affected since I am using only GitHub actions?
I guess so, this seems to be a gitlab.com policy change.
That's very bad news :-(
BTW regarding the docker-coq infrastructure in coq-community: the docker-keeper tooling specifically relies on GitLab CI.
Migrating to GitHub Actions is just not feasible (e.g. YAML anchors are not supported by GHA, nor the generation of children pipelines from a dynamic YAML…)
Fortunately, the auto-build/deployment of coqorg/* images should not be as costly in CI/CD minutes as PR pipelines.
But I cannot check the quotas myself in https://gitlab.com/groups/coq-community/-/usage_quotas#pipelines-quota-tab as only @Théo Zimmermann can → you may want to take a look at the row for coq-community / docker-coq, in order to extrapolate
Here is the GitLab issue with the changes: https://gitlab.com/gitlab-org/gitlab/-/issues/337245
I googled a bit more and, according to this blog article https://about.gitlab.com/blog/2021/11/11/public-project-minute-limits/ that I had not spotted before, this drastic change is intended to fight cryptomining via CI pipelines.
But it's a pity that FOSS projects like Coq, which consume much more pipeline time than the average project, are badly impacted by these changes… Maybe we could reach them at the email opensource@gitlab.com mentioned in https://about.gitlab.com/handbook/marketing/community-relations/opensource-program/ just to let them know (?)
my 2 cents…
Finally FYI, it seems that there's another thing to double-check now:
the Namespace Storage Quota: https://gitlab.com/groups/math-comp/-/usage_quotas#storage-quota-tab
as according to https://docs.gitlab.com/ee/user/usage_quotas.html#namespace-storage-limit-enforcement-schedule
Storage limits for GitLab SaaS Free tier namespaces will not be enforced prior to 2022-10-19. Storage limits for GitLab SaaS Paid tier namespaces will not be enforced prior to 2023-02-15.
IINM, coq and coq-community are not impacted for now, unlike math-comp, which is still in the Free Tier…
(see e.g. this topic)
I don't quite see how this change keeps people from crypto mining in CI if forks remain at 0.008 - who prevents people from forking their own projects? I guess one could find more intelligent methods for preventing such abuse.
Maybe it is possible to talk to them and get an exception.
Ok, I got something automated enough (snapshots + Ansible playbook). I should be able to add ~8 runners tomorrow.
Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?
Michael Soegtrop said:
Maybe it is possible to talk to them and get an exception.
I very much doubt so. Our quota of 50'000 minutes corresponds to their "ultimate" offer, which normally costs $99 per user and which we are currently getting for free. There is not even a higher tier for paying customers. So, I don't see how they could provide an even higher tier for non-paying customers. Also keep in mind that their "free" offer is just 400 minutes, so we are currently getting 125x that. In other words, we are very lucky and we should not push our luck.
Maxime Dénès said:
Was there some communication done to coq devs about the process they should currently follow to merge PRs?
Communication happened via this message: https://github.com/coq/coq/pull/16600#issuecomment-1265773392. I know it's not ideal, but any dev with push rights should have gotten it.
Gaëtan Gilbert said:
that's actually a setting https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html#keep-artifacts-from-most-recent-successful-jobs so maybe we should unset it
Yes, we should definitely use it. Did something change recently? It didn't use to be the case that old PRs had their artifacts kept.
Erik Martin-Dorel said:
Fortunately, the auto-build/deployment of coqorg/* images should not be as costly in CI/CD minutes as PR pipelines.
But I cannot check the quotas myself in https://gitlab.com/groups/coq-community/-/usage_quotas#pipelines-quota-tab as only Théo Zimmermann can → you may want to take a look at the row for coq-community / docker-coq, in order to extrapolate
The entire coq-community organization's record CI consumption was 18,000 minutes, in February 2022. We are allowed 100,000. So no, Docker-Coq will not be affected. The case of the Coq project is really special since we are spending so much.
Regarding the storage use, it is 6.4 GiB for coq-community including 3.41 GiB for Docker-Coq artifacts and 2.8 GiB for the coq-performance-tests repository (we could stop mirroring it there if needed).
Guillaume Melquiond said:
Michael Soegtrop said:
Maybe it is possible to talk to them and get an exception.
I very much doubt so. Our quota of 50'000 minutes corresponds to their "ultimate" offer, which normally costs $99 per user and which we are currently getting for free. There is not even a higher tier for paying customers. So, I don't see how they could provide an even higher tier for non-paying customers. Also keep in mind that their "free" offer is just 400 minutes, so we are currently getting 125x that. In other words, we are very lucky and we should not push our luck.
The only thing that might be reasonable to ask is for a delay in order to resize our CI infrastructure.
Emilio Jesús Gallego Arias said:
We would only max out a 64-core server on special weeks, for the rest that'd be fine
Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average. Given our memory consumption, it's not clear that 18.5 runners would fit on a single 64-core server (it would need lots of RAM) and this is not even the usage peak, this is the average over a full month, days and nights.
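For reference, the arithmetic behind those figures (over a 30-day month):

$$
\frac{800{,}000\ \text{min}}{60 \times 24\ \text{min/day}} \approx 555\ \text{runner-days},
\qquad
\frac{555}{30\ \text{days}} \approx 18.5\ \text{runners busy around the clock}.
$$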
Maxime Dénès said:
Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?
Why should we disable shared runners in case you add VMs without tags? BTW, 8 runners won't be sufficient to entirely replace our shared runner usage if we go back to our full CI according to my calculation above.
Ok, so we'd need a combination of reduced CI + self hosted runners, right?
(and studying a migration to gitlab.inria.fr in parallel, that is talking with the guys about the capacity of their service)
Yes, I think this is a good summary.
Théo Zimmermann said:
Emilio Jesús Gallego Arias said:
We would only max out a 64-core server on special weeks, for the rest that'd be fine
Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average. Given our memory consumption, it's not clear that 18.5 runners would fit on a single 64-core server (it would need lots of RAM) and this is not even the usage peak, this is the average over a full month, days and nights.
Inria runners are considerably faster, so the amount of minutes would be less.
Also, a few jobs are responsible for 60% of the build time (see the analysis when we discussed removing the async job)
Coq Universe provides a good reference IMHO of how much a full CI costs: around 1h on an 8-core machine with 32 GiB of RAM
That's without caching
in a Github actions runner, the vo build will take like 2.5 hours
with just 2 cores
with caching it varies
Emilio Jesús Gallego Arias said:
Also, a few jobs are responsible for 60% of the build time (see the analysis when we discussed removing the async job)
If you have such data available, I'm very interested to look at them.
The math is easy, look at the jobs and sort by time
How do you do that?
The original thread is https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/CI.20for.20base.2Basync.20timing.20out
Thanks! How did you get these figures? I can't seem to figure out how to extract them for a GitLab pipeline.
Total CI time without the heavy ones: 665
Total CI time with (unimath, vst, and fiat-crypto): 1155
I just cut and paste + emacs macro
link
https://hackmd.io/PGrScXsoQVSLckPu3CXvIw
Give me any pipeline and I will produce the table
Maxime Dénès said:
Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?
no tags imo
Emilio Jesús Gallego Arias said:
Give me any pipeline and I will produce the table
So indeed we can optimize a lot; for example, if we use dune, checking compilation of the Coq ML part + all plugins takes just a couple of minutes
maybe 3, 4 max (depends on the runner)
On my machine, 22 secs
the lint job does that and in your table it took 03:37
it also does a bit of other stuff (whitespace checks etc)
It took 3:37 total
Boot time, checkout etc...
I'm talking about the actual command only
There is a lot of variability in boot time too
in runner boot time
But total cpu time is almost 4 minutes
We should also really do something about the test-suite jobs running forever because they start swapping or whatever.
In this reduced pipeline, the base test-suite job was running for over 90 minutes before I killed it: https://gitlab.com/coq/coq/-/pipelines/657639626/builds
Can you run your macros on this pipeline anyway?
I'd like a comparison with the FULL_CI ones.
Should we write a PR to get some logging information about the test-suite taking ages?
Cancelling them after a timeout is a work around, but we should get to the root of the issue.
Some have hypothesized swapping. Could we auto-detect if swapping happens and kill the test-suite immediately in that case?
@Théo Zimmermann
| Status | Job | Stage | Name | Duration |
|--------|-----|-------|------|----------|
| passed | #3122706083 pr-16601 0183e2d4 | build | plugin:plugin-tutorial | 00:05:30 |
| passed | #3122706082 pr-16601 0183e2d4 | build | validate:vio | 00:02:32 |
| passed | #3122706080 pr-16601 0183e2d4 | build | validate:base | 00:04:45 |
| passed | #3122706079 pr-16601 0183e2d4 | build | test-suite:edge:dev | 00:25:58 |
| canceled | #3122706078 pr-16601 0183e2d4 | build | test-suite:base | 01:31:22 |
| passed | #3122706075 pr-16601 0183e2d4 | build | doc:ml-api:odoc | 00:04:11 |
| passed | #3122706070 pr-16601 0183e2d4 | build | doc:stdlib | 00:04:29 |
| passed | #3122706064 pr-16601 0183e2d4 | build | doc:refman-pdf | 00:11:08 |
| passed | #3122706061 pr-16601 0183e2d4 | build | doc:refman | 00:07:42 |
| passed | #3122706058 pr-16601 0183e2d4 | build | pkg:nix | 00:21:46 |
| passed | #3122706055 pr-16601 0183e2d4 | build | lint | 00:05:02 |
| passed | #3122706051 pr-16601 0183e2d4 | build | build:vio | 00:25:44 |
| passed | #3122706049 pr-16601 0183e2d4 | build | build:edge+flambda:dev | 00:18:48 |
| passed | #3122706046 pr-16601 0183e2d4 | build | build:base | 00:17:46 |
oh sorry let me sort it
| Status | Job | Stage | Name | Duration |
|----------+-------------------------------+-------+------------------------+----------|
| canceled | #3122706078 pr-16601 0183e2d4 | build | test-suite:base | 01:31:22 |
| passed | #3122706079 pr-16601 0183e2d4 | build | test-suite:edge:dev | 00:25:58 |
| passed | #3122706051 pr-16601 0183e2d4 | build | build:vio | 00:25:44 |
| passed | #3122706058 pr-16601 0183e2d4 | build | pkg:nix | 00:21:46 |
| passed | #3122706049 pr-16601 0183e2d4 | build | build:edge+flambda:dev | 00:18:48 |
| passed | #3122706046 pr-16601 0183e2d4 | build | build:base | 00:17:46 |
| passed | #3122706064 pr-16601 0183e2d4 | build | doc:refman-pdf | 00:11:08 |
| passed | #3122706061 pr-16601 0183e2d4 | build | doc:refman | 00:07:42 |
| passed | #3122706083 pr-16601 0183e2d4 | build | plugin:plugin-tutorial | 00:05:30 |
| passed | #3122706055 pr-16601 0183e2d4 | build | lint | 00:05:02 |
| passed | #3122706080 pr-16601 0183e2d4 | build | validate:base | 00:04:45 |
| passed | #3122706070 pr-16601 0183e2d4 | build | doc:stdlib | 00:04:29 |
| passed | #3122706075 pr-16601 0183e2d4 | build | doc:ml-api:odoc | 00:04:11 |
| passed | #3122706082 pr-16601 0183e2d4 | build | validate:vio | 00:02:32 |
I'd cut the builds except for the flambda one + the flambda test suite
build:vo and build:base
and the validate jobs
are a waste of time in a minimal CI
and instead I'd add the composed plugin build
This is somewhat making the CI useless to me at least. I can run the test-suite locally faster than what the CI provides.
I’m just finding this out, shall I try to contact the guys at gitlab open source to tell them about our situation then?
Sure, this is worth trying to at least alert them that we would appreciate some additional delay to sort things out.
Pierre-Marie Pédrot said:
This is somewhat making the CI useless to me at least. I can run the test-suite locally faster than what the CI provides.
I'm not sure I follow, PM; of course local computers are faster than the free VMs
Tho gitlab.inria large instances are much faster
@Emilio Jesús Gallego Arias my typical use of the CI is to run stuff that is annoying to run. If we only keep a tiny subset of the CI, then it becomes trivial because I always run the test-suite locally. By contrast, running all the other tests is typically bothersome because you need some setup, a network connection and parallel execution (if you go for the full CI with many devs).
The question is not whether the free VM are slow, the question is the trade-off between the time where my PC is doing something vs. some stuff running overnight on the internet
For a CI I can run in less than 10 minutes on my PC, this is not very clear.
I think we want to get some private runners (or use the gitlab.inria ones if there are enough) then put back most or all the jobs, possibly merging jobs if that exploits runner cpus better (ie coq-universe job instead of a bunch of jobs)
the current job disabling I see as a temporary measure to avoid completely breaking and to keep opam-coq-archive (which did nothing wrong) working
I sent a mail with a plea for a grace period, let's see what they say.
I added 8 runners, you may want to put back some jobs.
I will now initiate a discussion with our services about the capacity and security of gitlab.inria.fr.
Let's not hurry to put back some jobs / let's check how our usage evolves. Since the discussion started less than 24h ago, we have still spent about 8% of our quota.
It is not obvious that we can use it, because the current model only allows sponsored Inria accounts, whereas what we want to do is essentially to mirror arbitrary workloads.
Maxime Dénès said:
It is not obvious that we can use it, because the current model is allowing only sponsored Inria accounts, whereas what we want to do is essentially to mirror arbitrary workloads.
What do you mean?
Aren't we exposing code execution on our runners to anybody?
Anyway let's discuss these aspects privately.
what happened to coq-gitlab-runner2.ci?
ah ah I was waiting for someone to ask
are you sure you want to know?
yes
should we try disabling shared runners, enable full ci and see how big the queue gets?
Sure, that's a good experiment to make!
And it's also easy to do: no PR to merge, just a variable to set.
the CloudStack Inria API doesn't let you destroy resources completely, you have to wait for a 24h cron to do it
I created a runner numbered 2
then deleted it, so I can't reuse the name yet
similar pb for storage
I don't know why they put this Jenkins-related logic in their Cloud backend, it is a very bad separation of concerns
Théo Zimmermann said:
Sure, that's a good experiment to make!
let's go then
@Maxime Dénès how beefy are the vms you added? should we increase the -j we use?
they have 2 vCPUs, 4 GB RAM
already queuing in 1 pr https://gitlab.com/coq/coq/-/pipelines/657973015
it was the closest I could find to the instances used by GitLab's shared runners
current -j2 stays then
the runners don't allow ulimit so all the scripts using it are failing, eg https://gitlab.com/coq/coq/-/jobs/3124630681
also https://gitlab.com/coq/coq/-/jobs/3124630609 (ocamlopt fails in schemeequality.v) is interesting considering how that test kept hanging
If we need them to go through, we could deactivate native compilation, that was the only reason for ulimit IIRC.
Gaëtan Gilbert said:
the runners don't allow ulimit
No, they do allow ulimit, but we have forbidden it. We need to modify .gitlab-ci.yml line 200 and add option -S.
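For illustration only (the actual fix is in the PR linked just below): -S sets a soft limit, which a job may lower for itself even when the runner forbids changing hard limits. Job name, limit value and build command here are placeholders:

```yaml
some-native-compilation-job:     # illustrative job name
  script:
    - ulimit -S -v 10000000      # soft (-S) cap on virtual memory, in kB; value is a placeholder
    - make world                 # placeholder build command
```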
https://github.com/coq/coq/pull/16612
What should be done with ongoing PRs (in particular those ready to merge)? Wait a bit until solutions to the CI troubles are settled?
I am not sure I understand what is going on. Is https://gitlab.com/coq/coq/pipelines/new?var[FULL_CI]=true&ref=pr-XXXXX to be run or not?
@Olivier Laurent at the moment the full CI pipelines are on again by default, but that may not last. If you want to merge a PR, just check that the GitLab pipeline status check has the usual long list of jobs. If yes, it can be merged. If not, it should get a full CI run before.
Maxime Dénès said:
the CloudStack Inria API doesn't let you destroy resources completely, you have to wait for a 24h cron to do it
If there is a need to go faster, you can ask a CI admin to destroy the resource before the cron job (but, of course, it only works if a CI admin is available... I was attending a training course the two last days).
Théo Zimmermann said:
Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average.
After reading the documentation and double-checking on the usage webpage, our CI actually uses 1,600,000 minutes per month, but because we are an open source project, they only count for half of it: "The cost factors for jobs running on shared runners on GitLab.com are 0.5 for public projects in the GitLab for Open Source program." So, this amounts to 37 full-time runners on average.
@Guillaume Melquiond I think you've misunderstood how the usage is reported. There are two tabs: CI/CD minute usage and Shared runner duration. The latter is the real one and the former is how it is counted. Thus, in October we have used 80,000 actual shared runner minutes and they have counted as 40,000. In September, the cost factor was different so we have used 800,000 actual shared runner minutes and they have counted as 32,000.
Our peak was in March: 820,000 actual shared runner minutes.
Right, my mistake. So, we are only using 800,000 minutes, but we still need to divide this usage by 8.
And for the month of October, this is even worse since we have already spent 80% of it.
So we should bring it down to basically zero on the rest of the month (or buy more minutes) if we want the opam-coq-archive CI to remain unaffected.
Disabling the shared runners on the Coq repository (like we did) is a good way of ensuring this.
Currently, there is no queue in Coq CI despite the full CI being on and the shared runners disabled, but that may be related to the ulimit issue (maybe that makes jobs fail and thus stop quicker).
Regarding opam-coq-archive, I am in the process of cutting part of its CI down, because even if it is a few orders of magnitude less than Coq itself, it is still blowing up our quota.
(Almost 20% of the quota after just 5 days.)
We could also consider migrating the opam-coq-archive CI to GitHub Actions. Do you think that it would be a realistic option?
No idea, I don't understand how GitHub Actions works. (And there is still a quota for those, isn't there?)
No, AFAIU there is no limit to how many CI minutes you can use. There is a limit on how many parallel runners you get (4 IIRC). So it is the latter that would make it impossible to use for Coq's CI without additional GitHub sponsorship.
Ah no, it seems those are unlimited, as long as we don't ask for more than 2 cores.
"those" being?
Github Actions. I was correcting myself.
https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits seems to indicate that the actual limit is 20 parallel runners on Linux / Windows and 5 on macOS.
They do not say the scope but I think it's probably the organization.
A good thing about GitHub Actions is that you get a 6-hour timeout per job, so that's better than what we have on GitLab with the shared runners.
Thierry Martinez said:
If there is a need to go faster, you can ask a CI admin to destroy the resource before the cron job (but, of course, it only works if a CI admin is available... I was attending a training course the two last days).
So you confirm that this 24h delay exists? I received inconsistent answers on the matter. Why bake this CI-related logic in the Cloud backend, and not let us destroy resources that we created ourselves immediately?
The whole point of CloudStack-like tools is to not have to open tickets for routine operations :)
That ci.inria.fr has this kind of business logic I understand, but I'm very surprised that it ends up being restrictions on the CloudStack side.
I can confirm that, a few months ago, this delay did exist. (I had to remove a VM because I was over-quota. The VM removal happened instantly, but the associated storage took 24 hours to disappear.)
has coqbot fallen to the same worker depletion or what? it doesn't react on https://github.com/coq/coq/pull/16611
maybe it's on strike
@Théo Zimmermann have you deactivated coqbot to prevent people merging carelessly w.r.t. the non-full CI?
No, I didn't do anything.
I'll have a look :ambulance:
That's a GitHub issue: https://www.githubstatus.com/
Webhooks are not being delivered correctly.
So coqbot never received these messages.
You can insist (some issue comments did trigger the webhook, but I don't know what is the current success rate) or wait for the webhooks to be delivered (they might be with some delay).
Folks, one problem with disabling the shared runners vs tagging the jobs for our workers is that we can't generate docker images anymore
Yes, this was an experiment, but since it seems like we are going to keep this setup for a while, we should do a tag-based system to keep the shared runners available, but for Docker images only.
I'm gonna enable them briefly so my docker image is rebuilt
It looks like the queue of jobs is increasing..
Where do you see the queue?
Here of course https://gitlab.com/coq/coq/-/jobs ...
I guess we will be back a bit to the times when we needed the night for CI to catch up
Once we can estimate our needs, we should be able to order runners on an external Cloud provider if needed.
But I find it also good to try to reduce a bit our usage, because of environmental concerns.
(at least the lowest hanging fruit, like what Théo is implementing in coqbot)
It looks like the queue is still big after the night.
Hopefully, it will catch up now that we are running light CI pipelines by default on PRs.
Btw @Thierry Martinez kindly improved the configuration of Inria's CloudStack, so there's hope we can make quick progress on the way these new runners are orchestrated.
Maxime Dénès said:
Ok, I got something automated enough (snapshots + Ansible playbook). I should be able to add ~8 runners tomorrow.
is your automation public?
That one, no, but I'm almost done with a better one, properly orchestrating ephemeral VMs, being developed here: https://gitlab.inria.fr/mdenes/docker-machine-cloudstack
I will stabilize it, then replace the current runners with this orchestration. Warning: it is currently fully undocumented and moving and breaking often.
The new runners seem to work fine so far. @Gaëtan Gilbert FYI, they should be able to handle the docker boot job as well.
(not tested yet)
It should be fairly easy to add compute from an external cloud provider using the same orchestration, btw.
Oh, and also: these runners are now available to the coq group, not just the coq project. @Théo Zimmermann I think you needed it for another project, but I don't recall exactly.
Not me, the opam-coq-archive maintainers.
Well, technically, they were fine with the shared runners apparently (except for long-running jobs, but that's not new).
@Thierry Martinez Thanks again for your help on this. I will try to document this orchestration and its deployment, in case it is useful to other projects.
About the macos worker, if it is only used for storage occasionally, can I snapshot and delete it? @Théo Zimmermann @Michael Soegtrop
It would probably free resources for two more runners.
:+1: from me if you have the :+1: from Michael
We should probably also think about our usage of the bench machines. Am I right that they are idle most of the time?
We could use them for CI when they are not used for benchmarking.
Absolutely, if you find a way of making sure that bench jobs can "lock" them (and of course they wait until they are fully "locked" before starting running).
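One GitLab-native mechanism that could express this kind of locking is resource_group, which serializes all jobs sharing the same group key. A sketch under the assumption that the bench machines are reachable via a runner tag (tag, group and job names are illustrative):

```yaml
bench:
  tags: [bench-machine]      # illustrative tag pinning jobs to the bench runners
  resource_group: bench-hw   # only one job holding this group runs at a time
  script:
    - ./run-bench.sh         # placeholder

ci-filler-on-bench-machine:
  tags: [bench-machine]
  resource_group: bench-hw   # CI jobs queue here while a bench job holds the lock
  script:
    - make ci-all            # placeholder
```

Note that this serializes every job in the group, so it is only a rough approximation of "bench takes priority"; anything smarter would need runner-side scheduling.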
@Gaëtan Gilbert Am I right that pendulum is unused these days?
(if so, I can add some runners on it to improve a bit our CI situation)
yes (it's set as paused in the runner list currently)
A couple thoughts:
In the end, we finished October having only used 86% of our CI allowance and 5 days into November we have only used 1% (since shared runners are only enabled on opam-coq-archive). Any opposition to enabling shared runners back on the Coq repo as well and monitoring how it goes? (With the light pipelines by default and our own runners enabled, we do not risk CI minutes running out as quickly as in the beginning of October.)
would it be useful?
Well, sometimes pipelines are waiting quite a lot in queue because of limited runner availability. We also still haven't seen a fully green CI for several weeks because of fiat-crypto and VST failing on the new runners (or has this been solved now?). Furthermore, since these shared runners are free, there would not be much point in not using them if they can help.
VST CI should now be passing on master after https://github.com/coq/coq/pull/16795
but fiat-crypto is still using too much memory, as it seems that its custom makefile ignores the NJOBS option
CI has been broken for more than a month, we should really do something about it
there was the idea to use tags for memory-intensive runs, is it hard to implement?
I don't recall exactly how tags work, but probably we could do something like: tag our runners, tell them to only run tagged jobs, tag all our jobs with the same tag, and make fiat-crypto and a few other jobs the only ones without tags, then re-enable the shared runners so that we use them for those.
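A sketch of the job-side half of that idea (tag name illustrative; the runner-side half would be registering our runners with this tag and with "run untagged jobs" disabled):

```yaml
.on-our-runners:
  tags: [coq-inria]          # illustrative tag matched only by our own runners

build:base:
  extends: .on-our-runners
  script:
    - make world             # placeholder

library:ci-fiat_crypto:      # deliberately untagged, so it falls through
  script:                    # to GitLab's shared runners
    - make ci-fiat_crypto    # placeholder
```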
do the shared runners actually have more power? https://github.com/coq/coq/pull/16801#issuecomment-1314584174 sounds like fiat-crypto started using more memory
I think the shared runners actually have less memory than GitHub runners. For example, in the Coq opam archive, we started seeing OOMs more frequently with shared runners
Karl Palmskog said:
I think the shared runners actually have less memory than GitHub runners. For example, in the Coq opam archive, we started seeing OOMs more frequently with shared runners
What do you call "GitHub runners" and what do you call "shared runners" exactly? It looks to me like what you want to say is that you started to see more OOMs with "custom runners" (the ones that were added by Maxime) compared to when we only used "shared runners" (the ones made available by GitLab). If that's correct, then this would match my impression of what has happened on the side of the Coq repo (where shared runners had no trouble with VST and fiat-crypto but the custom runners using the former bench machines did).
right, yes, after Maxime added custom runners, we had more OOMs in the Coq opam archive. I think the old ones had at least 7 GB of memory, now it might be something like 4 GB.
4GB is indeed very tight for building Coq stuff and likely not very efficient because the machine will likely swap (with -j 2).
https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html says "1 vCPU, 3.75GB RAM" (which is what I've seen). Medium machines are available for free projects (enable with tags: saas-linux-medium-amd64), are twice as big, and consume twice as many minutes (https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html#additional-costs-on-gitlab-saas)
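That is, roughly this on a memory-hungry job (job name and command illustrative), keeping in mind the 2× minute multiplier from the linked page:

```yaml
library:ci-fiat_crypto:              # illustrative job name
  tags: [saas-linux-medium-amd64]    # request GitLab's medium SaaS Linux runner (about twice the small one)
  script:
    - make ci-fiat_crypto            # placeholder command
```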
Do we know how much total RAM and CPU would be sufficient to run Coq CI reasonably efficiently?
my experience is that 4 cores and something like 10-12 GB RAM works well
but not sure this would be even enough that outliers like Fiat Crypto work as expected
I more meant how much total GB and total CPU we need to run Coq CI efficiently.
Fiat Crypto currently peaks at a bit over 4GB per thread/process. Last I checked, our heaviest (> 2GB) files memory-wise are
1m02.21s | 4379108 ko | Bedrock/End2End/RupicolaCrypto/Derive.vo
3m06.05s | 2930324 ko | rupicola/bedrock2/compiler/src/compiler/FlatToRiscvFunctions.vo
2m02.81s | 2751688 ko | rupicola/bedrock2/compiler/src/compiler/FlatToRiscvMetric.vo
5m30.49s | 2649632 ko | Bedrock/Field/Synthesis/Examples/p224_64_new.vo
10m08.34s | 2582048 ko | Bedrock/End2End/X25519/GarageDoor.vo
4m17.69s | 2500784 ko | Assembly/WithBedrock/Proofs.vo
1m49.62s | 2252340 ko | Fancy/Barrett256.vo
0m34.38s | 2216436 ko | ExtractionOCaml/bedrock2_solinas_reduction.ml
0m33.70s | 2199600 ko | ExtractionOCaml/with_bedrock2_word_by_word_montgomery.ml
0m32.69s | 2199516 ko | ExtractionOCaml/bedrock2_word_by_word_montgomery.ml
0m39.68s | 2147996 ko | ExtractionOCaml/word_by_word_montgomery
0m44.54s | 2147880 ko | ExtractionOCaml/bedrock2_solinas_reduction
0m43.08s | 2147548 ko | ExtractionOCaml/solinas_reduction
1m56.73s | 2140160 ko | fiat-rust/src/p384_scalar_32.rs
1m56.65s | 2139664 ko | fiat-java/src/FiatP384Scalar.java
0m32.10s | 2130396 ko | ExtractionOCaml/solinas_reduction.ml
0m16.34s | 2106988 ko | ExtractionHaskell/bedrock2_solinas_reduction.hs
0m16.60s | 2106956 ko | ExtractionHaskell/with_bedrock2_solinas_reduction.hs
0m31.79s | 2095372 ko | ExtractionOCaml/word_by_word_montgomery.ml
0m29.60s | 2092128 ko | ExtractionOCaml/with_bedrock2_unsaturated_solinas.ml
0m30.75s | 2091876 ko | ExtractionOCaml/bedrock2_unsaturated_solinas.ml
1m56.18s | 2088112 ko | fiat-c/src/p384_scalar_32.c
0m42.31s | 2088052 ko | ExtractionOCaml/bedrock2_word_by_word_montgomery
0m41.92s | 2084496 ko | ExtractionOCaml/with_bedrock2_word_by_word_montgomery
4m29.40s | 2077784 ko | Rewriter/Passes/ArithWithCasts.vo
7m09.75s | 2070516 ko | Curves/Weierstrass/AffineProofs.vo
1m56.19s | 2056320 ko | fiat-go/32/p384/p384.go
1m30.01s | 2053492 ko | SlowPrimeSynthesisExamples.vo
If we're careful enough, 2GB-3GB per coq process is enough for most fiat-crypto files. I think much of fiat-crypto can usefully use 6-8 cores in parallel (and I've gone through some pains to achieve good file-level parallelism), though we are sometimes blocked on single slow files (like the 10m GarageDoor file that depends on almost everything else in the repo). I think with 14 GB RAM fiat-crypto could safely be built with -j4 -async-proofs-j 1, which would be pretty good. I think 8 GB RAM might just barely be enough for -j2 -async-proofs-j 1.
Karl Palmskog said:
right, yes, after Maxime added custom runners, we had more OOMs in the Coq opam archive. I think the old ones had at least 7 GB of memory, now it might be something like 4 GB.
Which tags were you using?
AFAICT, the default gitlab shared runner has 3.75 Gb RAM (https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html)
Oh sorry, I see that @Paolo Giarrusso pointed to the same link
So I'm a bit lost: did I fail to reproduce the same setup as before?
Indeed, for some reason the setup is not the same as before AFAIU, but I don't know why.
@Maxime Dénès @Gaëtan Gilbert Would it be fine to re-enable shared runners for the Coq repository to boost the PR processing cycle during the release process? I can monitor closely our usage and we can decide upfront a limit that we do not want to go over (e.g., 30%).
do as you like, just tell us when you change something
we were fine wasting many percent last month so 30% is pretty conservative imo
Done.
So we are indeed burning CI minutes at a fast rate with shared runners on. We jumped from about 6% use this morning to 26%.
We've already reached 34% use. I've disabled the shared runners. Apparently, this doesn't stop the current running jobs (which is rather good in this case).
I'm adding more runners
what's our current strategy for jobs that need privileged docker? https://gitlab.com/coq/coq/-/jobs/3433179724 is stuck
The strategy I suggested was to save our shared runner minutes for them
ok I'm enabling shared runners until they pick up the job then disabling them
it got picked up fast wow
Maxime Dénès said:
I'm adding more runners
where are they coming from btw?
Inria's CloudStack
We got an extra allowance?
?
Ah no, these were the 8 stopped runners, right?
Does it mean that coq-docker-machine-driver should be removed?
Also, it would be really good to get documentation of this infrastructure since it looks like it's going to remain like this indefinitely. This would be helpful so that others than you, Maxime, can act when there are maintenance issues.
Théo Zimmermann said:
Does it mean that coq-docker-machine-driver should be removed?
Yes.
@Maxime Dénès A job failed with "no space left on device". I think that's on one of the new runners: https://gitlab.com/coq/coq/-/jobs/3451330182