They have updated the doc, so it looks like a change on purpose.
But sure, if there were no announcements, we should probably contact them to say that we need time to adjust.
I personally haven't researched whether there was any announcement or Twitter discussion about this.
We would only max out a 64-core server during special weeks; the rest of the time that'd be fine
Pierre-Marie Pédrot said:
Honestly I don't see what could go wrong with deactivating everything. Thanks to the magic of VCS, worst case we can revert or tweak the result.
Not sure I understand what you propose here. Deactivating everything means we don't have CI, so who's gonna do the testing / revert when a PR lands and breaks everything?
If I understand correctly, as the PR has been merged, we should block PR merges too?
There is a "talk to an expert" button on https://about.gitlab.com/sales/. Can somebody ask them what is going on?
something like this: (I won't send it)
Hi, you seem to have changed how the minutes for the runners are being calculated. The doc also reflects this change.
I find this somewhat absurd, as there was no announcement of said change and it is heavily affecting our development CI pipeline.
Since the Coq organization is paying for the runner services, it seems somewhat predatory that the calculation of minutes can be changed like this.
Please clarify the situation.
- Coq Development team
we're not paying afaik
Like @Guillaume Melquiond, I'm not confident at all that Inria's Gitlab CI will handle the load, especially if several large projects migrate simultaneously
What I would do, is in parallel:
I can initiate both.
By the way, I am noticing that we are currently using 4TB (!) of storage on gitlab.com. So, runners might not be our only issue.
I think we can sequence the operations:
I agree it's better to not rush 2 too much. Doing a bit of temporary work for 1 does not seem problematic to me.
To put things into perspective, in three weeks (if I understand correctly), the cost will be $6 per GB, so let us be ready for a massive bill.
Yes, we need to address it. Still, TV news are not mentioning that we have CI issues, so I guess we can take the time to do things in order :)
@Guillaume Melquiond Do you know what this space is used for? Do we store artifacts for all builds?
And don't we have a retention policy?
It seems we are using 3894.7 GiB for job artifacts.
https://gitlab.com/groups/coq/-/usage_quotas
build:base is 300MB, then we have all the other stuff so multiple GB / PR I guess
Let's create a channel for the communication actions that we take on this CI issue. We will use this one to discuss the situation and what should be done.
Was there some communication done to coq devs about the process they should currently follow to merge PRs?
also artifacts for the latest commit on a branch are never deleted
with our coqbot system that means artifacts for open PRs
so for instance this 11-month-old job still has 270MB of artifacts, and then there's the other jobs in the pipeline https://gitlab.com/coq/coq/-/jobs/1739082131
yeah that's what I meant by retention policy
that's actually a setting https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html#keep-artifacts-from-most-recent-successful-jobs so maybe we should unset it
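For context, the per-job side of this is the expire_in keyword; a minimal sketch (hypothetical job and paths, not the actual coq/coq configuration) of how the two interact:

```yaml
# Hypothetical fragment, not the real coq/coq configuration.
# expire_in bounds how long job artifacts are stored; while the project-level
# "Keep artifacts from most recent successful jobs" toggle is enabled, the
# latest successful artifacts of each ref are exempt from this expiry.
some-build-job:
  script:
    - make ci              # placeholder command
  artifacts:
    paths:
      - _build_artifacts/  # placeholder path
    expire_in: 1 week
```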
Should I refrain from pushing to PRs for the time being?
@Gaëtan Gilbert Should @Ali Caglayan rebase his PRs to get only the reduced CI or not?
The PR I am thinking about already has a lot of the jobs removed, but I am unsure how severe the situation is today. I don't want to use up resources that could be better used for others.
Maxime Dénès said:
Gaëtan Gilbert Should Ali Caglayan rebase his PRs to get only the reduced CI or not?
no need to rebase, coqbot auto merges
Ali Caglayan said:
Should I refrain from pushing to PRs for the time being?
push on, it will help tell us if we reduced enough
Looks reduced https://gitlab.com/coq/coq/-/pipelines/657296145
I came a bit late to this thread. I guess Coq Platform is not affected since I am using only GitHub actions?
I guess so, this seems to be a gitlab.com policy change.
That's very bad news :-(
BTW regarding the docker-coq infrastructure in coq-community: the docker-keeper tooling specifically relies on GitLab CI.
Migrating to GitHub Actions is just not feasible (e.g. YAML anchors are not supported by GHA, nor is the generation of child pipelines from a dynamic YAML…)
Fortunately, the auto-build/deployment of coqorg/* images should not be as costly in CI/CD minutes as PR pipelines.
But I cannot check the quotas myself in https://gitlab.com/groups/coq-community/-/usage_quotas#pipelines-quota-tab as only @Théo Zimmermann can → you may want to take a look at the row for coq-community/docker-coq, in order to extrapolate
Here is the GitLab issue with the changes: https://gitlab.com/gitlab-org/gitlab/-/issues/337245
I googled a bit more and, according to this blog article https://about.gitlab.com/blog/2021/11/11/public-project-minute-limits/ that I had not spotted before, this drastic change is intended to fight cryptomining via CI pipelines.
But it's a pity that FOSS projects like Coq, which consume much more pipeline time than the average project, are badly impacted by these changes… Maybe we could reach them at the email opensource@gitlab.com mentioned in https://about.gitlab.com/handbook/marketing/community-relations/opensource-program/ to just let them know (?)
my 2 cents…
Finally FYI, it seems that there's another thing to double-check now:
the Namespace Storage Quota: https://gitlab.com/groups/math-comp/-/usage_quotas#storage-quota-tab
as according to https://docs.gitlab.com/ee/user/usage_quotas.html#namespace-storage-limit-enforcement-schedule
Storage limits for GitLab SaaS Free tier namespaces will not be enforced prior to 2022-10-19. Storage limits for GitLab SaaS Paid tier namespaces will not be enforced prior to 2023-02-15.
IINM, coq and coq-community are not impacted for now, unlike math-comp, which is still in the Free Tier…
(see e.g. this topic)
I don't quite see how this change keeps people from crypto mining in CI if forks remain at 0.008 - who prevents people from forking their own projects? I guess one could find more intelligent methods for preventing such abuse.
Maybe it is possible to talk to them and get an exception.
Ok, I got something automated enough (snapshots + Ansible playbook). I should be able to add ~8 runners tomorrow.
Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?
Michael Soegtrop said:
Maybe it is possible to talk to them and get an exception.
I very much doubt so. Our quota of 50'000 minutes corresponds to their "ultimate" offer, which normally costs $99 per user and which we are currently getting for free. There is not even a higher tier for paying customers. So, I don't see how they could provide an even higher tier for non-paying customers. Also keep in mind that their "free" offer is just 400 minutes, so we are currently getting 125x that. In other words, we are very lucky and we should not push our luck.
Maxime Dénès said:
Was there some communication done to coq devs about the process they should currently follow to merge PRs?
Communication happened by this message: https://github.com/coq/coq/pull/16600#issuecomment-1265773392. I know it's not ideal, but any dev with push right should have got it.
Gaëtan Gilbert said:
that's actually a setting https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html#keep-artifacts-from-most-recent-successful-jobs so maybe we should unset it
Yes, we should definitely use it. Did something change recently? It didn't use to be the case that old PRs had their artifacts kept.
Erik Martin-Dorel said:
Fortunately, the auto-build/deployment of coqorg/* images should not be as costly in CI/CD minutes as PR pipelines.
But I cannot check the quotas myself in https://gitlab.com/groups/coq-community/-/usage_quotas#pipelines-quota-tab as only Théo Zimmermann can → you may want to take a look at the row for coq-community/docker-coq, in order to extrapolate
The entire coq-community organization's record CI consumption was 18,000 minutes, in February 2022. We are allowed 100,000. So no, Docker-Coq will not be affected. The case of the Coq project is really special since we are spending so much.
Regarding the storage use, it is 6.4 GiB for coq-community including 3.41 GiB for Docker-Coq artifacts and 2.8 GiB for the coq-performance-tests repository (we could stop mirroring it there if needed).
Guillaume Melquiond said:
Michael Soegtrop said:
Maybe it is possible to talk to them and get an exception.
I very much doubt so. Our quota of 50'000 minutes corresponds to their "ultimate" offer, which normally costs $99 per user and which we are currently getting for free. There is not even a higher tier for paying customers. So, I don't see how they could provide an even higher tier for non-paying customers. Also keep in mind that their "free" offer is just 400 minutes, so we are currently getting 125x that. In other words, we are very lucky and we should not push our luck.
The only thing that might be reasonable to ask is for a delay in order to resize our CI infrastructure.
Emilio Jesús Gallego Arias said:
We would only max out a 64-core server during special weeks; the rest of the time that'd be fine
Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average. Given our memory consumption, it's not clear that 18.5 runners would fit on a single 64-core server (it would need lots of RAM) and this is not even the usage peak, this is the average over a full month, days and nights.
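For reference, a worked form of that estimate (numbers taken from the message above):

```latex
\frac{800{,}000\ \text{min}}{60 \times 24\ \text{min/day}} \approx 555\ \text{runner-days},
\qquad
\frac{555\ \text{runner-days}}{30\ \text{days}} \approx 18.5\ \text{runners busy around the clock}
```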
Maxime Dénès said:
Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?
Why should we disable shared runners in case you add VMs without tags? BTW, 8 runners won't be sufficient to entirely replace our shared runner usage if we go back to our full CI according to my calculation above.
Ok, so we'd need a combination of reduced CI + self hosted runners, right?
(and studying a migration to gitlab.inria.fr in parallel, that is talking with the guys about the capacity of their service)
Yes, I think this is a good summary.
Théo Zimmermann said:
Emilio Jesús Gallego Arias said:
We would only max out a 64-core server during special weeks; the rest of the time that'd be fine
Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average. Given our memory consumption, it's not clear that 18.5 runners would fit on a single 64-core server (it would need lots of RAM) and this is not even the usage peak, this is the average over a full month, days and nights.
Inria runners are considerably faster, so the amount of minutes would be less.
Also, a few jobs are responsible for 60% of the build time (see the analysis from when we discussed removing the async job)
Coq Universe provides a good reference IMHO for how much a full CI costs: around 1h on an 8-core machine with 32GiB of RAM
That's without caching
in a Github actions runner, the vo build will take like 2.5 hours
with just 2 cores
with caching it varies
Emilio Jesús Gallego Arias said:
Also, a few jobs are responsible for 60% of the build time (see the analysis from when we discussed removing the async job)
If you have such data available, I'm very interested to look at them.
The math is easy, look at the jobs and sort by time
How do you do that?
The original thread is https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/CI.20for.20base.2Basync.20timing.20out
Thanks! How did you get these figures? I can't seem to figure out how to extract them for a GitLab pipeline.
Total CI time without the heavy ones: 665
Total CI time with (unimath, vst, and fiat-crypto): 1155
I just cut and paste + emacs macro
link
https://hackmd.io/PGrScXsoQVSLckPu3CXvIw
Give me any pipeline and I will produce the table
Maxime Dénès said:
Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?
no tags imo
Emilio Jesús Gallego Arias said:
Give me any pipeline and I will produce the table
So indeed we can optimize a lot; for example, if we use dune, checking compilation of the Coq ML part + all plugins takes just a couple of minutes
maybe 3, 4 max (depends on the runner)
On my machine, 22 secs
the lint job does that and in your table it took 03:37
it also does a bit of other stuff (whitespace checks etc)
It took 3:37 total
Boot time, checkout etc...
I'm talking about the actual command only
There is a lot of variability in boot time too
in runner boot time
But total cpu time is almost 4 minutes
We should also really do something about the test-suite jobs running forever because they start swapping or whatever.
In this reduced pipeline, the base test-suite job was running for over 90 minutes before I killed it: https://gitlab.com/coq/coq/-/pipelines/657639626/builds
Can you run your macros on this pipeline anyway?
I'd like a comparison with the FULL_CI ones.
Should we write a PR to get some logging information about the test-suite taking ages?
Cancelling them after a timeout is a workaround, but we should get to the root of the issue.
Some have hypothesized swapping. Could we auto-detect if swapping happens and kill the test-suite immediately in that case?
@Théo Zimmermann
| Status | Job | Stage | Name | Duration |
|--------|-----|-------|------|----------|
| passed | #3122706083 pr-16601 0183e2d4 | build | plugin:plugin-tutorial | 00:05:30 |
| passed | #3122706082 pr-16601 0183e2d4 | build | validate:vio | 00:02:32 |
| passed | #3122706080 pr-16601 0183e2d4 | build | validate:base | 00:04:45 |
| passed | #3122706079 pr-16601 0183e2d4 | build | test-suite:edge:dev | 00:25:58 |
| canceled | #3122706078 pr-16601 0183e2d4 | build | test-suite:base | 01:31:22 |
| passed | #3122706075 pr-16601 0183e2d4 | build | doc:ml-api:odoc | 00:04:11 |
| passed | #3122706070 pr-16601 0183e2d4 | build | doc:stdlib | 00:04:29 |
| passed | #3122706064 pr-16601 0183e2d4 | build | doc:refman-pdf | 00:11:08 |
| passed | #3122706061 pr-16601 0183e2d4 | build | doc:refman | 00:07:42 |
| passed | #3122706058 pr-16601 0183e2d4 | build | pkg:nix | 00:21:46 |
| passed | #3122706055 pr-16601 0183e2d4 | build | lint | 00:05:02 |
| passed | #3122706051 pr-16601 0183e2d4 | build | build:vio | 00:25:44 |
| passed | #3122706049 pr-16601 0183e2d4 | build | build:edge+flambda:dev | 00:18:48 |
| passed | #3122706046 pr-16601 0183e2d4 | build | build:base | 00:17:46 |
oh sorry let me sort it
| Status | Job | Stage | Name | Duration |
|----------+-------------------------------+-------+------------------------+----------|
| canceled | #3122706078 pr-16601 0183e2d4 | build | test-suite:base | 01:31:22 |
| passed | #3122706079 pr-16601 0183e2d4 | build | test-suite:edge:dev | 00:25:58 |
| passed | #3122706051 pr-16601 0183e2d4 | build | build:vio | 00:25:44 |
| passed | #3122706058 pr-16601 0183e2d4 | build | pkg:nix | 00:21:46 |
| passed | #3122706049 pr-16601 0183e2d4 | build | build:edge+flambda:dev | 00:18:48 |
| passed | #3122706046 pr-16601 0183e2d4 | build | build:base | 00:17:46 |
| passed | #3122706064 pr-16601 0183e2d4 | build | doc:refman-pdf | 00:11:08 |
| passed | #3122706061 pr-16601 0183e2d4 | build | doc:refman | 00:07:42 |
| passed | #3122706083 pr-16601 0183e2d4 | build | plugin:plugin-tutorial | 00:05:30 |
| passed | #3122706055 pr-16601 0183e2d4 | build | lint | 00:05:02 |
| passed | #3122706080 pr-16601 0183e2d4 | build | validate:base | 00:04:45 |
| passed | #3122706070 pr-16601 0183e2d4 | build | doc:stdlib | 00:04:29 |
| passed | #3122706075 pr-16601 0183e2d4 | build | doc:ml-api:odoc | 00:04:11 |
| passed | #3122706082 pr-16601 0183e2d4 | build | validate:vio | 00:02:32 |
I'd cut the builds except for flambda + the flambda test suite
build:vo and build:base
and the validate jobs
are a waste of time in a minimal CI
and instead I'd add the composed plugin build
This is somewhat making the CI useless to me at least. I can run the test-suite locally faster than what the CI provides.
I’m just finding this out, shall I try to contact the guys at gitlab open source to tell them about our situation then?
Sure, this is worth trying to at least alert them that we would appreciate some additional delay to sort things out.
Pierre-Marie Pédrot said:
This is somewhat making the CI useless to me at least. I can run the test-suite locally faster than what the CI provides.
I'm not sure I follow, PM; of course local computers are faster than the free VMs
Tho gitlab.inria large instances are much faster
@Emilio Jesús Gallego Arias my typical use of the CI is to run stuff that is annoying to run. If we only keep a tiny subset of the CI, then it becomes trivial because I always run the test-suite locally. By contrast, running all the other tests is typically bothersome because you need some setup, a network connection and parallel execution (if you go for the full CI with many devs).
The question is not whether the free VM are slow, the question is the trade-off between the time where my PC is doing something vs. some stuff running overnight on the internet
For a CI I can run in less than 10 minutes on my PC, this is not very clear.
I think we want to get some private runners (or use the gitlab.inria ones if there are enough) then put back most or all the jobs, possibly merging jobs if that exploits runner cpus better (ie coq-universe job instead of a bunch of jobs)
the current job disabling I see as a temporary measure to avoid completely breaking and to keep opam-coq-archive who did nothing wrong working
I sent a mail with a plea for a grace period, let's see what they say.
also I'm unchecking the "Keep artifacts from most recent successful jobs" setting, I don't think it's especially useful to have for us
We should keep them for the bench though.
I added 8 runners, you may want to put back some jobs.
I will now initiate a discussion with our services about the capacity and security of gitlab.inria.fr.
Let's not hurry to put back jobs / let's check how our usage evolves. In the less than 24h since the discussion started, we have already spent about 8% of our quota.
It is not obvious that we can use it, because the current model is allowing only sponsored Inria accounts, whereas what we want to do is essentially to mirror arbitrary workloads.
Pierre-Marie Pédrot said:
We should keep them for the bench though.
why?
Maxime Dénès said:
It is not obvious that we can use it, because the current model is allowing only sponsored Inria accounts, whereas what we want to do is essentially to mirror arbitrary workloads.
What do you mean?
Aren't we exposing code execution on our runners to anybody?
Anyway let's discuss these aspects privately.
what happened to coq-gitlab-runner2.ci?
ah ah I was waiting for someone to ask
are you sure you want to know?
yes
should we try disabling shared runners, enabling the full CI, and seeing how big the queue gets?
Sure, that's a good experiment to make!
And it's also easy to do: no PR to merge, just a variable to set.
the CloudStack Inria API doesn't let you destroy resources completely, you have to wait for a 24h cron to do it
I created a runner numbered 2
then deleted it, so I can't reuse the name yet
similar pb for storage
@Gaëtan Gilbert I want the bench artifacts to have a more precise look at the timings. There are HTML files to display visually the results, and furthermore I've got a bunch of scripts to parse this and extract relevant data.
I don't know why they put this Jenkins-related logic in their Cloud backend, it is a very bad separation of concerns
Théo Zimmermann said:
Sure, that's a good experiment to make!
let's go then
Pierre-Marie Pédrot said:
Gaëtan Gilbert I want the bench artifacts to have a more precise look at the timings. There are HTML files to display visually the results, and furthermore I've got a bunch of scripts to parse this and extract relevant data.
Gaëtan is not removing artifacts for all branches, he's just disabling artifacts for stale branches AFAIU.
I would say enabling normal expiration times rather than disabling but I guess not a lot of difference
@Maxime Dénès how beefy are the vms you added? should we increase the -j we use?
they have 2 vCPUs, 4 GB RAM
already queuing in 1 pr https://gitlab.com/coq/coq/-/pipelines/657973015
it was the closest I could find to the instances used by gitlab's shared runners
current -j2 stays then
the runners don't allow ulimit so all the scripts using it are failing, eg https://gitlab.com/coq/coq/-/jobs/3124630681
also https://gitlab.com/coq/coq/-/jobs/3124630609 (ocamlopt fails in schemeequality.v) is interesting considering how that test kept hanging
If we need them to go through, we could deactivate native compilation, that was the only reason for ulimit IIRC.
Gaëtan Gilbert said:
the runners don't allow ulimit
No, they do allow ulimit, but we have forbidden it. We need to modify .gitlab-ci.yml line 200 and add option -S.
https://github.com/coq/coq/pull/16612
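For readers following along, a hedged sketch of the kind of change involved; the real line in .gitlab-ci.yml and the actual limit value differ (65536 is a placeholder):

```yaml
before_script:
  # -S changes only the soft limit, which an unprivileged job shell may set;
  # without it, ulimit also tries to adjust the hard limit, which can fail
  # on runners that forbid it.
  - ulimit -S -s 65536   # placeholder value, not the real one from line 200
```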
What should be done with ongoing PRs (in particular those ready to merge)? Wait a bit until solutions to the CI troubles are settled?
I am not sure I understand what is going on. Is https://gitlab.com/coq/coq/pipelines/new?var[FULL_CI]=true&ref=pr-XXXXX to be run or not?
@Olivier Laurent at the moment the full CI pipelines are on again by default, but that may not last. If you want to merge a PR, just check that the GitLab pipeline status check has the usual long list of jobs. If yes, it can be merged. If not, it should get a full CI run before.
Maxime Dénès said:
the CloudStack Inria API doesn't let you destroy resources completely, you have to wait for a 24h cron to do it
If there is a need to go faster, you can ask a CI admin to destroy the resource before the cron job (but, of course, it only works if a CI admin is available... I was attending a training course the last two days).
Théo Zimmermann said:
Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average.
After reading the documentation and double-checking on the usage webpage, our CI actually uses 1,600,000 minutes per month, but because we are an open source project, they only count for half of it: "The cost factors for jobs running on shared runners on GitLab.com are 0.5 for public projects in the GitLab for Open Source program." So, this amounts to 37 full-time runners on average.
@Guillaume Melquiond I think you've misunderstood how the usage is reported. There are two tabs: CI/CD minute usage and Shared runner duration. The latter is the real one and the former is how it is counted. Thus, in October we have used 80,000 actual shared runner minutes and they were counted as 40,000. In September, the cost factor was different, so we used 800,000 actual shared runner minutes and they were counted as 32,000.
Our peak was in March: 820,000 actual shared runner minutes.
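In other words, applying the cost factors to those figures (the September factor is inferred from the two numbers):

```latex
\text{October: } 80{,}000 \times 0.5 = 40{,}000 \text{ counted minutes}
\qquad
\text{September: } 800{,}000 \times 0.04 = 32{,}000 \text{ counted minutes}
```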
Right, my mistake. So, we are only using 800,000 minutes, but we still need to divide this usage by 8 (800,000 × 0.5 = 400,000 counted minutes vs. our 50,000 quota).
And for the month of October, this is even worse since we have already spent 80% of it.
So we should bring it down to basically zero for the rest of the month (or buy more minutes) if we want the opam-coq-archive CI to remain unaffected.
Disabling the shared runners on the Coq repository (like we did) is a good way of ensuring this.
Currently, there is no queue in the Coq CI despite the full CI being on and the shared runners disabled, but that may be related to the ulimit issue (maybe that makes jobs fail and thus stop quicker).
Regarding opam-coq-archive, I am in the process of cutting part of its CI down, because even if it is a few orders of magnitude less than Coq itself, it is still blowing up our quota.
(Almost 20% of the quota after just 5 days.)
We could also consider migrating the opam-coq-archive CI to GitHub Actions. Do you think that it would be a realistic option?
No idea, I don't understand how GitHub Actions work. (And there is still a quota for those, isn't there?)
No, AFAIU there is no limit to how many CI minutes you can use. There is a limit to how many parallel runners you get (4 IIRC). So it is the latter that would make it impossible to use for Coq's CI without additional GitHub sponsorship.
Ah no, it seems those are unlimited, as long as we don't ask for more than 2 cores.
"those" being?
Github Actions. I was correcting myself.
https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits seems to indicate that the actual limit is 20 parallel runners on Linux / Windows and 5 on macOS.
They do not say the scope but I think it's probably the organization.
A good thing about GitHub Actions is that you get a 6-hour timeout per job, so that's better than what we have on GitLab with the shared runners.
Thierry Martinez said:
If there is a need to go faster, you can ask a CI admin to destroy the resource before the cron job (but, of course, it only works if a CI admin is available... I was attending a training course the last two days).
So you confirm that this 24h delay exists? I received inconsistent answers on the matter. Why bake this CI-related logic in the Cloud backend, and not let us destroy resources that we created ourselves immediately?
The whole point of CloudStack-like tools is to not have to open tickets for routine operations :)
That ci.inria.fr has this kind of business logic I understand, but I'm very surprised that it ends up as restrictions on the CloudStack side.
I can confirm that, a few months ago, this delay did exist. (I had to remove a VM because I was over-quota. The VM removal happened instantly, but the associated storage took 24 hours to disappear.)
has coqbot fallen victim to the same worker depletion or what? it doesn't react on https://github.com/coq/coq/pull/16611
maybe it's on strike
@Théo Zimmermann have you deactivated coqbot to prevent people merging carelessly w.r.t. the non-full CI?
No, I didn't do anything.
I'll have a look :ambulance:
That's a GitHub issue: https://www.githubstatus.com/
Webhooks are not being delivered correctly.
So coqbot never received these messages.
You can insist (some issue comments did trigger the webhook, but I don't know what the current success rate is) or wait for the webhooks to be delivered (they might be, with some delay).
Folks, one problem with disabling the shared runners vs tagging the jobs for our workers is that we can't generate docker images anymore
Yes, this was an experiment, but since it seems like we are going to keep this setup for a while, we should do a tag-based system to keep the shared runners available, but for Docker images only.
I'm gonna enable them briefly so my docker image is rebuilt
It looks like the queue of jobs is increasing..
Where do you see the queue?
Here of course https://gitlab.com/coq/coq/-/jobs ...
I guess we will be back a bit to the times when we needed the night for the CI to catch up
Once we can estimate our needs, we should be able to order runners on an external Cloud provider if needed.
But I also find it good to try to reduce our usage a bit, because of environmental concerns.
(at least the lowest hanging fruit, like what Théo is implementing in coqbot)
It looks like the queue is still big after the night.
Hopefully, it will catch up now that we are running light CI pipelines by default on PRs.
Btw @Thierry Martinez kindly improved the configuration of Inria's CloudStack, so there's hope we can make quick progress on the way these new runners are orchestrated.
Maxime Dénès said:
Ok, I got something automated enough (snapshots + Ansible playbook). I should be able to add ~8 runners tomorrow.
is your automation public?
That one, no, but I'm almost done with a better one, properly orchestrating ephemeral VMs, which is being developed here: https://gitlab.inria.fr/mdenes/docker-machine-cloudstack
I will stabilize it, then replace the current runners with this orchestration. Warning: it is currently fully undocumented and moving and breaking often.
The new runners seem to work fine so far. @Gaëtan Gilbert FYI, they should be able to handle the docker boot job as well.
(not tested yet)
It should be fairly easy to add compute from an external cloud provider using the same orchestration, btw.
Oh, and also: these runners are now available to the coq group, not just the coq project. @Théo Zimmermann I think you needed it for another project, but I don't recall exactly.
Not me, the opam-coq-archive maintainers.
Well, technically, they were fine with the shared runners apparently (except for long-running jobs, but that's not new).
@Thierry Martinez Thanks again for your help on this. I will try to document this orchestration and its deployment, in case it is useful to other projects.
About the macOS worker: if it is only used occasionally for storage, can I snapshot and delete it? @Théo Zimmermann @Michael Soegtrop
It would probably free resources for two more runners.
:+1: from me if you have the :+1: from Michael
We should probably also think about our usage of the bench machines. Am I right that they are idle most of the time?
We could use them for CI when they are not used for benchmarking.
Absolutely, if you find a way of making sure that bench jobs can "lock" them (and of course they wait until they are fully "locked" before they start running).
@Gaëtan Gilbert Am I right that pendulum is unused these days?
(if so, I can add some runners on it to improve a bit our CI situation)
yes (it's set as paused in the runner list currently)
A couple of thoughts:
In the end, we finished October having only used 86% of our CI allowance and 5 days into November we have only used 1% (since shared runners are only enabled on opam-coq-archive). Any opposition to enabling shared runners back on the Coq repo as well and monitoring how it goes? (With the light pipelines by default and our own runners enabled, we do not risk CI minutes running out as quickly as in the beginning of October.)
would it be useful?
Well, sometimes pipelines are waiting quite a lot in queue because of limited runner availability. We also still haven't seen a fully green CI for several weeks because of fiat-crypto and VST failing on the new runners (or has this been solved now?). Furthermore, since these shared runners are free, there would not be much point in not using them if they can help.
VST CI should now be passing on master after https://github.com/coq/coq/pull/16795
but fiat-crypto is still using too much memory, as it seems that its custom makefile ignores the NJOBS option
CI has been broken for more than a month, we should really do something about it
there was the idea to use tags for memory-intensive runs, is it hard to implement?
I don't recall exactly how tags work, but probably we could do something like this: tag our runners, tell them to only run tagged jobs, tag all our jobs with the same tag, make fiat-crypto and a few other jobs the only ones without tags, and re-enable the shared runners so that we use them for these.
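Concretely, that could look roughly like the following fragment (job and tag names are hypothetical, not the actual coq/coq configuration):

```yaml
.private-runner:
  tags:
    - coq-private        # hypothetical tag; our runners would be set to run tagged jobs only

build:base:
  extends: .private-runner
  script:
    - make ci            # placeholder command

ci-fiat_crypto:
  # deliberately left untagged so the re-enabled shared runners pick it up
  script:
    - make ci            # placeholder command
```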
do the shared runners actually have more power? https://github.com/coq/coq/pull/16801#issuecomment-1314584174 sounds like fiat-crypto started using more memory
I think the shared runners actually have less memory than GitHub runners. For example, in the Coq opam archive, we started seeing OOMs more frequently with shared runners
Karl Palmskog said:
I think the shared runners actually have less memory than GitHub runners. For example, in the Coq opam archive, we started seeing OOMs more frequently with shared runners
What do you call "GitHub runners" and what do you call "shared runners" exactly? It looks to me like what you want to say is that you started to see more OOMs with "custom runners" (the ones that were added by Maxime) compared to when we only used "shared runners" (the ones made available by GitLab). If that's correct, then this would match my impression of what has happened on the side of the Coq repo (where shared runners had no trouble with VST and fiat-crypto but the custom runners using the former bench machines did).
right, yes, after Maxime added custom runners, we had more OOMs in the Coq opam archive. I think the old ones had at least 7 GB of memory, now it might be something like 4 GB.
4GB is indeed very tight for building Coq stuff and likely not very efficient because the machine will likely swap (with -j 2).
https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html says "1 vCPU, 3.75GB RAM" (which is what I've seen). Medium machines are available for free projects (enable with tags: saas-linux-medium-amd64), are twice as big, and consume twice as many minutes (https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html#additional-costs-on-gitlab-saas).
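If we went that way, routing a single memory-hungry job to the bigger SaaS machine would just be a tag (hypothetical fragment, job name illustrative):

```yaml
ci-fiat_crypto:
  tags:
    - saas-linux-medium-amd64   # twice the size of the default SaaS runner, billed at twice the minute rate
```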
Do we know how much total RAM and CPU would be sufficient to run Coq CI reasonably efficiently?
my experience is that 4 cores and something like 10-12 GB RAM works well
but not sure this would be even enough that outliers like Fiat Crypto work as expected
I more meant how much total GB and total CPU we need to run Coq CI efficiently.
Fiat Crypto currently peaks at a bit over 4GB per thread/process. Last I checked, our heaviest (> 2GB) files memory-wise are
1m02.21s | 4379108 ko | Bedrock/End2End/RupicolaCrypto/Derive.vo
3m06.05s | 2930324 ko | rupicola/bedrock2/compiler/src/compiler/FlatToRiscvFunctions.vo
2m02.81s | 2751688 ko | rupicola/bedrock2/compiler/src/compiler/FlatToRiscvMetric.vo
5m30.49s | 2649632 ko | Bedrock/Field/Synthesis/Examples/p224_64_new.vo
10m08.34s | 2582048 ko | Bedrock/End2End/X25519/GarageDoor.vo
4m17.69s | 2500784 ko | Assembly/WithBedrock/Proofs.vo
1m49.62s | 2252340 ko | Fancy/Barrett256.vo
0m34.38s | 2216436 ko | ExtractionOCaml/bedrock2_solinas_reduction.ml
0m33.70s | 2199600 ko | ExtractionOCaml/with_bedrock2_word_by_word_montgomery.ml
0m32.69s | 2199516 ko | ExtractionOCaml/bedrock2_word_by_word_montgomery.ml
0m39.68s | 2147996 ko | ExtractionOCaml/word_by_word_montgomery
0m44.54s | 2147880 ko | ExtractionOCaml/bedrock2_solinas_reduction
0m43.08s | 2147548 ko | ExtractionOCaml/solinas_reduction
1m56.73s | 2140160 ko | fiat-rust/src/p384_scalar_32.rs
1m56.65s | 2139664 ko | fiat-java/src/FiatP384Scalar.java
0m32.10s | 2130396 ko | ExtractionOCaml/solinas_reduction.ml
0m16.34s | 2106988 ko | ExtractionHaskell/bedrock2_solinas_reduction.hs
0m16.60s | 2106956 ko | ExtractionHaskell/with_bedrock2_solinas_reduction.hs
0m31.79s | 2095372 ko | ExtractionOCaml/word_by_word_montgomery.ml
0m29.60s | 2092128 ko | ExtractionOCaml/with_bedrock2_unsaturated_solinas.ml
0m30.75s | 2091876 ko | ExtractionOCaml/bedrock2_unsaturated_solinas.ml
1m56.18s | 2088112 ko | fiat-c/src/p384_scalar_32.c
0m42.31s | 2088052 ko | ExtractionOCaml/bedrock2_word_by_word_montgomery
0m41.92s | 2084496 ko | ExtractionOCaml/with_bedrock2_word_by_word_montgomery
4m29.40s | 2077784 ko | Rewriter/Passes/ArithWithCasts.vo
7m09.75s | 2070516 ko | Curves/Weierstrass/AffineProofs.vo
1m56.19s | 2056320 ko | fiat-go/32/p384/p384.go
1m30.01s | 2053492 ko | SlowPrimeSynthesisExamples.vo
If we're careful enough, 2GB--3GB per coq process is enough for most fiat-crypto files. I think much of fiat-crypto can usefully use 6--8 cores in parallel (and I've gone through some pains to achieve good file-level parallelism), though we are sometimes blocked on single slow files (like the 10m GarageDoor file that depends on almost everything else in the repo). I think with 14 GB RAM fiat-crypto could safely be built with -j4 -async-proofs-j 1, which would be pretty good. I think 8 GB RAM might just barely be enough for -j2 -async-proofs-j 1.
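As a hedged sketch only, that sizing could translate into a dedicated job like the one below, assuming a runner with ~14 GB RAM and that COQEXTRAFLAGS reaches coqc (true for coq_makefile-generated makefiles; fiat-crypto's custom makefile may need a different hook):

```yaml
ci-fiat_crypto:
  variables:
    NJOBS: "4"   # 4 parallel coqc processes at 2-3 GB each
  script:
    # assumption: the flags are forwarded to coqc via COQEXTRAFLAGS
    - make -j "$NJOBS" COQEXTRAFLAGS="-async-proofs-j 1"
```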
Karl Palmskog said:
right, yes, after Maxime added custom runners, we had more OOMs in the Coq opam archive. I think the old ones had at least 7 GB of memory, now it might be something like 4 GB.
Which tags were you using?
AFAICT, the default gitlab shared runner has 3.75 Gb RAM (https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html)
Oh sorry, I see that @Paolo Giarrusso pointed to the same link
So I'm a bit lost: did I fail to reproduce the same setup as before?
Indeed, for some reason the setup is not the same as before AFAIU, but I don't know why.
@Maxime Dénès @Gaëtan Gilbert Would it be fine to re-enable shared runners for the Coq repository to boost the PR processing cycle during the release process? I can monitor closely our usage and we can decide upfront a limit that we do not want to go over (e.g., 30%).
do as you like, just tell us when you change something
we were fine wasting many percent last month so 30% is pretty conservative imo
Done.
So we are indeed burning CI minutes at a fast rate with shared runners on. We jumped from about 6% use this morning to 26%.
We've already reached 34% use. I've disabled the shared runners. Apparently, this doesn't stop the current running jobs (which is rather good in this case).
I'm adding more runners
what's our current strategy for jobs that need privileged docker? https://gitlab.com/coq/coq/-/jobs/3433179724 is stuck
The strategy I suggested was to save our shared runner minutes for them
ok I'm enabling shared runners until they pick up the job then disabling them
it got picked up fast wow
Maxime Dénès said:
I'm adding more runners
where are they coming from btw?
Inria's CloudStack
We got an extra allowance?
?
Ah no, these were the 8 stopped runners, right?
Does it mean that coq-docker-machine-driver should be removed?
Also, it would be really good to get documentation of this infrastructure since it looks like it's going to remain like this indefinitely. This would be helpful so that others than you, Maxime, can act when there are maintenance issues.
Théo Zimmermann said:
Does it mean that coq-docker-machine-driver should be removed?
Yes.
@Maxime Dénès A job failed with "no space left on device". I think that's on one of the new runners: https://gitlab.com/coq/coq/-/jobs/3451330182