Stream: Coq devs & plugin devs

Topic: CI pipeline minutes running out


view this post on Zulip Théo Zimmermann (Oct 03 2022 at 17:03):

They have updated the doc, so it looks like a change on purpose.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 17:03):

But sure, if there were no announcements, we should probably contact them to say that we need time to adjust.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 17:04):

I personally haven't checked whether there was any announcement or Twitter discussion about this.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 17:07):

We would only max out a 64-core server on special weeks, for the rest that'd be fine

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 17:09):

Pierre-Marie Pédrot said:

Honestly I don't see what could go wrong with deactivating everything. Thanks to the magic of VCS, worst case we can revert or tweak the result.

Not sure I understand what you propose here. Deactivating everything means we don't have CI, so who's gonna do the testing / revert when a PR lands and breaks everything?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 17:09):

If I understand correctly, since the PR has been merged, we should block PR merges too?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 17:12):

There is a "talk to an expert" button on https://about.gitlab.com/sales/. Can somebody ask them what is going on?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 17:13):

something like this: (I won't send it)

Hi, you seem to have changed how the minutes for the runners are being calculated. The doc also reflects this change.

I find this somewhat absurd, as there was no announcement of said change and it is heavily affecting our development CI pipeline.

Since the Coq organization is paying for the runner services, it seems somewhat predatory that the calculation of minutes can be changed like this.

Please clarify the situation.

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 17:46):

we're not paying afaik

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:24):

Like @Guillaume Melquiond, I'm not confident at all that Inria's Gitlab CI will handle the load, especially if several large projects migrate simultaneously

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:26):

What I would do, is in parallel:

I can initiate both.

view this post on Zulip Guillaume Melquiond (Oct 03 2022 at 18:31):

By the way, I am noticing that we are currently using 4TB (!) of storage on gitlab.com. So, runners might not be our only issue.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:34):

I think we can sequence the operations:

  1. Restore our current CI by adding workers
  2. Improve our usage of resources and targets (better coqbot workflow, gitlab.inria.fr target, etc.)

I agree it's better to not rush 2 too much. Doing a bit of temporary work for 1 does not seem problematic to me.

view this post on Zulip Guillaume Melquiond (Oct 03 2022 at 18:35):

To put things into perspective, in three weeks (if I understand correctly), the cost will be $6 per GB, so let us be ready for a massive bill.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:37):

Yes, we need to address it. Still, the TV news isn't reporting that we have CI issues, so I guess we can take the time to do things in order :)

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:38):

@Guillaume Melquiond Do you know what this space is used for? Do we store artifacts for all builds?

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:39):

And don't we have a retention policy?

view this post on Zulip Guillaume Melquiond (Oct 03 2022 at 18:39):

It seems we are using 3894.7 GiB for job artifacts.

view this post on Zulip Guillaume Melquiond (Oct 03 2022 at 18:40):

https://gitlab.com/groups/coq/-/usage_quotas

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 18:41):

build:base is 300MB, then we have all the other stuff so multiple GB / PR I guess

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:41):

Let's create a channel for the communication actions that we take on this CI issue. We will use this one to discuss the situation and what should be done.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:44):

Was there some communication done to coq devs about the process they should currently follow to merge PRs?

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 18:49):

also artifacts for the latest commit on a branch are never deleted
with our coqbot system that means artifacts for open PRs
so for instance this 11-month-old job still has 270MB of artifacts, and then there are the other jobs in the pipeline https://gitlab.com/coq/coq/-/jobs/1739082131

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:50):

yeah that's what I meant by retention policy

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 18:51):

that's actually a setting https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html#keep-artifacts-from-most-recent-successful-jobs so maybe we should unset it
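
For context, that "keep artifacts from most recent successful jobs" toggle is a project-level setting (under Settings > CI/CD > Artifacts), while per-job retention is configured in .gitlab-ci.yml via expire_in. A minimal, hypothetical sketch follows (the job name, command, path and duration are illustrative, not Coq's actual configuration); note that while the project-level toggle is on, artifacts from the latest successful pipeline on each ref are kept even past their expire_in:

```yaml
# Hypothetical job: upload artifacts but let them expire automatically,
# so open-but-stale PR branches stop accumulating storage.
build:something:
  script:
    - make world          # placeholder build command
  artifacts:
    paths:
      - _build/           # placeholder path
    expire_in: 1 week     # honored once the "keep latest artifacts" project setting is unchecked
```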

view this post on Zulip Ali Caglayan (Oct 03 2022 at 18:55):

Should I refrain from pushing to PRs for the time being?

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:58):

@Gaëtan Gilbert Should @Ali Caglayan rebase his PRs to get only the reduced CI or not?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 18:59):

The PR I am thinking about already has a lot of the jobs removed, but I am unsure how severe the situation is today. I don't want to use up resources that could be better used for others.

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 19:14):

Maxime Dénès said:

Gaëtan Gilbert Should Ali Caglayan rebase his PRs to get only the reduced CI or not?

no need to rebase, coqbot auto merges

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 19:15):

Ali Caglayan said:

Should I refrain from pushing to PRs for the time being?

push on, it will help tell us if we reduced enough

view this post on Zulip Ali Caglayan (Oct 03 2022 at 19:30):

Looks reduced https://gitlab.com/coq/coq/-/pipelines/657296145

view this post on Zulip Michael Soegtrop (Oct 03 2022 at 19:52):

I came a bit late to this thread. I guess Coq Platform is not affected since I am using only GitHub Actions?

view this post on Zulip Maxime Dénès (Oct 03 2022 at 19:53):

I guess so, this seems to be a gitlab.com policy change.

view this post on Zulip Erik Martin-Dorel (Oct 03 2022 at 19:57):

That's very bad news :-(

BTW regarding the docker-coq infrastructure in coq-community: the docker-keeper tooling specifically relies on GitLab CI.
Migrating to GitHub Actions is just not feasible (e.g. YAML anchors are not supported by GHA, nor is the generation of child pipelines from a dynamic YAML…)

view this post on Zulip Erik Martin-Dorel (Oct 03 2022 at 19:57):

Fortunately, the auto-build/deployment of coqorg/* images should not be as costly in CI/CD minutes as PR pipelines.
But I cannot check the quotas myself in https://gitlab.com/groups/coq-community/-/usage_quotas#pipelines-quota-tab as only @Théo Zimmermann can → you may want to take a look at the row for coq-community/docker-coq, in order to extrapolate.

view this post on Zulip Ali Caglayan (Oct 03 2022 at 20:00):

Here is the GitLab issue with the changes: https://gitlab.com/gitlab-org/gitlab/-/issues/337245

view this post on Zulip Erik Martin-Dorel (Oct 03 2022 at 20:27):

I googled a bit more and, according to this blog article https://about.gitlab.com/blog/2021/11/11/public-project-minute-limits/ that I had not spotted before, this drastic change is intended to fight cryptomining via CI pipelines.
But it's a pity that FOSS projects like Coq, which consume much more pipeline time than the average project, are badly impacted by these changes… Maybe we could reach them at the email opensource@gitlab.com mentioned in https://about.gitlab.com/handbook/marketing/community-relations/opensource-program/, just to let them know (?)
my 2 cents…

view this post on Zulip Erik Martin-Dorel (Oct 03 2022 at 20:33):

Finally FYI, it seems that there's another thing to double-check now:
the Namespace Storage Quota: https://gitlab.com/groups/math-comp/-/usage_quotas#storage-quota-tab

as according to https://docs.gitlab.com/ee/user/usage_quotas.html#namespace-storage-limit-enforcement-schedule

Storage limits for GitLab SaaS Free tier namespaces will not be enforced prior to 2022-10-19. Storage limits for GitLab SaaS Paid tier namespaces will not be enforced prior to 2023-02-15.

IINM, coq and coq-community are not impacted for now, unlike math-comp, which is still in the Free Tier…
(see e.g. this topic)

view this post on Zulip Michael Soegtrop (Oct 03 2022 at 20:35):

I don't quite see how this change keeps people from crypto mining in CI if forks remain at 0.008 - who prevents people from forking their own projects? I guess one could find more intelligent methods for preventing such abuse.

view this post on Zulip Michael Soegtrop (Oct 03 2022 at 20:37):

Maybe it is possible to talk to them and get an exception.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 23:01):

Ok, I got something automated enough (snapshots + Ansible playbook). I should be able to add ~8 runners tomorrow.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 23:03):

Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?

view this post on Zulip Guillaume Melquiond (Oct 04 2022 at 00:46):

Michael Soegtrop said:

Maybe it is possible to talk to them and get an exception.

I very much doubt so. Our quota of 50'000 minutes corresponds to their "ultimate" offer, which normally costs $99 per user and which we are currently getting for free. There is not even a higher tier for paying customers. So, I don't see how they could provide an even higher tier for non-paying customers. Also keep in mind that their "free" offer is just 400 minutes, so we are currently getting 125x that. In other words, we are very lucky and we should not push our luck.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:02):

Maxime Dénès said:

Was there some communication done to coq devs about the process they should currently follow to merge PRs?

Communication happened through this message: https://github.com/coq/coq/pull/16600#issuecomment-1265773392. I know it's not ideal, but any dev with push rights should have gotten it.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:02):

Gaëtan Gilbert said:

that's actually a setting https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html#keep-artifacts-from-most-recent-successful-jobs so maybe we should unset it

Yes, we should definitely do that. Has something changed recently? It didn't use to be the case that old PRs had their artifacts kept.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:02):

Erik Martin-Dorel said:

Fortunately, the auto-build/deployment of coqorg/* images should not be as costly in CI/CD minutes as PR pipelines.
But I cannot check the quotas myself in https://gitlab.com/groups/coq-community/-/usage_quotas#pipelines-quota-tab as only Théo Zimmermann can → you may want to take a look at the row for coq-community/docker-coq, in order to extrapolate.

The entire coq-community organization's record CI consumption was 18,000 minutes, in February 2022. We are allowed 100,000. So no, Docker-Coq will not be affected. The case of the Coq project is really special since we are spending so much.
Regarding storage use, it is 6.4 GiB for coq-community, including 3.41 GiB for Docker-Coq artifacts and 2.8 GiB for the coq-performance-tests repository (we could stop mirroring it there if needed).

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:03):

Guillaume Melquiond said:

Michael Soegtrop said:

Maybe it is possible to talk to them and get an exception.

I very much doubt so. Our quota of 50'000 minutes corresponds to their "ultimate" offer, which normally costs $99 per user and which we are currently getting for free. There is not even a higher tier for paying customers. So, I don't see how they could provide an even higher tier for non-paying customers. Also keep in mind that their "free" offer is just 400 minutes, so we are currently getting 125x that. In other words, we are very lucky and we should not push our luck.

The only thing that might be reasonable to ask is for a delay in order to resize our CI infrastructure.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:03):

Emilio Jesús Gallego Arias said:

We would only max out a 64-core server on special weeks, for the rest that'd be fine

Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average. Given our memory consumption, it's not clear that 18.5 runners would fit on a single 64-core server (it would need lots of RAM) and this is not even the usage peak, this is the average over a full month, days and nights.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:03):

Maxime Dénès said:

Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?

Why should we disable shared runners in case you add VMs without tags? BTW, 8 runners won't be sufficient to entirely replace our shared runner usage if we go back to our full CI according to my calculation above.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 07:06):

Ok, so we'd need a combination of reduced CI + self hosted runners, right?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 07:07):

(and studying a migration to gitlab.inria.fr in parallel, that is, talking with the team there about the capacity of their service)

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:12):

Yes, I think this is a good summary.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:48):

Théo Zimmermann said:

Emilio Jesús Gallego Arias said:

We would only max out a 64-core server on special weeks, for the rest that'd be fine

Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average. Given our memory consumption, it's not clear that 18.5 runners would fit on a single 64-core server (it would need lots of RAM) and this is not even the usage peak, this is the average over a full month, days and nights.

Inria runners are considerably faster, so the amount of minutes would be less.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:49):

Also, a few jobs are responsible for 60% of the build time (see the analysis when we discussed removing the async job)

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:50):

Coq Universe provides a good reference IMHO for how much a full CI costs: around 1h on an 8-core machine with 32 GiB of RAM

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:50):

That's without caching

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:51):

on a GitHub Actions runner, the vo build takes something like 2.5 hours

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:51):

with just 2 cores

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:51):

with caching it varies

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:51):

Emilio Jesús Gallego Arias said:

Also, a few jobs are responsible for 60% of the build time (see the analysis when we discussed removing the async job)

If you have such data available, I'm very interested to look at them.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:51):

The math is easy, look at the jobs and sort by time

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:53):

How do you do that?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:53):

The original thread is https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/CI.20for.20base.2Basync.20timing.20out

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:56):

Thanks! How did you get these figures? I can't seem to figure out how to extract them for a GitLab pipeline.

Total CI time without the heavy ones: 665
Total CI time with (unimath, vst, and fiat-crypto): 1155

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:59):

I just cut and paste + emacs macro

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:59):

link

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:01):

https://hackmd.io/PGrScXsoQVSLckPu3CXvIw

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:02):

Give me any pipeline and I will produce the table

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 08:02):

Maxime Dénès said:

Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?

no tags imo

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:04):

Emilio Jesús Gallego Arias said:

Give me any pipeline and I will produce the table

So indeed we can optimize a lot; for example, if we use dune, checking compilation of the Coq ML part + all plugins takes just a couple of minutes

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:04):

maybe 3, 4 max (depends on the runner)

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:05):

On my machine, 22 secs

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 08:14):

the lint job does that and in your table it took 03:37

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 08:15):

it also does a bit of other stuff (whitespace checks etc)

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:41):

It took 3:37 total

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:42):

Boot time, checkout etc...

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:42):

I'm talking about the actual command only

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:44):

There is a lot of variability in boot time too

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:44):

in runner boot time

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:46):

But total cpu time is almost 4 minutes

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:17):

We should also really do something about the test-suite jobs running forever because they start swapping or whatever.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:17):

In this reduced pipeline, the base test-suite job was running for over 90 minutes before I killed it: https://gitlab.com/coq/coq/-/pipelines/657639626/builds

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:18):

Can you run your macros on this pipeline anyway?

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:19):

I'd like a comparison with the FULL_CI ones.

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 09:28):

Should we write a PR to get some logging information about the test-suite taking ages?

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 09:28):

Cancelling them after a timeout is a workaround, but we should get to the root of the issue.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:29):

Some have hypothesized swapping. Could we auto-detect if swapping happens and kill the test-suite immediately in that case?
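
One way this could be prototyped, as a sketch only (the job name and make target are assumptions, not the actual Coq CI configuration): run the test-suite in the background and poll /proc/meminfo, aborting as soon as a noticeable amount of swap is in use rather than letting the job crawl for 90+ minutes.

```yaml
# Hypothetical watchdog sketch, assuming a Linux runner with bash.
test-suite-with-swap-watchdog:
  script:
    - |
      make test-suite &                 # assumed entry point for the test-suite
      TEST_PID=$!
      while kill -0 "$TEST_PID" 2>/dev/null; do
        # swap in use (kB) = SwapTotal - SwapFree
        used=$(awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)
        if [ "${used:-0}" -gt 102400 ]; then    # more than ~100 MB swapped out
          echo "Swapping detected, aborting the test-suite early"
          kill "$TEST_PID"
          exit 1
        fi
        sleep 30
      done
      wait "$TEST_PID"                  # propagate the test-suite's exit status
```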

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:48):

@Théo Zimmermann

| Status | Job | Stage | Name | Duration |
|--------|-----|-------|------|----------|
| passed | #3122706083 pr-16601 0183e2d4 | build | plugin:plugin-tutorial |  00:05:30 |
| passed | #3122706082 pr-16601 0183e2d4 | build | validate:vio |  00:02:32 |
| passed | #3122706080 pr-16601 0183e2d4 | build | validate:base |  00:04:45 |
| passed | #3122706079 pr-16601 0183e2d4 | build | test-suite:edge:dev |  00:25:58 |
| canceled | #3122706078 pr-16601 0183e2d4 | build | test-suite:base |  01:31:22 |
| passed | #3122706075 pr-16601 0183e2d4 | build | doc:ml-api:odoc |  00:04:11 |
| passed | #3122706070 pr-16601 0183e2d4 | build | doc:stdlib |  00:04:29 |
| passed | #3122706064 pr-16601 0183e2d4 | build | doc:refman-pdf |  00:11:08 |
| passed | #3122706061 pr-16601 0183e2d4 | build | doc:refman |  00:07:42 |
| passed | #3122706058 pr-16601 0183e2d4 | build | pkg:nix |  00:21:46 |
| passed | #3122706055 pr-16601 0183e2d4 | build | lint |  00:05:02 |
| passed | #3122706051 pr-16601 0183e2d4 | build | build:vio |  00:25:44 |
| passed | #3122706049 pr-16601 0183e2d4 | build | build:edge+flambda:dev |  00:18:48 |
| passed | #3122706046 pr-16601 0183e2d4 | build | build:base | 00:17:46 |

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:48):

oh sorry let me sort it

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:49):

| Status   | Job                           | Stage | Name                   | Duration |
|----------+-------------------------------+-------+------------------------+----------|
| canceled | #3122706078 pr-16601 0183e2d4 | build | test-suite:base        | 01:31:22 |
| passed   | #3122706079 pr-16601 0183e2d4 | build | test-suite:edge:dev    | 00:25:58 |
| passed   | #3122706051 pr-16601 0183e2d4 | build | build:vio              | 00:25:44 |
| passed   | #3122706058 pr-16601 0183e2d4 | build | pkg:nix                | 00:21:46 |
| passed   | #3122706049 pr-16601 0183e2d4 | build | build:edge+flambda:dev | 00:18:48 |
| passed   | #3122706046 pr-16601 0183e2d4 | build | build:base             | 00:17:46 |
| passed   | #3122706064 pr-16601 0183e2d4 | build | doc:refman-pdf         | 00:11:08 |
| passed   | #3122706061 pr-16601 0183e2d4 | build | doc:refman             | 00:07:42 |
| passed   | #3122706083 pr-16601 0183e2d4 | build | plugin:plugin-tutorial | 00:05:30 |
| passed   | #3122706055 pr-16601 0183e2d4 | build | lint                   | 00:05:02 |
| passed   | #3122706080 pr-16601 0183e2d4 | build | validate:base          | 00:04:45 |
| passed   | #3122706070 pr-16601 0183e2d4 | build | doc:stdlib             | 00:04:29 |
| passed   | #3122706075 pr-16601 0183e2d4 | build | doc:ml-api:odoc        | 00:04:11 |
| passed   | #3122706082 pr-16601 0183e2d4 | build | validate:vio           | 00:02:32 |

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:49):

I'd cut the builds except for flambda + the flambda test suite

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:49):

build:vo and build:base

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:50):

and the validate jobs

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:50):

are a waste of time in a minimal CI

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:50):

and instead I'd add the composed plugin build

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 11:52):

This is somewhat making the CI useless to me at least. I can run the test-suite locally faster than what the CI provides.

view this post on Zulip Matthieu Sozeau (Oct 04 2022 at 11:53):

I’m just finding this out, shall I try to contact the guys at gitlab open source to tell them about our situation then?

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 11:56):

Sure, this is worth trying to at least alert them that we would appreciate some additional delay to sort things out.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 12:09):

Pierre-Marie Pédrot said:

This is somewhat making the CI useless to me at least. I can run the test-suite locally faster than what the CI provides.

I'm not sure I follow, PM; of course local computers are faster than the free VMs

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 12:09):

Though the gitlab.inria.fr large instances are much faster

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 12:13):

@Emilio Jesús Gallego Arias my typical use of the CI is to run stuff that is annoying to run. If we only keep a tiny subset of the CI, then it becomes trivial, because I always run the test-suite locally. By contrast, running all the other tests is typically bothersome because you need some setup, a network connection and parallel execution (if you go for the full CI with many devs).

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 12:13):

The question is not whether the free VM are slow, the question is the trade-off between the time where my PC is doing something vs. some stuff running overnight on the internet

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 12:14):

For a CI I can run in less than 10 minutes on my PC, this is not very clear.

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:18):

I think we want to get some private runners (or use the gitlab.inria ones if there are enough), then put back most or all of the jobs, possibly merging jobs if that exploits the runner CPUs better (i.e. a coq-universe job instead of a bunch of jobs)

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:19):

I see the current job disabling as a temporary measure to avoid breaking completely and to keep opam-coq-archive (which did nothing wrong) working

view this post on Zulip Matthieu Sozeau (Oct 04 2022 at 12:20):

I sent a mail with a plea for a grace period, let's see what they say.

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:24):

also I'm unchecking the "Keep artifacts from most recent successful jobs" setting, I don't think it's especially useful to have for us

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 12:41):

We should keep them for the bench though.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:42):

I added 8 runners, you may want to put back some jobs.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:44):

I will now initiate a discussion with our services about the capacity and security of gitlab.inria.fr.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:45):

Let's not hurry to put back some jobs / let's check how our usage evolves. Even though the discussion started less than 24h ago, we have still spent about 8% of our quota.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:45):

It is not obvious that we can use it, because the current model is allowing only sponsored Inria accounts, whereas what we want to do is essentially to mirror arbitrary workloads.

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:45):

Pierre-Marie Pédrot said:

We should keep them for the bench though.

why?

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:46):

Maxime Dénès said:

It is not obvious that we can use it, because the current model is allowing only sponsored Inria accounts, whereas what we want to do is essentially to mirror arbitrary workloads.

What do you mean?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:46):

Aren't we exposing code execution on our runners to anybody?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:47):

Anyway let's discuss these aspects privately.

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:50):

what happened to coq-gitlab-runner2.ci ?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:51):

ah ah I was waiting for someone to ask

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:51):

are you sure you want to know?

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:52):

yes

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:53):

should we try disabling shared runners, enable full ci and see how big the queue gets?

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:53):

Sure, that's a good experiment to make!

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:54):

And it's also easy to do: no PR to merge, just a variable to set.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:54):

the CloudStack Inria API doesn't let you destroy resources completely, you have to wait for a 24h cron to do it

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:54):

I created a runner numbered 2 then deleted it, so I can't reuse the name yet

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:54):

similar pb for storage

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 12:54):

@Gaëtan Gilbert I want the bench artifacts to have a more precise look at the timings. There are HTML files to display visually the results, and furthermore I've got a bunch of scripts to parse this and extract relevant data.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:55):

I don't know why they put this Jenkins-related logic in their Cloud backend, it is a very bad separation of concerns

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:55):

Théo Zimmermann said:

Sure, that's a good experiment to make!

let's go then

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:55):

Pierre-Marie Pédrot said:

Gaëtan Gilbert I want the bench artifacts to have a more precise look at the timings. There are HTML files to display visually the results, and furthermore I've got a bunch of scripts to parse this and extract relevant data.

Gaëtan is not removing artifacts for all branches, he's just disabling artifacts for stale branches AFAIU.

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:56):

I would say enabling normal expiration times rather than disabling but I guess not a lot of difference

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:58):

@Maxime Dénès how beefy are the vms you added? should we increase the -j we use?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 13:02):

they have 2 vCPUs and 4 GB RAM

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 13:02):

already queuing in 1 pr https://gitlab.com/coq/coq/-/pipelines/657973015

view this post on Zulip Maxime Dénès (Oct 04 2022 at 13:02):

it was the closest I could find to the instances used by GitLab's shared runners

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 13:02):

current -j2 stays then

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 16:16):

the runners don't allow ulimit so all the scripts using it are failing, eg https://gitlab.com/coq/coq/-/jobs/3124630681

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 16:23):

also https://gitlab.com/coq/coq/-/jobs/3124630609 (ocamlopt fails in schemeequality.v) is interesting considering how that test kept hanging

view this post on Zulip Pierre Roux (Oct 04 2022 at 16:46):

If we need them to go through, we could deactivate native compilation, that was the only reason for ulimit IIRC.

view this post on Zulip Guillaume Melquiond (Oct 04 2022 at 16:59):

Gaëtan Gilbert said:

the runners don't allow ulimit

No, they do allow ulimit, but we have forbidden it. We need to modify .gitlab-ci.yml line 200 and add option -S.
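
For reference, `-S` restricts `ulimit` to the soft limit: in bash, `ulimit` without `-S` or `-H` sets both the soft and the hard limit, and an unprivileged process cannot raise its hard limit once it has been lowered. A hypothetical illustration (the value and the job are made up, not the actual contents of .gitlab-ci.yml line 200):

```yaml
# Hypothetical snippet: adjust only the soft stack limit before building.
some-job:
  script:
    - ulimit -S -s 16384   # soft limit only; a plain `ulimit -s 16384` would also touch the hard limit
    - make world           # placeholder
```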

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 17:00):

https://github.com/coq/coq/pull/16612

view this post on Zulip Olivier Laurent (Oct 04 2022 at 19:59):

What should be done with ongoing PRs (in particular those ready to merge)? Wait a bit until solutions to the CI troubles are settled?
I am not sure I understand what is going on. Is https://gitlab.com/coq/coq/pipelines/new?var[FULL_CI]=true&ref=pr-XXXXX to be run or not?

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 07:15):

@Olivier Laurent at the moment the full CI pipelines are on again by default, but that may not last. If you want to merge a PR, just check that the GitLab pipeline status check has the usual long list of jobs. If yes, it can be merged. If not, it should get a full CI run first.

view this post on Zulip Thierry Martinez (Oct 05 2022 at 08:41):

Maxime Dénès said:

the CloudStack Inria API doesn't let you destroy resources completely, you have to wait for a 24h cron to do it

If there is a need to go faster, you can ask a CI admin to destroy the resource before the cron job (but, of course, it only works if a CI admin is available... I was attending a training course the two last days).

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:04):

Théo Zimmermann said:

Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average.

After reading the documentation and double-checking on the usage webpage, our CI actually uses 1,600,000 minutes per month, but because we are an open source project, they only count for half of it: "The cost factors for jobs running on shared runners on GitLab.com are 0.5 for public projects in the GitLab for Open Source program." So, this amounts to 37 full-time runners on average.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:11):

@Guillaume Melquiond I think you've misunderstood how the usage is reported. There are two tabs: CI/CD minute usage and Shared runner duration. The latter is the real one and the former is how it is counted. Thus, in October we have used 80,000 actual shared runner minutes and they have counted as 40,000. In September, the cost factor was different so we have used 800,000 actual shared runner minutes and they have counted as 32,000.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:13):

Our peak was in March: 820,000 actual shared runner minutes.

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:14):

Right, my mistake. So, we are only using 800,000 minutes, but we still need to divide this usage by 8.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:15):

And for the month of October, this is even worse since we have already spent 80% of it.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:15):

So we should bring it down to basically zero on the rest of the month (or buy more minutes) if we want the opam-coq-archive CI to remain unaffected.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:16):

Disabling the shared runners on the Coq repository (like we did) is a good way of ensuring this.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:18):

Currently, there is no queue in Coq CI despite the full CI being on and the shared runners disabled, but that may be related to the ulimit issue (maybe that makes jobs fail and thus stop sooner).

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:18):

Regarding opam-coq-archive, I am in the process of cutting part of its CI down, because even if it is a few orders of magnitude smaller than Coq itself, it is still blowing through our quota.

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:19):

(Almost 20% of the quota after just 5 days.)

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:19):

We could also consider migrating the opam-coq-archive CI to GitHub Actions. Do you think that it would be a realistic option?

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:21):

No idea, I don't understand how GitHub Actions works. (And there is still a quota for those, isn't there?)

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:25):

No, AFAIU there is no limit to how many CI minutes you can use. There is a limit to how many parallel runners you get (4 IIRC). So it is the latter that would make it impossible to use for Coq's CI without additional GitHub sponsorship.

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:25):

Ah no, it seems those are unlimited, as long as we don't ask for more than 2 cores.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:26):

"those" being?

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:27):

Github Actions. I was correcting myself.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:27):

https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits seems to indicate that the actual limit is 20 parallel runners on Linux / Windows and 5 on macOS.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:28):

They do not say the scope but I think it's probably the organization.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:29):

A good thing about GitHub Actions is that you get a 6-hour timeout per job, so that's better than what we have on GitLab with the shared runners.

view this post on Zulip Maxime Dénès (Oct 05 2022 at 12:35):

Thierry Martinez said:
If there is a need to go faster, you can ask a CI admin to destroy the resource before the cron job (but, of course, it only works if a CI admin is available... I was attending a training course the two last days).

So you confirm that this 24h delay exists? I received inconsistent answers on the matter. Why bake this CI-related logic in the Cloud backend, and not let us destroy resources that we created ourselves immediately?

view this post on Zulip Maxime Dénès (Oct 05 2022 at 12:37):

The whole point of CloudStack-like tools is to not have to open tickets for routine operations :)

view this post on Zulip Maxime Dénès (Oct 05 2022 at 12:38):

That ci.inria.fr has this kind of business logic I understand, but I'm very surprised that it ends up being restrictions on the CloudStack side.

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:51):

I can confirm that, a few months ago, this delay did exist. (I had to remove a VM because I was over-quota. The VM removal happened instantly, but the associated storage took 24 hours to disappear.)

view this post on Zulip Pierre-Marie Pédrot (Oct 05 2022 at 13:23):

has coqbot fallen to the same worker depletion or what? it doesn't react on https://github.com/coq/coq/pull/16611

view this post on Zulip Pierre-Marie Pédrot (Oct 05 2022 at 13:29):

maybe it's on strike

view this post on Zulip Pierre-Marie Pédrot (Oct 05 2022 at 13:32):

@Théo Zimmermann have you deactivated coqbot to prevent people merging carelessly w.r.t. the non-full CI?

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:43):

No, I didn't do anything.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:43):

I'll have a look :ambulance:

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:45):

That's a GitHub issue: https://www.githubstatus.com/

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:45):

Webhooks are not being delivered correctly.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:45):

So coqbot never received these messages.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:46):

You can insist (some issue comments did trigger the webhook, but I don't know what is the current success rate) or wait for the webhooks to be delivered (they might be with some delay).

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 16:09):

Folks, one problem with disabling the shared runners vs tagging the jobs for our workers is that we can't generate docker images anymore

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 16:20):

Yes, this was an experiment, but since it seems like we are going to keep this setup for a while, we should do a tag-based system to keep the shared runners available, but for Docker images only.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 16:54):

I'm gonna enable them briefly so my docker image is rebuilt

view this post on Zulip Maxime Dénès (Oct 05 2022 at 21:25):

It looks like the queue of jobs is increasing..

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 22:42):

Where do you see the queue?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 22:45):

Here of course https://gitlab.com/coq/coq/-/jobs ...

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 22:45):

I guess we will be somewhat back to the times when we needed the night for CI to catch up

view this post on Zulip Maxime Dénès (Oct 06 2022 at 07:00):

Once we can estimate our needs, we should be able to order runners on an external Cloud provider if needed.

view this post on Zulip Maxime Dénès (Oct 06 2022 at 07:00):

But I also think it is good to try to reduce our usage a bit, for environmental reasons.

view this post on Zulip Maxime Dénès (Oct 06 2022 at 07:00):

(at least the lowest hanging fruit, like what Théo is implementing in coqbot)

view this post on Zulip Théo Zimmermann (Oct 06 2022 at 07:52):

It looks like the queue is still big after the night.

view this post on Zulip Théo Zimmermann (Oct 06 2022 at 07:53):

Hopefully, it will catch up now that we are running light CI pipelines by default on PRs.

view this post on Zulip Maxime Dénès (Oct 07 2022 at 14:11):

Btw @Thierry Martinez kindly improved the configuration of Inria's CloudStack, so there's hope we can make quick progress on the way these new runners are orchestrated.

view this post on Zulip Gaëtan Gilbert (Oct 10 2022 at 15:04):

Maxime Dénès said:

Ok, I got something automated enough (snapshots + Ansible playbook). I should be able to add ~8 runners tomorrow.

is your automation public?

view this post on Zulip Maxime Dénès (Oct 10 2022 at 15:17):

That one, no, but I'm almost done with a better one, properly orchestrating ephemeral VMs, which is being developed here: https://gitlab.inria.fr/mdenes/docker-machine-cloudstack

I will stabilize it, then replace the current runners with this orchestration. Warning: it is currently fully undocumented and it moves and breaks often.

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:00):

The new runners seem to work fine so far. @Gaëtan Gilbert FYI, they should be able to handle the docker boot job as well.

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:00):

(not tested yet)

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:01):

It should be fairly easy to add compute from an external cloud provider using the same orchestration, btw.

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:03):

Oh, and also: these runners are now available to the coq group, not just the coq project. @Théo Zimmermann I think you needed it for another project, but I don't recall exactly.

view this post on Zulip Théo Zimmermann (Oct 11 2022 at 15:03):

Not me, the opam-coq-archive maintainers.

view this post on Zulip Théo Zimmermann (Oct 11 2022 at 15:04):

Well, technically, they were fine with the shared runners apparently (except for long-running jobs, but that's not new).

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:13):

@Thierry Martinez Thanks again for your help on this. I will try to document this orchestration and its deployment, in case it is useful to other projects.

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:25):

About the macos worker, if it is only used for storage occasionally, can I snapshot and delete it? @Théo Zimmermann @Michael Soegtrop

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:25):

It would probably free resources for two more runners.

view this post on Zulip Théo Zimmermann (Oct 11 2022 at 15:25):

:+1: from me if you have the :+1: from Michael

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:32):

We should probably also think about our usage of the bench machines. Am I right that they are idle most of the time?

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:32):

We could use them for CI when they are not used for benchmarking.

view this post on Zulip Théo Zimmermann (Oct 11 2022 at 15:34):

Absolutely, if you find a way of making sure that bench jobs can "lock" them (and of course they wait until they are fully "locked" before starting running).

view this post on Zulip Maxime Dénès (Oct 12 2022 at 11:24):

@Gaëtan Gilbert Am I right that pendulum is unused these days?

view this post on Zulip Maxime Dénès (Oct 12 2022 at 11:24):

(if so, I can add some runners on it to improve a bit our CI situation)

view this post on Zulip Gaëtan Gilbert (Oct 12 2022 at 11:29):

yes (it's set as paused in the runner list currently)

view this post on Zulip Jim Fehrle (Oct 18 2022 at 23:09):

A couple thoughts:

view this post on Zulip Théo Zimmermann (Nov 05 2022 at 13:22):

In the end, we finished October having only used 86% of our CI allowance and 5 days into November we have only used 1% (since shared runners are only enabled on opam-coq-archive). Any opposition to enabling shared runners back on the Coq repo as well and monitoring how it goes? (With the light pipelines by default and our own runners enabled, we do not risk CI minutes running out as quickly as in the beginning of October.)

view this post on Zulip Gaëtan Gilbert (Nov 09 2022 at 17:31):

would it be useful?

view this post on Zulip Théo Zimmermann (Nov 10 2022 at 07:47):

Well, sometimes pipelines are waiting quite a lot in queue because of limited runner availability. We also still haven't seen a fully green CI for several weeks because of fiat-crypto and VST failing on the new runners (or has this been solved now?). Furthermore, since these shared runners are free, there would not be much point in not using them if they can help.

view this post on Zulip Pierre-Marie Pédrot (Nov 14 2022 at 13:59):

VST CI should now be passing on master after https://github.com/coq/coq/pull/16795

view this post on Zulip Pierre-Marie Pédrot (Nov 14 2022 at 14:00):

but fiat-crypto is still using too much memory, as it seems that its custom makefile ignores the NJOBS option

view this post on Zulip Pierre-Marie Pédrot (Nov 14 2022 at 14:00):

CI has been broken for more than a month, we should really do something about it

view this post on Zulip Pierre-Marie Pédrot (Nov 14 2022 at 14:01):

there was the idea to use tags for memory-intensive runs, is it hard to implement?

view this post on Zulip Théo Zimmermann (Nov 15 2022 at 08:18):

I don't recall exactly how tags work, but we could probably do something like: tag our runners, tell them to only run tagged jobs, tag all our jobs with the same tag, make fiat-crypto and a few other jobs the only ones without tags, and re-enable the shared runners so that we use them for these.
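
A rough sketch of what that could look like in .gitlab-ci.yml (the tag, job names and commands are placeholders, assuming the self-hosted runners are registered to pick up only tagged jobs):

```yaml
# Hypothetical tag-based routing, not the actual Coq configuration.
.on-own-runners:
  tags:
    - coq-self-hosted      # placeholder tag carried by our own runners

build:something:
  extends: .on-own-runners # most jobs: routed to the self-hosted runners
  script:
    - make world           # placeholder

heavy:fiat-crypto:
  # deliberately left untagged, so it is picked up by GitLab's shared
  # runners (once re-enabled), which handle its memory usage better
  script:
    - make ci-fiat_crypto  # placeholder
```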

view this post on Zulip Gaëtan Gilbert (Nov 15 2022 at 09:14):

do the shared runners actually have more power? https://github.com/coq/coq/pull/16801#issuecomment-1314584174 sounds like fiat-crypto started using more memory

view this post on Zulip Karl Palmskog (Nov 15 2022 at 09:16):

I think the shared runners actually have less memory than GitHub runners. For example, in the Coq opam archive, we started seeing OOMs more frequently with shared runners

view this post on Zulip Théo Zimmermann (Nov 15 2022 at 13:03):

Karl Palmskog said:

I think the shared runners actually have less memory than GitHub runners. For example, in the Coq opam archive, we started seeing OOMs more frequently with shared runners

What do you call "GitHub runners" and what do you call "shared runners" exactly? It looks to me like what you want to say is that you started to see more OOMs with "custom runners" (the ones that were added by Maxime) compared to when we only used "shared runners" (the ones made available by GitLab). If that's correct, then this would match my impression of what has happened on the side of the Coq repo (where shared runners had no trouble with VST and fiat-crypto but the custom runners using the former bench machines did).

view this post on Zulip Karl Palmskog (Nov 15 2022 at 13:07):

right, yes, after Maxime added custom runners, we had more OOMs in the Coq opam archive. I think the old ones had at least 7 GB of memory, now it might be something like 4 GB.

view this post on Zulip Michael Soegtrop (Nov 15 2022 at 14:06):

4GB is indeed very tight for building Coq stuff and likely not very efficient because the machine will likely swap (with -j 2).

view this post on Zulip Paolo Giarrusso (Nov 16 2022 at 00:44):

https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html says "1 vCPU, 3.75GB RAM" (which is what I've seen). Medium machines are available for free projects (enable with tags: saas-linux-medium-amd64); they are twice as big and consume twice as many minutes (https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html#additional-costs-on-gitlab-saas)
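
For the record, opting a specific job into those larger SaaS machines is done with that tag, roughly like this (the job and command are hypothetical; only the tag name comes from the message above):

```yaml
# Hypothetical job requesting GitLab's medium Linux SaaS runner,
# which is about twice the size of the default one at twice the minute cost.
memory-hungry-job:
  tags:
    - saas-linux-medium-amd64
  script:
    - make some-heavy-target   # placeholder
```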

view this post on Zulip Michael Soegtrop (Nov 16 2022 at 08:48):

Do we know how much total RAM and CPU would be sufficient to run Coq CI reasonably efficiently?

view this post on Zulip Karl Palmskog (Nov 16 2022 at 08:56):

my experience is that 4 cores and something like 10-12 GB RAM works well

view this post on Zulip Karl Palmskog (Nov 16 2022 at 08:56):

but not sure this would be even enough that outliers like Fiat Crypto work as expected

view this post on Zulip Michael Soegtrop (Nov 16 2022 at 09:32):

I meant more: how much total RAM and total CPU do we need to run Coq CI efficiently?

view this post on Zulip Jason Gross (Nov 16 2022 at 16:28):

Fiat Crypto currently peaks at a bit over 4GB per thread/process. Last I checked, our heaviest (> 2GB) files memory-wise are

  1m02.21s | 4379108 ko | Bedrock/End2End/RupicolaCrypto/Derive.vo
  3m06.05s | 2930324 ko | rupicola/bedrock2/compiler/src/compiler/FlatToRiscvFunctions.vo
  2m02.81s | 2751688 ko | rupicola/bedrock2/compiler/src/compiler/FlatToRiscvMetric.vo
  5m30.49s | 2649632 ko | Bedrock/Field/Synthesis/Examples/p224_64_new.vo
 10m08.34s | 2582048 ko | Bedrock/End2End/X25519/GarageDoor.vo
  4m17.69s | 2500784 ko | Assembly/WithBedrock/Proofs.vo
  1m49.62s | 2252340 ko | Fancy/Barrett256.vo
  0m34.38s | 2216436 ko | ExtractionOCaml/bedrock2_solinas_reduction.ml
  0m33.70s | 2199600 ko | ExtractionOCaml/with_bedrock2_word_by_word_montgomery.ml
  0m32.69s | 2199516 ko | ExtractionOCaml/bedrock2_word_by_word_montgomery.ml
  0m39.68s | 2147996 ko | ExtractionOCaml/word_by_word_montgomery
  0m44.54s | 2147880 ko | ExtractionOCaml/bedrock2_solinas_reduction
  0m43.08s | 2147548 ko | ExtractionOCaml/solinas_reduction
  1m56.73s | 2140160 ko | fiat-rust/src/p384_scalar_32.rs
  1m56.65s | 2139664 ko | fiat-java/src/FiatP384Scalar.java
  0m32.10s | 2130396 ko | ExtractionOCaml/solinas_reduction.ml
  0m16.34s | 2106988 ko | ExtractionHaskell/bedrock2_solinas_reduction.hs
  0m16.60s | 2106956 ko | ExtractionHaskell/with_bedrock2_solinas_reduction.hs
  0m31.79s | 2095372 ko | ExtractionOCaml/word_by_word_montgomery.ml
  0m29.60s | 2092128 ko | ExtractionOCaml/with_bedrock2_unsaturated_solinas.ml
  0m30.75s | 2091876 ko | ExtractionOCaml/bedrock2_unsaturated_solinas.ml
  1m56.18s | 2088112 ko | fiat-c/src/p384_scalar_32.c
  0m42.31s | 2088052 ko | ExtractionOCaml/bedrock2_word_by_word_montgomery
  0m41.92s | 2084496 ko | ExtractionOCaml/with_bedrock2_word_by_word_montgomery
  4m29.40s | 2077784 ko | Rewriter/Passes/ArithWithCasts.vo
  7m09.75s | 2070516 ko | Curves/Weierstrass/AffineProofs.vo
  1m56.19s | 2056320 ko | fiat-go/32/p384/p384.go
  1m30.01s | 2053492 ko | SlowPrimeSynthesisExamples.vo

If we're careful enough, 2GB--3GB per coq process is enough for most fiat-crypto files. I think much of fiat-crypto can usefully use 6--8 cores in parallel (and I've gone through some pains to achieve good file-level parallelism), though we are sometimes blocked on single slow files (like the 10m GarageDoor file that depends on almost everything else in the repo). I think with 14 GB RAM fiat-crypto could safely be built with -j4 -async-proofs-j 1, which would be pretty good. I think 8 GB RAM might just barely be enough for -j2 -async-proofs-j 1.

view this post on Zulip Maxime Dénès (Nov 17 2022 at 09:17):

Karl Palmskog said:

right, yes, after Maxime added custom runners, we had more OOMs in the Coq opam archive. I think the old ones had at least 7 GB of memory, now it might be something like 4 GB.

Which tags were you using?

AFAICT, the default gitlab shared runner has 3.75 Gb RAM (https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html)

view this post on Zulip Maxime Dénès (Nov 17 2022 at 09:17):

Oh sorry, I see that @Paolo Giarrusso pointed to the same link

view this post on Zulip Maxime Dénès (Nov 17 2022 at 09:18):

So I'm a bit lost: did I fail to reproduce the same setup as before?

view this post on Zulip Théo Zimmermann (Nov 17 2022 at 17:21):

Indeed, for some reason the setup is not the same as before AFAIU, but I don't know why.

view this post on Zulip Théo Zimmermann (Nov 18 2022 at 11:27):

@Maxime Dénès @Gaëtan Gilbert Would it be fine to re-enable shared runners for the Coq repository to boost the PR processing cycle during the release process? I can monitor closely our usage and we can decide upfront a limit that we do not want to go over (e.g., 30%).

view this post on Zulip Gaëtan Gilbert (Nov 18 2022 at 11:58):

do as you like, just tell us when you change something
we were fine burning quite a few percent last month, so 30% is pretty conservative imo

view this post on Zulip Théo Zimmermann (Nov 18 2022 at 11:59):

Done.

view this post on Zulip Théo Zimmermann (Nov 18 2022 at 17:21):

So we are indeed burning CI minutes at a fast rate with shared runners on. We jumped from about 6% use this morning to 26%.

view this post on Zulip Théo Zimmermann (Nov 18 2022 at 21:51):

We've already reached 34% use. I've disabled the shared runners. Apparently, this doesn't stop the current running jobs (which is rather good in this case).

view this post on Zulip Maxime Dénès (Dec 07 2022 at 13:35):

I'm adding more runners

view this post on Zulip Gaëtan Gilbert (Dec 07 2022 at 13:48):

what's our current strategy for jobs that need privileged Docker? https://gitlab.com/coq/coq/-/jobs/3433179724 is stuck

view this post on Zulip Maxime Dénès (Dec 07 2022 at 13:55):

The strategy I suggested was to save our shared runner minutes for them

view this post on Zulip Gaëtan Gilbert (Dec 07 2022 at 14:28):

ok I'm enabling shared runners until they pick up the job then disabling them

view this post on Zulip Gaëtan Gilbert (Dec 07 2022 at 14:29):

it got picked up fast wow

view this post on Zulip Gaëtan Gilbert (Dec 08 2022 at 16:40):

Maxime Dénès said:

I'm adding more runners

where are they coming from btw?

view this post on Zulip Maxime Dénès (Dec 08 2022 at 17:32):

Inria's CloudStack

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 09:07):

We got an extra allowance?

view this post on Zulip Maxime Dénès (Dec 09 2022 at 09:07):

?

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 09:36):

Ah no, these were the 8 stopped runners, right?

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 09:43):

Does it mean that coq-docker-machine-driver should be removed?

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 09:44):

Also, it would be really good to get documentation of this infrastructure, since it looks like it's going to remain like this indefinitely. This would be helpful so that others than you, Maxime, can act when there are maintenance issues.

view this post on Zulip Maxime Dénès (Dec 09 2022 at 09:54):

Théo Zimmermann said:

Does it mean that coq-docker-machine-driver should be removed?

Yes.

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 16:27):

@Maxime Dénès A job failed with "no space left on device". I think that's on one of the new runners: https://gitlab.com/coq/coq/-/jobs/3451330182

