Stream: Coq devs & plugin devs

Topic: CI pipeline minutes running out


view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 14:46):

It's becoming urgent. I just got (and you probably did too) a GitLab email alarmingly stating:

coq namespace has 30% or less Shared Runner Pipeline minutes remaining. Once it runs out, no new jobs or pipelines in its projects will run.

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 14:50):

I think they changed how they count CI usage
counted usage:
usage.png
actual time spent:
time.png
it used to be time = usage * 25 but this month it's time = usage * 2
so now we are running out

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 14:51):

see also https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html#cost-factor

0.5 for public projects in the GitLab for Open Source program.
1 for other public projects, after October 1, 2022 (previously 0.04).
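
For reference, a worked version of the relation between the two graphs above, assuming the coq namespace gets the Open Source program factor:

$$
\text{counted usage} = \text{actual minutes} \times \text{cost factor},
\qquad
\text{so actual} = \text{counted} \times \tfrac{1}{0.04} = \text{counted} \times 25 \text{ before},
\quad
\text{actual} = \text{counted} \times \tfrac{1}{0.5} = \text{counted} \times 2 \text{ now}.
$$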

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 15:04):

we could try to have core devs open forks and run their pipelines there (forks get a 125× factor on 2k free minutes, instead of our 2× factor on 50k free minutes)
the permission juggling could get complex though
we could buy more minutes (we seem to be using about $50/day); I don't know where the money would come from though
we could try moving to gitlab.inria
we could try getting our own machines or VMs somewhere

anyway, at the current rate, CI will break in a couple of days
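
For reference, a hedged reading of the fork numbers above, using the 0.008 fork cost factor mentioned later in this thread:

$$
\text{fork: } \frac{2{,}000 \text{ counted min}}{0.008} = 250{,}000 \text{ actual min},
\qquad
\text{coq namespace: } \frac{50{,}000 \text{ counted min}}{0.5} = 100{,}000 \text{ actual min}.
$$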

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:14):

cc @Maxime Dénès (you probably need to be aware, especially if the solution ends up being buying runners somewhere)

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:15):

I feel stabbed in the back by GitLab, because they did not announce this change of factors for public projects widely enough for us to notice it (they could have sent an email before it was almost too late!).

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:16):

Maybe it's time to consider implementing https://github.com/coq/bot/issues/159 urgently.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:18):

I propose to urgently disable most CI jobs by default (depending on the value of a variable).
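
A minimal sketch of what such a variable gate could look like in .gitlab-ci.yml, assuming the variable is the FULL_CI one that comes up later in this thread (the job name and details below are placeholders, not the actual change):

```yaml
# Hypothetical sketch only: run heavy jobs only when FULL_CI is set to "true".
.full-ci-only:
  rules:
    - if: '$FULL_CI == "true"'   # keep the job when full CI is requested
    - when: never                # otherwise skip it entirely

library:some-heavy-job:          # placeholder job name
  extends: .full-ci-only
  # ...rest of the job definition unchanged
```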

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:18):

Then we can add the necessary support to coqbot to handle this variable automatically.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 15:18):

Is migrating some of the jobs to GitHub Actions a potential solution for the short term?

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:24):

Not unless we get some GitHub sponsorship, IMO; otherwise our CI will quickly be throttled by the 4-parallel-job limit.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:31):

According to the new cost factor, we are allowed 100,000 shared runner minutes. Assuming that we do not change the way opam-coq-archive works and based on the September stats (26,000 minutes for opam-coq-archive, 782,000 minutes for Coq), we should divide our CI consumption by more than a factor 10 on the Coq repository to stay in the new limits.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:32):

Unfortunately, even if we could achieve this instantly, it would have to be even more (like a factor 30) for the month of October, since we've already spent close to 70% of the available minutes in the first three days.
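
For reference, one plausible reading of the arithmetic behind these factors (hedged; the second figure ignores opam-coq-archive's share of what is left):

$$
\frac{782{,}000}{100{,}000 - 26{,}000} \approx 10.6,
\qquad
\frac{782{,}000}{\approx 30{,}000 \text{ min left for October}} \approx 26.
$$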

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 15:37):

:bomb:

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 15:37):

So this is no-CI October then?

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:38):

I actually think it's a realistic target to achieve if we cut the long-running jobs most of the time.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:38):

That's why I was saying:

I propose to urgently disable most CI jobs by default (depending on the value of a variable).

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 15:39):

Urgently, as in now?

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 15:40):

If we consumed 70% in 3 days, that's about 1% per hour.
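
For the record, the rate works out to:

$$
\frac{70\%}{3 \times 24\ \text{h}} \approx 0.97\% \text{ per hour}.
$$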

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:40):

Yes, as in today.

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 15:41):

Who's doing that? I'm personally not CI-aware enough to write a patch, and this is a red alert :rotating_light:

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:41):

I can

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 15:42):

please!

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 15:49):

I propose to merge your PR as soon as its CI finishes.

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 15:50):

(Unironically, that is.)

view this post on Zulip Maxime Dénès (Oct 03 2022 at 15:52):

Do we still use Inria VMs for our CI?

view this post on Zulip Maxime Dénès (Oct 03 2022 at 15:53):

If not I can look into adding some runners there this week

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 15:54):

we don't

view this post on Zulip Maxime Dénès (Oct 03 2022 at 15:54):

the only remaining VM we have is coq-osx1, is it safe to delete?

view this post on Zulip Maxime Dénès (Oct 03 2022 at 15:55):

it eats up a part of our memory quota

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 15:56):

Let's discuss this in a call, but I think we can:

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 15:56):

@Théo Zimmermann , instead of disabling jobs, why don't we tag them so they are only picked by our runners?

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:57):

Maxime Dénès said:

the only remaining VM we have is coq-osx1, is it safe to delete?

It is not. AFAICT, it is still used to store the macOS signing key (Michael downloads it onto his computer to sign packages, then deletes it).

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 15:58):

There are some other options for further optimizing this, as discussed in https://github.com/coq/coq/issues/16201

view this post on Zulip Maxime Dénès (Oct 03 2022 at 15:58):

Ok, noted. We still have enough quota to create some additional runners, though.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 15:58):

Emilio Jesús Gallego Arias said:

Théo Zimmermann , instead of disabling jobs, why don't we tag them so they are only picked by our runners?

If we don't disable most CI jobs now, we will have lost all of our remaining CI minutes. But of course, I'm in favor of deploying our own runners to pick up some of the load as soon as we can realistically do that.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:00):

Maxime Dénès said:

If not I can look into adding some runners there this week

It is not worth it to do so with the VM infra, it is better to use directly the runners implemented in gitlab.inria.fr

view this post on Zulip Maxime Dénès (Oct 03 2022 at 16:00):

Emilio Jesús Gallego Arias said:

Let's discuss this in a call, but I think we can:

do you have a pointer to the branch?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:01):

Théo Zimmermann said:

If we don't disable most CI jobs now, we will have lost all of our remaining CI minutes. But of course, I'm in favor of deploying our own runners to pick up some of the load as soon as we can realistically do that.

Well, either by disabling most of CI or by running out of minutes, we still lose CI so we are in shit.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:01):

Maxime Dénès said:

Emilio Jesús Gallego Arias said:

Let's discuss this in a call, but I think we can:

do you have a pointer to the branch?

Yup, one sec

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:01):

One sec

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:01):

I think that the fastest is to migrate to gitlab.inria.fr

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:01):

provided Inria allows us enough quota

view this post on Zulip Guillaume Melquiond (Oct 03 2022 at 16:02):

Emilio Jesús Gallego Arias said:

It is not worth it to do so with the VM infra, it is better to use directly the runners implemented in gitlab.inria.fr

No, that is not possible. The shared runners available in the Inria Gitlab are just a handful. We will bring the whole infrastructure to its knees if we dump the whole Coq CI on them.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:02):

https://gitlab.inria.fr/egallego/coq/-/pipelines/545950

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:02):

That's the latest run, I think

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:03):

Guillaume Melquiond said:

Emilio Jesús Gallego Arias said:

It is not worth it to do so with the VM infra, it is better to use directly the runners implemented in gitlab.inria.fr

No, that is not possible. The shared runners available in the Inria Gitlab are just a handful. We will bring the whole infrastructure to its knees if we dump the whole Coq CI on them.

@Guillaume Melquiond do you have a pointer to the resource allocation? I have a good idea of the compute power we need; it is actually not that much.

view this post on Zulip Guillaume Melquiond (Oct 03 2022 at 16:03):

The current state of the Inria runners is: "For now there are 8 slots for the small runner, 4 for the medium one and 11 for the large one."

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:04):

But are these real or virtual slots?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:04):

I mean per-project

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 16:04):

Emilio Jesús Gallego Arias said:

https://gitlab.inria.fr/egallego/coq/-/pipelines/545950

404

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:04):

Permission error...

view this post on Zulip Guillaume Melquiond (Oct 03 2022 at 16:05):

No, that is for the whole instance. No more than 11 jobs can run in parallel globally.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:05):

I store private branches there so I can't make ejgallego/coq public, let me see

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:06):

I think that the GitLab CI fallout is anyway a reminder that we have been burning lots of energy without paying enough attention and I honestly believe that the only solution is to be more careful about how we use CI.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:07):

https://github.com/ejgallego/coq/tree/inria-ci that's the last version I tried

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:07):

Note that what I proposed is not to lose our CI but to not run the full version by default at every push on any PR.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:07):

Yes

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:08):

@Gaëtan Gilbert you can push that branch to your own fork on inria CI and it should produce a CI

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:08):

I can't open my own repos right now

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:09):

Théo Zimmermann said:

I think that the GitLab CI fallout is anyway a reminder that we have been burning lots of energy without paying enough attention and I honestly believe that the only solution is to be more careful about how we use CI.

Yes, there is a lot we can optimize actually

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:10):

We can't really blame gitlab for restricting this

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:11):

Free CI can't be unlimited, the economics just can't work

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:11):

Note that even with the tweak to the gitlab files, all PRs currently open and not rebased on master will trigger a full build.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:11):

Emilio Jesús Gallego Arias said:

We can't really blame gitlab for restricting this

No, we can't. The only thing I blame them for is not communicating more clearly a bit ahead of time.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:12):

Pierre-Marie Pédrot said:

Note that even with the tweak to the gitlab files, all PRs currently open and not rebased on master will trigger a full build.

Hum, no. coqbot does a merge with master and pushes that commit to the GitLab mirror.

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:12):

Ah, that's good news in an ocean of trouble, then.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:13):

I'm surprised to hear that our load (which is not so big IMHO) would bring the Inria CI infra to its knees tho

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:13):

I revised my estimations but a single reasonably modern server should handle our load just fine

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:14):

Maybe it's time to whine to the inventeurs du monde numérique?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 16:15):

Do we not have the budget to buy more minutes?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 16:17):

Also moving to GitHub actions could be possible since there is an "enterprise plan" allowing 180 concurrent workers.
https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration

view this post on Zulip Ali Caglayan (Oct 03 2022 at 16:18):

However we would lose the ability to inspect artifacts

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:18):

Pierre-Marie Pédrot said:

Maybe it's time to whine to the inventeurs du monde numérique?

For sure if not for what Guillaume said, that would be a very good solution, as gitlab.inria.fr would be a drop-in replacement for us

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:18):

so we would just need to push my branch to Coq, and update the bot to link to gitlab.inria.fr

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:18):

but indeed we need to whine^W discuss with the people in charge

view this post on Zulip Karl Palmskog (Oct 03 2022 at 16:19):

I understand this is way down the priority ladder, but this affects the Coq opam archive CI too, right?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:19):

@Gaëtan Gilbert do you mind pushing the inria-ci branch to your own gitlab.inria.fr fork so we can get an Inria run

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:19):

?

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 16:20):

what's the point?
I think everyone believes that it works

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:20):

Not @Guillaume Melquiond it seems. I'd be curious to see if the prophecy is true. 2022/10/03, aka "The day the Inria Gitlab stood still".

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:21):

Karl Palmskog said:

I understand this is way down the priority ladder, but this affects the Coq opam archive CI too, right?

Yes, but the opam-coq-archive CI is not so big. If we were to remove or sufficiently scale down the Coq CI, then opam-coq-archive could continue its business-as-usual.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:21):

Gaëtan Gilbert said:

what's the point?
I think everyone believes that it works

I think everybody believes that it works at a small scale. The question is just whether it will handle the full scale.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:22):

But it is easy enough to test actually.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:22):

We can just create an Inria GitLab mirror for Coq now and double-push every PR to both the current repo and the new one (and set FULL_CI=true over there).

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:22):

Then we can see if we bring Inria's infrastructure down to its knees or not.

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:23):

The Inria admins will surely appreciate.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:23):

Gaëtan Gilbert said:

what's the point?
I think everyone believes that it works

No, it doesn't fully work, we hit some limits on artifact size

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:24):

A single push will work modulo the artifact size problem, but of course we should talk to the responsible person before pushing dozens of jobs

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:25):

The question is: what is the least amount of work? Is talking with the Inria admins about the CI the quickest path?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:25):

I don't know

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:26):

I think exploring multiple routes in parallel is probably the safest path.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:26):

also more costly tho

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:28):

@Théo Zimmermann Don't you also want to disable the test-suite+async job? This one never terminates on time and takes incompressible hours for no good reason.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:30):

Have we done some math to understand if this disable CI work is gonna work?

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:30):

No time for maths, this is the blinking red STOP button.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:30):

I am afraid that in the end it is just wasted work if, even after the savings, it doesn't work out for us and we are left without CI

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:30):

Unfortunately, I don't have precise enough data to do this math, but from some plots by @Jason Gross that I have seen, I think this should work.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:31):

This is the blinking STOP button, but imagine we are not pressing the right STOP button but the AUTO-DESTRUCT one

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:31):

Emilio Jesús Gallego Arias said:

I am afraid that in the end it is just wasted work if, even after the savings, it doesn't work out for us and we are left without CI

This was some work that I wanted to do anyway. I just didn't want to do it now.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:31):

so we'd better read the labels on the buttons carefully before pushing

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:31):

Recall that this issue https://github.com/coq/bot/issues/159 was opened in 2021 after a discussion that happened in 2020.

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:31):

Any button is good to press when you're bound to fall into the void in a split second.

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:32):

Merge first, talk later.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:32):

Théo Zimmermann said:

Emilio Jesús Gallego Arias said:

I am afraid that in the end it is just wasted work if, even after the savings, it doesn't work out for us and we are left without CI

This was some work that I wanted to do anyway. I just didn't want to do it now.

Yes, trying to figure out what to build is a very interesting topic; we have been discussing it in Coq-Universe too.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:33):

Pierre-Marie Pédrot said:

Any button is good to press when you're bound to fall into the void in a split second.

Yes, but are we bound to fall into the void in a split second? I don't see such urgency here.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:33):

We all agree that we don't want to drop CI for merges to main, right?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:34):

If so, let's keep calm and try to assess what our best path is.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:35):

Using FULL_CI + the bot could be a good strategy, or maybe it doesn't work because we will still run out of minutes

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:35):

gitlab.inria.fr seems way more promising but we need to understand if Inria can provide us the right capacity

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:36):

Deploying our own runners + tagging jobs could be a good solution, but I'm not sure if that's possible / what the status of the servers is

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 16:36):

Acting in panic rarely leads to a good outcome

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:38):

Emilio Jesús Gallego Arias said:

Using FULL_CI + the bot could be a good strategy, or maybe it doesn't work because we will still run out of minutes

I don't know if it will be a good long-term strategy or not but it buys us some time to consider which strategy is best.

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:40):

Honestly I don't see what could go wrong with deactivating everything. Thanks to the magic of VCS, worst case we can revert or tweak the result.

view this post on Zulip Pierre-Marie Pédrot (Oct 03 2022 at 16:41):

But really, the thing we don't have right now is time. If 1% goes off every hour we're really screwed.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:44):

It's not really 1% every hour, BTW: if someone opens a PR now, it will probably cost more than 1%.

view this post on Zulip Karl Palmskog (Oct 03 2022 at 16:46):

in that case, maybe consider disallowing PRs via GitHub for a while rather than have people open them by "mistake"

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:47):

BTW, I just computed how much it would cost to pay for the minutes we need without changing anything. That would be about €3500/month. Technically, we could pay for one month until we figure this out. That's still a lot of money to throw away, I think.
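
A hedged reconstruction of that figure, assuming GitLab's advertised price of $10 per 1,000 additional CI minutes and our ~800,000 actual minutes per month:

$$
800{,}000 \times 0.5 = 400{,}000 \text{ counted min},
\qquad
400{,}000 - 50{,}000 \text{ included} = 350{,}000,
\qquad
350 \times \$10 = \$3{,}500 \approx €3{,}500.
$$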

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:49):

Karl Palmskog said:

in that case, maybe consider disallowing PRs via GitHub for a while rather than have people open them by "mistake"

No need now: https://github.com/coq/coq/pull/16600 was merged.

view this post on Zulip Ali Caglayan (Oct 03 2022 at 16:50):

I found this announcement from last year: https://about.gitlab.com/blog/2021/11/11/public-project-minute-limits/

view this post on Zulip Ali Caglayan (Oct 03 2022 at 16:50):

Did something change on top of this?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 16:50):

I thought Coq would be exempt from the minute counting.

view this post on Zulip Guillaume Melquiond (Oct 03 2022 at 16:55):

Pierre-Marie Pédrot said:

Not Guillaume Melquiond it seems. I'd be curious to see if the prophecy is true. 2022/10/03, aka "The day the Inria Gitlab stood still".

Let us be clear. There is currently a single 64-core server that is used for the whole Inria CI. It won't be long before we are forbidden to use it, if we are constantly hammering it with jobs from Coq's CI.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:57):

@Ali Caglayan See https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/Too.20many.20pipelines/near/302079565

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:57):

This clearly changed on Oct. 1st

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 16:57):

Although we were not aware of this change having been announced.

view this post on Zulip Ali Caglayan (Oct 03 2022 at 16:58):

Can we contact GitLab since this is somewhat absurd?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 17:02):

It could even be that an engineer accidentally omitted the 5.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 17:03):

They have updated the doc, so it looks like a change on purpose.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 17:03):

But sure, if there were no announcements, we should probably contact them to say that we need time to adjust.

view this post on Zulip Théo Zimmermann (Oct 03 2022 at 17:04):

I personally haven't researched if there were any announcement or twitter discussion about this.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 17:07):

We would only max out a 64-core server on special weeks, for the rest that'd be fine

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 17:09):

Pierre-Marie Pédrot said:

Honestly I don't see what could go wrong with deactivating everything. Thanks to the magic of VCS, worst case we can revert or tweak the result.

Not sure I understand what you propose here. Deactivating everything means we don't have CI, so who's going to do the testing / revert when a PR lands and breaks everything?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 03 2022 at 17:09):

If I understand correctly, as the PR has been merged, we should block PR merges too?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 17:12):

There is a "talk to an expert" button on https://about.gitlab.com/sales/. Can somebody ask them what is going on?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 17:13):

something like this: (I won't send it)

Hi, you seem to have changed how the minutes for the runners are being calculated. The doc also reflects this change.

I find this somewhat absurd, as there was no announcement of said change and it is heavily affecting our development CI pipeline.

Since the Coq organization is paying for the runner services, it seems somewhat predatory that the calculation of minutes can be changed like this.

Please clarify the situation.

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 17:46):

we're not paying afaik

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:24):

Like @Guillaume Melquiond, I'm not confident at all that Inria's Gitlab CI will handle the load, especially if several large projects migrate simultaneously

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:26):

What I would do, is in parallel:

I can initiate both.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:34):

I think we can sequence the operations:

  1. Restore our current CI by adding workers
  2. Improve our usage of resources and targets (better coqbot workflow, gitlab.inria.fr target, etc.)

I agree it's better not to rush 2 too much. Doing a bit of temporary work for 1 does not seem problematic to me.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:37):

Yes, we need to address it. Still, TV news are not mentioning that we have CI issues, so I guess we can take the time to do things in order :)

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:41):

Let's create a channel for communicating the actions that we take on this CI issue. We will use this one to discuss the situation and what should be done.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:44):

Was there any communication to Coq devs about the process they should currently follow to merge PRs?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 18:55):

Should I refrain from pushing to PRs for the time being?

view this post on Zulip Maxime Dénès (Oct 03 2022 at 18:58):

@Gaëtan Gilbert Should @Ali Caglayan rebase his PRs to get only the reduced CI or not?

view this post on Zulip Ali Caglayan (Oct 03 2022 at 18:59):

The PR I am thinking about already has a lot of the jobs removed, but I am unsure how severe the situation is today. I don't want to use up resources that could be better used for others.

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 19:14):

Maxime Dénès said:

Gaëtan Gilbert Should Ali Caglayan rebase his PRs to get only the reduced CI or not?

no need to rebase, coqbot auto merges

view this post on Zulip Gaëtan Gilbert (Oct 03 2022 at 19:15):

Ali Caglayan said:

Should I refrain from pushing to PRs for the time being?

push on, it will help tell us if we reduced enough

view this post on Zulip Ali Caglayan (Oct 03 2022 at 19:30):

Looks reduced https://gitlab.com/coq/coq/-/pipelines/657296145

view this post on Zulip Michael Soegtrop (Oct 03 2022 at 19:52):

I came a bit late to this thread. I guess Coq Platform is not affected since I am using only GitHub actions?

view this post on Zulip Maxime Dénès (Oct 03 2022 at 19:53):

I guess so, this seems to be a gitlab.com policy change.

view this post on Zulip Erik Martin-Dorel (Oct 03 2022 at 19:57):

That's very bad news :-(

BTW, regarding the docker-coq infrastructure in coq-community: the docker-keeper tooling specifically relies on GitLab CI.
Migrating to GitHub Actions is just not feasible (e.g. YAML anchors are not supported by GHA, nor is the generation of child pipelines from a dynamic YAML…)

view this post on Zulip Erik Martin-Dorel (Oct 03 2022 at 19:57):

Fortunately, the auto-build/deployment of coqorg/* images should not be as costly in CI/CD minutes as PRs pipelines.
But I cannot check the quotas myself in https://gitlab.com/groups/coq-community/-/usage_quotas#pipelines-quota-tab as only @Théo Zimmermann can → you may want to take a look at the row for coq-community / docker-coq, in order to extrapolate

view this post on Zulip Ali Caglayan (Oct 03 2022 at 20:00):

Here is the GitLab issue with the changes: https://gitlab.com/gitlab-org/gitlab/-/issues/337245

view this post on Zulip Erik Martin-Dorel (Oct 03 2022 at 20:27):

I googled a bit more and, according to this blog article https://about.gitlab.com/blog/2021/11/11/public-project-minute-limits/ that I had not spotted before, this drastic change is intended to fight cryptomining via CI pipelines.
But it's a pity that FOSS projects like Coq, which consume much more pipeline time than the average project, are badly impacted by these changes… Maybe reach them at the email opensource@gitlab.com mentioned in https://about.gitlab.com/handbook/marketing/community-relations/opensource-program/ just to let them know (?)
my 2 cents…

view this post on Zulip Erik Martin-Dorel (Oct 03 2022 at 20:33):

Finally FYI, it seems that there's another thing to double-check now:
the Namespace Storage Quota: https://gitlab.com/groups/math-comp/-/usage_quotas#storage-quota-tab

as according to https://docs.gitlab.com/ee/user/usage_quotas.html#namespace-storage-limit-enforcement-schedule

Storage limits for GitLab SaaS Free tier namespaces will not be enforced prior to 2022-10-19. Storage limits for GitLab SaaS Paid tier namespaces will not be enforced prior to 2023-02-15.

IINM, coq and coq-community are not impacted for now, unlike math-comp, which is still in the Free Tier…
(see e.g. this topic)

view this post on Zulip Michael Soegtrop (Oct 03 2022 at 20:35):

I don't quite see how this change keeps people from cryptomining in CI if forks remain at 0.008 - who prevents people from forking their own projects? I guess one could find more intelligent methods for preventing such abuse.

view this post on Zulip Michael Soegtrop (Oct 03 2022 at 20:37):

Maybe it is possible to talk to them and get an exception.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 23:01):

Ok, I got something automated enough (snapshots + Ansible playbook). I should be able to add ~8 runners tomorrow.

view this post on Zulip Maxime Dénès (Oct 03 2022 at 23:03):

Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?

view this post on Zulip Guillaume Melquiond (Oct 04 2022 at 00:46):

Michael Soegtrop said:

Maybe it is possible to talk to them and get an exception.

I very much doubt so. Our quota of 50'000 minutes corresponds to their "ultimate" offer, which normally costs $99 per user and which we are currently getting for free. There is not even a higher tier for paying customers. So, I don't see how they could provide an even higher tier for non-paying customers. Also keep in mind that their "free" offer is just 400 minutes, so we are currently getting 125x that. In other words, we are very lucky and we should not push our luck.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:02):

Maxime Dénès said:

Was there some communication done to coq devs about the process they should currently follow to merge PRs?

Communication happened via this message: https://github.com/coq/coq/pull/16600#issuecomment-1265773392. I know it's not ideal, but any dev with push rights should have gotten it.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:02):

Gaëtan Gilbert said:

that's actually a setting https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html#keep-artifacts-from-most-recent-successful-jobs so maybe we should unset it

Yes, we should definitely use it. Did something change recently? It didn't use to be the case that old PRs had their artifacts kept.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:02):

Erik Martin-Dorel said:

Fortunately, the auto-build/deployment of coqorg/* images should not be as costly in CI/CD minutes as PRs pipelines.
But I cannot check the quotas myself in https://gitlab.com/groups/coq-community/-/usage_quotas#pipelines-quota-tab as only Théo Zimmermann can → you may want to take a look at the row for coq-community / docker-coq, in order to extrapolate

The entire coq-community organization's record CI consumption was 18,000 minutes, in Feb 2022. We are allowed 100,000. So no, Docker-Coq will not be affected. The case of the Coq project is really special since we are spending so much.
Regarding storage use, it is 6.4 GiB for coq-community, including 3.41 GiB for Docker-Coq artifacts and 2.8 GiB for the coq-performance-tests repository (we could stop mirroring it there if needed).

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:03):

Guillaume Melquiond said:

Michael Soegtrop said:

Maybe it is possible to talk to them and get an exception.

I very much doubt so. Our quota of 50'000 minutes corresponds to their "ultimate" offer, which normally costs $99 per user and which we are currently getting for free. There is not even a higher tier for paying customers. So, I don't see how they could provide an even higher tier for non-paying customers. Also keep in mind that their "free" offer is just 400 minutes, so we are currently getting 125x that. In other words, we are very lucky and we should not push our luck.

The only thing that might be reasonable to ask is for a delay in order to resize our CI infrastructure.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:03):

Emilio Jesús Gallego Arias said:

We would only max out a 64-core server on special weeks, for the rest that'd be fine

Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average. Given our memory consumption, it's not clear that 18.5 runners would fit on a single 64-core server (it would need lots of RAM) and this is not even the usage peak, this is the average over a full month, days and nights.
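
Spelled out:

$$
\frac{800{,}000 \text{ min}}{60 \times 24} \approx 555 \text{ runner-days},
\qquad
\frac{555}{30 \text{ days}} \approx 18.5 \text{ runners busy around the clock}.
$$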

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:03):

Maxime Dénès said:

Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?

Why should we disable shared runners in case you add VMs without tags? BTW, 8 runners won't be sufficient to entirely replace our shared runner usage if we go back to our full CI according to my calculation above.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 07:06):

Ok, so we'd need a combination of reduced CI + self hosted runners, right?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 07:07):

(and studying a migration to gitlab.inria.fr in parallel, that is talking with the guys about the capacity of their service)

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:12):

Yes, I think this is a good summary.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:48):

Théo Zimmermann said:

Emilio Jesús Gallego Arias said:

We would only max out a 64-core server on special weeks, for the rest that'd be fine

Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average. Given our memory consumption, it's not clear that 18.5 runners would fit on a single 64-core server (it would need lots of RAM) and this is not even the usage peak, this is the average over a full month, days and nights.

Inria runners are considerably faster, so the amount of minutes would be less.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:49):

Also, a few jobs are responsible for 60% of the build time (see the analysis from when we discussed removing the async job)

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:50):

Coq Universe provides a good reference, IMHO, for how big a full CI is: around 1h on an 8-core machine with 32 GiB of RAM

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:50):

That's without caching

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:51):

in a Github actions runner, the vo build will take like 2.5 hours

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:51):

with just 2 cores

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:51):

with caching it varies

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:51):

Emilio Jesús Gallego Arias said:

Also, a few jobs are responsible for 60% of the build time (see the analysis from when we discussed removing the async job)

If you have such data available, I'm very interested to look at them.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:51):

The math is easy, look at the jobs and sort by time

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:53):

How do you do that?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:53):

The original thread is https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/CI.20for.20base.2Basync.20timing.20out

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 07:56):

Thanks! How did you get these figures? I can't seem to figure out how to extract them for a GitLab pipeline.

Total CI time without the heavy ones: 665
Total CI time with (unimath, vst, and fiat-crypto): 1155

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:59):

I just cut and paste + emacs macro

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 07:59):

link

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:01):

https://hackmd.io/PGrScXsoQVSLckPu3CXvIw

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:02):

Give me any pipeline and I will produce the table

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 08:02):

Maxime Dénès said:

Question: should I put them with tags (in which case we have to update the jobs) or without (in which case we disable shared runners)?

no tags imo

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:04):

Emilio Jesús Gallego Arias said:

Give me any pipeline and I will produce the table

So indeed we can optimize a lot, for example if we use dune, checking compilation of Coq ML part + all plugins is just a couple of minutes

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:04):

maybe 3, 4 max (depends on the runner)

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:05):

In my machine 22 secs

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 08:14):

the lint job does that and in your table it took 03:37

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 08:15):

it also does a bit of other stuff (whitespace checks etc)

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:41):

It took 3:37 total

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:42):

Boot time, checkout etc...

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:42):

I'm talking about the actual command only

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:44):

There is a lot of variability in boot time too

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:44):

in runner boot time

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 08:46):

But total cpu time is almost 4 minutes

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:17):

We should also really do something about the test-suite jobs running forever because they start swapping or whatever.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:17):

In this reduced pipeline, the base test-suite job was running for over 90 minutes before I killed it: https://gitlab.com/coq/coq/-/pipelines/657639626/builds

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:18):

Can you run your macros on this pipeline anyway?

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:19):

I'd like a comparison with the FULL_CI ones.

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 09:28):

Should we write a PR to get some logging information about the test-suite taking ages?

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 09:28):

Cancelling them after a timeout is a workaround, but we should get to the root of the issue.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 09:29):

Some have hypothesized swapping. Could we auto-detect if swapping happens and kill the test-suite immediately in that case?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:48):

@Théo Zimmermann

| Status | Job | Stage | Name | Duration |
|--------|-----|-------|------|----------|
| passed | #3122706083 pr-16601 0183e2d4 | build | plugin:plugin-tutorial |  00:05:30 |
| passed | #3122706082 pr-16601 0183e2d4 | build | validate:vio |  00:02:32 |
| passed | #3122706080 pr-16601 0183e2d4 | build | validate:base |  00:04:45 |
| passed | #3122706079 pr-16601 0183e2d4 | build | test-suite:edge:dev |  00:25:58 |
| canceled | #3122706078 pr-16601 0183e2d4 | build | test-suite:base |  01:31:22 |
| passed | #3122706075 pr-16601 0183e2d4 | build | doc:ml-api:odoc |  00:04:11 |
| passed | #3122706070 pr-16601 0183e2d4 | build | doc:stdlib |  00:04:29 |
| passed | #3122706064 pr-16601 0183e2d4 | build | doc:refman-pdf |  00:11:08 |
| passed | #3122706061 pr-16601 0183e2d4 | build | doc:refman |  00:07:42 |
| passed | #3122706058 pr-16601 0183e2d4 | build | pkg:nix |  00:21:46 |
| passed | #3122706055 pr-16601 0183e2d4 | build | lint |  00:05:02 |
| passed | #3122706051 pr-16601 0183e2d4 | build | build:vio |  00:25:44 |
| passed | #3122706049 pr-16601 0183e2d4 | build | build:edge+flambda:dev |  00:18:48 |
| passed | #3122706046 pr-16601 0183e2d4 | build | build:base | 00:17:46 |

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:48):

oh sorry let me sort it

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:49):

| Status   | Job                           | Stage | Name                   | Duration |
|----------+-------------------------------+-------+------------------------+----------|
| canceled | #3122706078 pr-16601 0183e2d4 | build | test-suite:base        | 01:31:22 |
| passed   | #3122706079 pr-16601 0183e2d4 | build | test-suite:edge:dev    | 00:25:58 |
| passed   | #3122706051 pr-16601 0183e2d4 | build | build:vio              | 00:25:44 |
| passed   | #3122706058 pr-16601 0183e2d4 | build | pkg:nix                | 00:21:46 |
| passed   | #3122706049 pr-16601 0183e2d4 | build | build:edge+flambda:dev | 00:18:48 |
| passed   | #3122706046 pr-16601 0183e2d4 | build | build:base             | 00:17:46 |
| passed   | #3122706064 pr-16601 0183e2d4 | build | doc:refman-pdf         | 00:11:08 |
| passed   | #3122706061 pr-16601 0183e2d4 | build | doc:refman             | 00:07:42 |
| passed   | #3122706083 pr-16601 0183e2d4 | build | plugin:plugin-tutorial | 00:05:30 |
| passed   | #3122706055 pr-16601 0183e2d4 | build | lint                   | 00:05:02 |
| passed   | #3122706080 pr-16601 0183e2d4 | build | validate:base          | 00:04:45 |
| passed   | #3122706070 pr-16601 0183e2d4 | build | doc:stdlib             | 00:04:29 |
| passed   | #3122706075 pr-16601 0183e2d4 | build | doc:ml-api:odoc        | 00:04:11 |
| passed   | #3122706082 pr-16601 0183e2d4 | build | validate:vio           | 00:02:32 |

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:49):

I'd cut the builds except for flambda + the flambda test suite

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:49):

build:vo and build:base

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:50):

and the validate jobs

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:50):

are a waste of time in a minimal CI

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 11:50):

and instead I'd add the composed plugin build

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 11:52):

This is somewhat making the CI useless to me at least. I can run the test-suite locally faster than what the CI provides.

view this post on Zulip Matthieu Sozeau (Oct 04 2022 at 11:53):

I'm just finding out about this. Shall I try to contact the folks at GitLab Open Source to tell them about our situation, then?

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 11:56):

Sure, this is worth trying to at least alert them that we would appreciate some additional delay to sort things out.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 12:09):

Pierre-Marie Pédrot said:

This is somewhat making the CI useless to me at least. I can run the test-suite locally faster than what the CI provides.

I'm not sure I follow, PM; of course local computers are faster than the free VMs

view this post on Zulip Emilio Jesús Gallego Arias (Oct 04 2022 at 12:09):

Tho gitlab.inria large instances are much faster

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 12:13):

@Emilio Jesús Gallego Arias my typical use of the CI is to run stuff that is annoying to run. If we only keep a tiny subset of the CI, then it becomes trivial, because I always run the test-suite locally. By contrast, running all the other tests is typically bothersome because you need some setup, a network connection and parallel execution (if you go for the full CI with many devs).

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 12:13):

The question is not whether the free VMs are slow; the question is the trade-off between the time my PC spends doing something vs. some stuff running overnight on the internet.

view this post on Zulip Pierre-Marie Pédrot (Oct 04 2022 at 12:14):

For a CI I can run in less than 10 minutes on my PC, this is not very clear.

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:18):

I think we want to get some private runners (or use the gitlab.inria ones if there are enough) then put back most or all the jobs, possibly merging jobs if that exploits runner cpus better (ie coq-universe job instead of a bunch of jobs)

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:19):

I see the current job disabling as a temporary measure to avoid breaking completely and to keep opam-coq-archive, which did nothing wrong, working

view this post on Zulip Matthieu Sozeau (Oct 04 2022 at 12:20):

I sent a mail with a plea for a grace period, let's see what they say.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:42):

I added 8 runners, you may want to put back some jobs.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:44):

I will now initiate a discussion with our services about the capacity and security of gitlab.inria.fr.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:45):

Let's not hurry in putting back some jobs / let's check how our usage evolves. Since the discussion started less than 24h ago, we have still spent about 8% of our quota.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:45):

It is not obvious that we can use it, because the current model is allowing only sponsored Inria accounts, whereas what we want to do is essentially to mirror arbitrary workloads.

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:46):

Maxime Dénès said:

It is not obvious that we can use it, because the current model is allowing only sponsored Inria accounts, whereas what we want to do is essentially to mirror arbitrary workloads.

What do you mean?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:46):

Aren't we exposing code execution on our runners to anybody?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:47):

Anyway let's discuss these aspects privately.

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:50):

what happened to coq-gitlab-runner2.ci ?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:51):

ah ah I was waiting for someone to ask

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:51):

are you sure you want to know?

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:52):

yes

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:53):

should we try disabling shared runners, enable full ci and see how big the queue gets?

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:53):

Sure, that's a good experiment to make!

view this post on Zulip Théo Zimmermann (Oct 04 2022 at 12:54):

And it's also easy to do: no PR to merge, just a variable to set.

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:54):

the CloudStack Inria API doesn't let you destroy resources completely, you have to wait for a 24h cron to do it

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:54):

I created a runner numbered 2 then deleted it, so I can't reuse the name yet

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:54):

similar pb for storage

view this post on Zulip Maxime Dénès (Oct 04 2022 at 12:55):

I don't know why they put this Jenkins-related logic in their Cloud backend, it is a very bad separation of concerns

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:55):

Théo Zimmermann said:

Sure, that's a good experiment to make!

let's go then

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 12:58):

@Maxime Dénès how beefy are the vms you added? should we increase the -j we use?

view this post on Zulip Maxime Dénès (Oct 04 2022 at 13:02):

they have 2 vCPUs, 4 Gb RAM

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 13:02):

already queuing in 1 pr https://gitlab.com/coq/coq/-/pipelines/657973015

view this post on Zulip Maxime Dénès (Oct 04 2022 at 13:02):

it was the closest I could find to the instances used by gitlab's shared runners

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 13:02):

current -j2 stays then

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 16:16):

the runners don't allow ulimit so all the scripts using it are failing, eg https://gitlab.com/coq/coq/-/jobs/3124630681

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 16:23):

also https://gitlab.com/coq/coq/-/jobs/3124630609 (ocamlopt fails in schemeequality.v) is interesting considering how that test kept hanging

view this post on Zulip Pierre Roux (Oct 04 2022 at 16:46):

If we need them to go through, we could deactivate native compilation, that was the only reason for ulimit IIRC.

view this post on Zulip Guillaume Melquiond (Oct 04 2022 at 16:59):

Gaëtan Gilbert said:

the runners don't allow ulimit

No, they do allow ulimit, but we have forbidden it. We need to modify .gitlab-ci.yml line 200 and add option -S.
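
A hedged illustration of the difference (flag and value below are placeholders; the actual fix is the PR linked in the next message):

```yaml
# Hypothetical illustration only.
before_script:
  # "ulimit -s <n>" tries to change both the soft and the hard stack limit,
  # which an unprivileged runner may refuse; "-S" touches only the soft limit.
  - ulimit -S -s 16384   # placeholder value
```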

view this post on Zulip Gaëtan Gilbert (Oct 04 2022 at 17:00):

https://github.com/coq/coq/pull/16612

view this post on Zulip Olivier Laurent (Oct 04 2022 at 19:59):

What should be done with ongoing PRs (in particular those ready to merge)? wait a bit that solutions are settled for CI troubles?
I am not sure I understand what is going on. Is https://gitlab.com/coq/coq/pipelines/new?var[FULL_CI]=true&ref=pr-XXXXX to be run or not?

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 07:15):

@Olivier Laurent at the moment the full CI pipelines are on again by default, but that may not last. If you want to merge a PR, just check that the GitLab pipeline status check has the usual long list of jobs. If yes, it can be merged. If not, it should get a full CI run first.

view this post on Zulip Thierry Martinez (Oct 05 2022 at 08:41):

Maxime Dénès said:

the CloudStack Inria API doesn't let you destroy resources completely, you have to wait for a 24h cron to do it

If there is a need to go faster, you can ask a CI admin to destroy the resource before the cron job runs (but, of course, it only works if a CI admin is available... I was attending a training course the last two days).

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:04):

Théo Zimmermann said:

Our peaks are at 800,000 shared runner minutes per month, so 555 days, so 18.5 runners being used all the time on average.

After reading the documentation and double-checking on the usage webpage, our CI actually uses 1,600,000 minutes per month, but because we are an open source project, they only count for half of it: "The cost factors for jobs running on shared runners on GitLab.com are 0.5 for public projects in the GitLab for Open Source program." So, this amounts to 37 full-time runners on average.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:11):

@Guillaume Melquiond I think you've misunderstood how the usage is reported. There are two tabs: CI/CD minute usage and Shared runner duration. The latter is the real one and the former is how it is counted. Thus, in October we have used 80,000 actual shared runner minutes and they have counted as 40,000. In September, the cost factor was different so we have used 800,000 actual shared runner minutes and they have counted as 32,000.
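
In other words, consistent with the two cost factors:

$$
80{,}000 \times 0.5 = 40{,}000 \text{ (October)},
\qquad
800{,}000 \times 0.04 = 32{,}000 \text{ (September)}.
$$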

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:13):

Our peak was in March: 820,000 actual shared runner minutes.

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:14):

Right, my mistake. So, we are only using 800,000 minutes, but we still need to divide this usage by 8.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:15):

And for the month of October, this is even worse since we have already spent 80% of it.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:15):

So we should bring it down to basically zero on the rest of the month (or buy more minutes) if we want the opam-coq-archive CI to remain unaffected.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:16):

Disabling the shared runners on the Coq repository (like we did) is a good way of ensuring this.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:18):

Currently, there is no queue in the Coq CI despite the full CI being on and the shared runners disabled, but that may be related to the ulimit issue (maybe it makes jobs fail and thus stop quicker).

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:18):

Regarding opam-coq-archive, I am in the process of cutting part of its CI down, because even if it is a few orders of magnitude less than Coq itself, it is still blowing up our quota.

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:19):

(Almost 20% of the quota after just 5 days.)

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:19):

We could also consider migrating the opam-coq-archive CI to GitHub Actions. Do you think that it would be a realistic option?

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:21):

No idea, I don't understand how GitHub Actions works. (And there is still a quota for those, isn't there?)

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:25):

No, AFAIU there is no limit to how many CI minutes you can use. There is a limit to how many parallel runners you get (4 IIRC). So it is the latter that would make it impossible to use for Coq's CI without additional GitHub sponsorship.

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:25):

Ah no, it seems those are unlimited, as long as we don't ask for more than 2 cores.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:26):

"those" being?

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:27):

GitHub Actions. I was correcting myself.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:27):

https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits seems to indicate that the actual limit is 20 parallel runners on Linux / Windows and 5 on macOS.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:28):

They do not say the scope, but I think it's probably per organization.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 12:29):

A good thing about GitHub Actions is that you get a 6-hour timeout per job, so that's better than what we have on GitLab with the shared runners.
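
For illustration, here is a minimal GitHub Actions workflow sketch showing the two limits being discussed (the per-job timeout, which is 6 hours on hosted runners, and a cap on parallel jobs); the workflow name, matrix entries, and build step are placeholders, not the actual opam-coq-archive setup:

  # .github/workflows/ci.yml -- minimal sketch, not an actual Coq or opam-coq-archive workflow
  name: CI
  on: [push, pull_request]
  jobs:
    build:
      runs-on: ubuntu-latest
      timeout-minutes: 360            # the 6-hour per-job maximum on hosted runners
      strategy:
        max-parallel: 4               # cap concurrent jobs, cf. the runner limits above
        matrix:
          coq: ['8.16', 'dev']        # placeholder matrix entries
      steps:
        - uses: actions/checkout@v3
        - run: echo "build with Coq ${{ matrix.coq }}"   # placeholder build step

(timeout-minutes already defaults to 360 on hosted runners; it is written explicitly here only to document the limit.)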

view this post on Zulip Maxime Dénès (Oct 05 2022 at 12:35):

Thierry Martinez said:
If there is a need to go faster, you can ask a CI admin to destroy the resource before the cron job runs (but, of course, that only works if a CI admin is available... I was attending a training course the last two days).

So you confirm that this 24h delay exists? I received inconsistent answers on the matter. Why bake this CI-related logic into the Cloud backend instead of letting us immediately destroy resources that we created ourselves?

view this post on Zulip Maxime Dénès (Oct 05 2022 at 12:37):

The whole point of CloudStack-like tools is to not have to open tickets for routine operations :)

view this post on Zulip Maxime Dénès (Oct 05 2022 at 12:38):

I understand that ci.inria.fr has this kind of business logic, but I'm very surprised that it ends up as restrictions on the CloudStack side.

view this post on Zulip Guillaume Melquiond (Oct 05 2022 at 12:51):

I can confirm that, a few months ago, this delay did exist. (I had to remove a VM because I was over-quota. The VM removal happened instantly, but the associated storage took 24 hours to disappear.)

view this post on Zulip Pierre-Marie Pédrot (Oct 05 2022 at 13:23):

has coqbot fallen victim to the same worker depletion or what? it doesn't react on https://github.com/coq/coq/pull/16611

view this post on Zulip Pierre-Marie Pédrot (Oct 05 2022 at 13:29):

maybe it's on strike

view this post on Zulip Pierre-Marie Pédrot (Oct 05 2022 at 13:32):

@Théo Zimmermann have you deactivated coqbot to prevent people merging carelessly w.r.t. the non-full CI?

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:43):

No, I didn't do anything.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:43):

I'll have a look :ambulance:

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:45):

That's a GitHub issue: https://www.githubstatus.com/

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:45):

Webhooks are not being delivered correctly.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:45):

So coqbot never received these messages.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 13:46):

You can insist (some issue comments did trigger the webhook, but I don't know what the current success rate is) or wait for the webhooks to be delivered (they might arrive with some delay).

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 16:09):

Folks, one problem with disabling the shared runners (vs. tagging the jobs for our workers) is that we can't generate Docker images anymore.

view this post on Zulip Théo Zimmermann (Oct 05 2022 at 16:20):

Yes, this was an experiment, but since it seems like we are going to keep this setup for a while, we should set up a tag-based system that keeps the shared runners available, but for Docker images only.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 16:54):

I'm gonna enable them briefly so my docker image is rebuilt

view this post on Zulip Maxime Dénès (Oct 05 2022 at 21:25):

It looks like the queue of jobs is increasing...

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 22:42):

Where do you see the queue?

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 22:45):

Here of course https://gitlab.com/coq/coq/-/jobs ...

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 22:45):

I guess we will be back a bit to the times when we needed the night for CI to catch up.

view this post on Zulip Maxime Dénès (Oct 06 2022 at 07:00):

Once we can estimate our needs, we should be able to order runners on an external Cloud provider if needed.

view this post on Zulip Maxime Dénès (Oct 06 2022 at 07:00):

But I also find it good to try to reduce our usage a bit, because of environmental concerns.

view this post on Zulip Maxime Dénès (Oct 06 2022 at 07:00):

(at least the lowest-hanging fruit, like what Théo is implementing in coqbot)

view this post on Zulip Théo Zimmermann (Oct 06 2022 at 07:52):

It looks like the queue is still big after the night.

view this post on Zulip Théo Zimmermann (Oct 06 2022 at 07:53):

Hopefully, it will catch up now that we are running light CI pipelines by default on PRs.

view this post on Zulip Maxime Dénès (Oct 07 2022 at 14:11):

Btw @Thierry Martinez kindly improved the configuration of Inria's CloudStack, so there's hope we can make quick progress on the way these new runners are orchestrated.

view this post on Zulip Gaëtan Gilbert (Oct 10 2022 at 15:04):

Maxime Dénès said:

Ok, I got something automated enough (snapshots + Ansible playbook). I should be able to add ~8 runners tomorrow.

is your automation public?

view this post on Zulip Maxime Dénès (Oct 10 2022 at 15:17):

That one, no, but I'm almost done with a better one that properly orchestrates ephemeral VMs; it is being developed here: https://gitlab.inria.fr/mdenes/docker-machine-cloudstack

I will stabilize it, then replace the current runners with this orchestration. Warning: it is currently completely undocumented, and it moves and breaks often.

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:00):

The new runners seem to work fine so far. @Gaëtan Gilbert FYI, they should be able to handle the docker boot job as well.

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:00):

(not tested yet)

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:01):

It should be fairly easy to add compute from an external cloud provider using the same orchestration, btw.

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:03):

Oh, and also: these runners are now available to the coq group, not just the coq project. @Théo Zimmermann I think you needed it for another project, but I don't recall exactly.

view this post on Zulip Théo Zimmermann (Oct 11 2022 at 15:03):

Not me, the opam-coq-archive maintainers.

view this post on Zulip Théo Zimmermann (Oct 11 2022 at 15:04):

Well, technically, they were fine with the shared runners apparently (except for long-running jobs, but that's not new).

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:13):

@Thierry Martinez Thanks again for your help on this. I will try to document this orchestration and its deployment, in case it is useful to other projects.

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:25):

About the macOS worker: if it is only used occasionally for storage, can I snapshot and delete it? @Théo Zimmermann @Michael Soegtrop

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:25):

It would probably free resources for two more runners.

view this post on Zulip Théo Zimmermann (Oct 11 2022 at 15:25):

:+1: from me if you have the :+1: from Michael

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:32):

We should probably also think about our usage of the bench machines. Am I right that they are idle most of the time?

view this post on Zulip Maxime Dénès (Oct 11 2022 at 15:32):

We could use them for CI when they are not used for benchmarking.

view this post on Zulip Théo Zimmermann (Oct 11 2022 at 15:34):

Absolutely, if you find a way of making sure that bench jobs can "lock" them (and, of course, that they wait until they are fully "locked" before they start running).

view this post on Zulip Maxime Dénès (Oct 12 2022 at 11:24):

@Gaëtan Gilbert Am I right that pendulum is unused these days?

view this post on Zulip Maxime Dénès (Oct 12 2022 at 11:24):

(if so, I can add some runners on it to improve a bit our CI situation)

view this post on Zulip Gaëtan Gilbert (Oct 12 2022 at 11:29):

yes (it's set as paused in the runner list currently)

view this post on Zulip Jim Fehrle (Oct 18 2022 at 23:09):

A couple thoughts:

view this post on Zulip Théo Zimmermann (Nov 05 2022 at 13:22):

In the end, we finished October having only used 86% of our CI allowance and 5 days into November we have only used 1% (since shared runners are only enabled on opam-coq-archive). Any opposition to enabling shared runners back on the Coq repo as well and monitoring how it goes? (With the light pipelines by default and our own runners enabled, we do not risk CI minutes running out as quickly as in the beginning of October.)

view this post on Zulip Gaëtan Gilbert (Nov 09 2022 at 17:31):

would it be useful?

view this post on Zulip Théo Zimmermann (Nov 10 2022 at 07:47):

Well, sometimes pipelines are waiting quite a long time in the queue because of limited runner availability. We also still haven't seen a fully green CI for several weeks because of fiat-crypto and VST failing on the new runners (or has this been solved now?). Furthermore, since these shared runners are free, there would not be much point in not using them if they can help.

view this post on Zulip Pierre-Marie Pédrot (Nov 14 2022 at 13:59):

VST CI should now be passing on master after https://github.com/coq/coq/pull/16795

view this post on Zulip Pierre-Marie Pédrot (Nov 14 2022 at 14:00):

but fiat-crypto is still using too much memory, as it seems that its custom makefile ignores the NJOBS option

view this post on Zulip Pierre-Marie Pédrot (Nov 14 2022 at 14:00):

CI has been broken for more than a month; we should really do something about it

view this post on Zulip Pierre-Marie Pédrot (Nov 14 2022 at 14:01):

there was the idea of using tags for memory-intensive runs; is it hard to implement?

view this post on Zulip Théo Zimmermann (Nov 15 2022 at 08:18):

I don't recall exactly how tags work, but we could probably do something like this: tag our runners and tell them to only run tagged jobs, tag all our jobs with the same tag, make fiat-crypto and a few other jobs the only ones without tags, and re-enable the shared runners so that we use them for these.
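
A minimal sketch of what that could look like in .gitlab-ci.yml, assuming our own runners are registered with a (hypothetical) coq-runner tag and with "run untagged jobs" disabled; the job names are illustrative, not the actual ones from Coq's CI:

  # .gitlab-ci.yml sketch -- hypothetical tag and job names
  build:base:
    tags:
      - coq-runner        # picked up only by our own runners
    script:
      - echo "regular CI job"

  library:fiat-crypto:
    # no tags: once the shared runners are re-enabled, only they pick this up
    script:
      - echo "memory-hungry job"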

view this post on Zulip Gaëtan Gilbert (Nov 15 2022 at 09:14):

do the shared runners actually have more power? https://github.com/coq/coq/pull/16801#issuecomment-1314584174 sounds like fiat-crypto started using more memory

view this post on Zulip Karl Palmskog (Nov 15 2022 at 09:16):

I think the shared runners actually have less memory than GitHub runners. For example, in the Coq opam archive, we started seeing OOMs more frequently with shared runners

view this post on Zulip Théo Zimmermann (Nov 15 2022 at 13:03):

Karl Palmskog said:

I think the shared runners actually have less memory than GitHub runners. For example, in the Coq opam archive, we started seeing OOMs more frequently with shared runners

What do you call "GitHub runners" and what do you call "shared runners" exactly? It looks to me like what you want to say is that you started to see more OOMs with "custom runners" (the ones that were added by Maxime) compared to when we only used "shared runners" (the ones made available by GitLab). If that's correct, then this would match my impression of what has happened on the side of the Coq repo (where shared runners had no trouble with VST and fiat-crypto but the custom runners using the former bench machines did).

view this post on Zulip Karl Palmskog (Nov 15 2022 at 13:07):

right, yes, after Maxime added custom runners, we had more OOMs in the Coq opam archive. I think the old ones had at least 7 GB of memory, now it might be something like 4 GB.

view this post on Zulip Michael Soegtrop (Nov 15 2022 at 14:06):

4 GB is indeed very tight for building Coq stuff, and likely not very efficient because the machine will probably swap (with -j 2).

view this post on Zulip Paolo Giarrusso (Nov 16 2022 at 00:44):

https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html says "1 vCPU, 3.75 GB RAM" (which is what I've seen). Medium machines are available to free projects (enable them with tags: saas-linux-medium-amd64); they are twice as big and consume twice as many minutes (https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html#additional-costs-on-gitlab-saas).
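
For example, opting a single memory-hungry job into the bigger machine would look roughly like this (the job name is hypothetical, and per the docs above the job then consumes minutes at twice the rate):

  # .gitlab-ci.yml sketch: route one job to the larger GitLab SaaS runner
  some-memory-hungry-job:
    tags:
      - saas-linux-medium-amd64   # roughly twice the small machine, twice the minute cost
    script:
      - echo "heavy build step"   # placeholder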

view this post on Zulip Michael Soegtrop (Nov 16 2022 at 08:48):

Do we know how much total RAM and CPU would be sufficient to run Coq CI reasonably efficiently?

view this post on Zulip Karl Palmskog (Nov 16 2022 at 08:56):

my experience is that 4 cores and something like 10-12 GB RAM works well

view this post on Zulip Karl Palmskog (Nov 16 2022 at 08:56):

but I'm not sure this would even be enough for outliers like Fiat Crypto to work as expected

view this post on Zulip Michael Soegtrop (Nov 16 2022 at 09:32):

I meant more the total amount of RAM and CPU we need to run Coq CI efficiently.

view this post on Zulip Jason Gross (Nov 16 2022 at 16:28):

Fiat Crypto currently peaks at a bit over 4 GB per thread/process. Last I checked, our heaviest (> 2 GB) files memory-wise are:

  1m02.21s | 4379108 ko | Bedrock/End2End/RupicolaCrypto/Derive.vo
  3m06.05s | 2930324 ko | rupicola/bedrock2/compiler/src/compiler/FlatToRiscvFunctions.vo
  2m02.81s | 2751688 ko | rupicola/bedrock2/compiler/src/compiler/FlatToRiscvMetric.vo
  5m30.49s | 2649632 ko | Bedrock/Field/Synthesis/Examples/p224_64_new.vo
 10m08.34s | 2582048 ko | Bedrock/End2End/X25519/GarageDoor.vo
  4m17.69s | 2500784 ko | Assembly/WithBedrock/Proofs.vo
  1m49.62s | 2252340 ko | Fancy/Barrett256.vo
  0m34.38s | 2216436 ko | ExtractionOCaml/bedrock2_solinas_reduction.ml
  0m33.70s | 2199600 ko | ExtractionOCaml/with_bedrock2_word_by_word_montgomery.ml
  0m32.69s | 2199516 ko | ExtractionOCaml/bedrock2_word_by_word_montgomery.ml
  0m39.68s | 2147996 ko | ExtractionOCaml/word_by_word_montgomery
  0m44.54s | 2147880 ko | ExtractionOCaml/bedrock2_solinas_reduction
  0m43.08s | 2147548 ko | ExtractionOCaml/solinas_reduction
  1m56.73s | 2140160 ko | fiat-rust/src/p384_scalar_32.rs
  1m56.65s | 2139664 ko | fiat-java/src/FiatP384Scalar.java
  0m32.10s | 2130396 ko | ExtractionOCaml/solinas_reduction.ml
  0m16.34s | 2106988 ko | ExtractionHaskell/bedrock2_solinas_reduction.hs
  0m16.60s | 2106956 ko | ExtractionHaskell/with_bedrock2_solinas_reduction.hs
  0m31.79s | 2095372 ko | ExtractionOCaml/word_by_word_montgomery.ml
  0m29.60s | 2092128 ko | ExtractionOCaml/with_bedrock2_unsaturated_solinas.ml
  0m30.75s | 2091876 ko | ExtractionOCaml/bedrock2_unsaturated_solinas.ml
  1m56.18s | 2088112 ko | fiat-c/src/p384_scalar_32.c
  0m42.31s | 2088052 ko | ExtractionOCaml/bedrock2_word_by_word_montgomery
  0m41.92s | 2084496 ko | ExtractionOCaml/with_bedrock2_word_by_word_montgomery
  4m29.40s | 2077784 ko | Rewriter/Passes/ArithWithCasts.vo
  7m09.75s | 2070516 ko | Curves/Weierstrass/AffineProofs.vo
  1m56.19s | 2056320 ko | fiat-go/32/p384/p384.go
  1m30.01s | 2053492 ko | SlowPrimeSynthesisExamples.vo

If we're careful enough, 2GB--3GB per coq process is enough for most fiat-crypto files. I think much of fiat-crypto can usefully use 6--8 cores in parallel (and I've gone through some pains to achieve good file-level parallelism), though we are sometimes blocked on single slow files (like the 10m GarageDoor file that depends on almost everything else in the repo). I think with 14 GB RAM fiat-crypto could safely be built with -j4 -async-proofs-j 1, which would be pretty good. I think 8 GB RAM might just barely be enough for -j2 -async-proofs-j 1.
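
As a rough sanity check of those figures against the table above, assuming the worst case where the heaviest files are all being compiled at the same time:

  \texttt{-j4}: \; 4.38 + 2.93 + 2.75 + 2.65 \approx 12.7\ \text{GB} < 14\ \text{GB}
  \texttt{-j2}: \; 4.38 + 2.93 \approx 7.3\ \text{GB} < 8\ \text{GB}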

view this post on Zulip Maxime Dénès (Nov 17 2022 at 09:17):

Karl Palmskog said:

right, yes, after Maxime added custom runners, we had more OOMs in the Coq opam archive. I think the old ones had at least 7 GB of memory, now it might be something like 4 GB.

Which tags were you using?

AFAICT, the default GitLab shared runner has 3.75 GB RAM (https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html)

view this post on Zulip Maxime Dénès (Nov 17 2022 at 09:17):

Oh sorry, I see that @Paolo Giarrusso pointed to the same link

view this post on Zulip Maxime Dénès (Nov 17 2022 at 09:18):

So I'm a bit lost: did I fail to reproduce the same setup as before?

view this post on Zulip Théo Zimmermann (Nov 17 2022 at 17:21):

Indeed, for some reason the setup is not the same as before AFAIU, but I don't know why.

view this post on Zulip Théo Zimmermann (Nov 18 2022 at 11:27):

@Maxime Dénès @Gaëtan Gilbert Would it be fine to re-enable shared runners for the Coq repository to boost the PR processing cycle during the release process? I can monitor closely our usage and we can decide upfront a limit that we do not want to go over (e.g., 30%).

view this post on Zulip Gaëtan Gilbert (Nov 18 2022 at 11:58):

do as you like, just tell us when you change something
we were fine with wasting a large percentage last month, so 30% is pretty conservative IMO

view this post on Zulip Théo Zimmermann (Nov 18 2022 at 11:59):

Done.

view this post on Zulip Théo Zimmermann (Nov 18 2022 at 17:21):

So we are indeed burning CI minutes at a fast rate with shared runners on. We jumped from about 6% use this morning to 26%.

view this post on Zulip Théo Zimmermann (Nov 18 2022 at 21:51):

We've already reached 34% use. I've disabled the shared runners. Apparently, this doesn't stop the current running jobs (which is rather good in this case).

view this post on Zulip Maxime Dénès (Dec 07 2022 at 13:35):

I'm adding more runners

view this post on Zulip Gaëtan Gilbert (Dec 07 2022 at 13:48):

what's our current strategy for jobs that need privileged Docker? https://gitlab.com/coq/coq/-/jobs/3433179724 is stuck

view this post on Zulip Maxime Dénès (Dec 07 2022 at 13:55):

The strategy I suggested was to save our shared runner minutes for them

view this post on Zulip Gaëtan Gilbert (Dec 07 2022 at 14:28):

OK, I'm enabling shared runners until they pick up the job, then disabling them

view this post on Zulip Gaëtan Gilbert (Dec 07 2022 at 14:29):

it got picked up fast wow

view this post on Zulip Gaëtan Gilbert (Dec 08 2022 at 16:40):

Maxime Dénès said:

I'm adding more runners

where are they coming from btw?

view this post on Zulip Maxime Dénès (Dec 08 2022 at 17:32):

Inria's CloudStack

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 09:07):

We got an extra allowance?

view this post on Zulip Maxime Dénès (Dec 09 2022 at 09:07):

?

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 09:36):

Ah no, these were the 8 stopped runners, right?

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 09:43):

Does it mean that coq-docker-machine-driver should be removed?

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 09:44):

Also, it would be really good to get documentation of this infrastructure, since it looks like it's going to remain like this indefinitely. This would be helpful so that others than you, Maxime, can act when there are maintenance issues.

view this post on Zulip Maxime Dénès (Dec 09 2022 at 09:54):

Théo Zimmermann said:

Does it mean that coq-docker-machine-driver should be removed?

Yes.

view this post on Zulip Théo Zimmermann (Dec 09 2022 at 16:27):

@Maxime Dénès A job failed with "no space left on device". I think that's on one of the new runners: https://gitlab.com/coq/coq/-/jobs/3451330182

