What should we do about the recurring too many pipeline jobs error?
The limit is documented here https://docs.gitlab.com/ee/user/gitlab_com/index.html#gitlab-cicd
"Max jobs in active pipelines gitlab.com:500 for Free tier, unlimited otherwise self-hosted:unlimited"
I understand we could apply for some open source program and get that limit lifted too?
If you mean https://about.gitlab.com/solutions/open-source/program/ then "All public projects on GitLab.com automatically receive Gold functionality at the project level. Apply to this program if you need Gold functionality at the group level."
job limits are project level
Ok, just an idea, maybe we could just write to them? Documentation on that was always a bit imprecise, I see they have reworded it a bit tho.
I made https://gitlab.com/gitlab-org/gitlab/-/issues/240895
it sounds like we should apply for the program
That was my understanding last time I looked.
That we needed to apply
I suspect we probably don't want GitHub Actions, as this will result in the CI taking 25x more real time to run (since we only get 20 concurrent jobs)
do we really get 500 jobs on gitlab? O_O
20 concurrent jobs is already pretty good. On GitLab, I feel like we frequently get that as well, but sometimes we also get a single free runner and most of our jobs are run by our own runners. If I'm not mistaken, GitHub also supports setting up your own runners.
https://forum.gitlab.com/t/not-treated-as-open-source-for-pipeline-job-limit/41938
Emilio Jesús Gallego Arias said:
That we needed to apply
ok shall we do it?
is there any blocker?
Not that I'm aware
I applied, we'll see
Note that they replied at https://gitlab.com/gitlab-org/gitlab/-/issues/240895#note_402263876:
"""
Ah apologies I didn't realise the project was public.
I'm going to ask around for you to best advise on the next steps.
As mentioned previously this isn't the best process to get the right answers for you. It certainly appears your project is on the free tier going by the behavior but I'm not able to say why this isn't the case sadly.
To ensure you're getting to speak to the right people to help with this further I would recommend either creating a thread about this on our support forum or to contact our Open Source team's email at opensource@gitlab.com. Hope this helps you get it sorted!
"""
So perhaps someone should email?
@Matthieu Sozeau any response to the application? we're still seeing the issue eg https://gitlab.com/coq/coq/-/pipelines/193464370
Nope, not yet. They said "allow 10 days for a response" and I sent it last Wednesday
I just signed the papers to the Gold plan, so we should benefit from it anytime now
Does not seem to be the case yet: screenshot
Still the case today :frown:
And we still get the "job activity exceeded" failure: https://gitlab.com/coq/coq/-/pipelines/197875878
We're still on the free plan with recurrent "job activity exceeded" failures.
This is still an issue.
@Matthieu Sozeau can you ping gitlab about this?
Yes, sure
Done
(I pinged them again)
As a kludge, can coqbot restart pipelines that fail with this error?
The reason these pipelines fail is that we go over the limits for the Free plan (which should have been lifted if GitLab had really granted us the Gold OSS plan). We believe that the fact that restarting manually allows the pipeline to run despite this limitation is a GitLab bug and we are fortunate that this bug still exists. Therefore, we should avoid abusing it by systematically restarting the failed pipelines, which would put us well beyond the limits for the Free plan.
The plan to alleviate this problem was rather to reduce the number of jobs that are systematically run for each pipeline, with a dual mode, where, by default, we run a more lightweight pipeline, then upon request and before merging a PR, we run the full pipeline. This is not yet implemented because I had other priorities that interfered.
Indeed, @Théo Zimmermann gave a good summary. The current state of the job limit is a bit strange, as even the direction / scope of the bug is not clear. Moreover, all CI systems are under attack due to cryptocurrency mining, so things may get quite worse for us, but fortunately our choice of using GitLab means we could provide a backup if needed.
Don't we have new workers available for this actually?
this is not related to workers
Even if we had an infinite number of workers that we own, we would still need to switch to our own GitLab (e.g. Inria's GitLab) or get the Gold plan to avoid this limit on the number of concurrent jobs.
I see
Sadly, they seem to be overwhelmed... no response still after a week
Finally! (screenshot)
Wonderful! But given the actual start date, they could have adapted the end date :stuck_out_tongue:
I guess they'll renew easily, as we're linked to the Inria account
I can't seem to find a way to check that the pipeline limit has been lifted, but we'll know soon enough :)
Indeed, the last known pipeline activity limit errors were on June 3rd.
Hurray!
I'm getting a prompt to renew on https://gitlab.com/coq/coq/-/pipelines
Screenshot_20210907_123410.png
it needs customer credentials, @Matthieu Sozeau can you handle it?
In another thread, Matthieu answered that the renewal was supposed to be automatic (but that he would check with GitLab support).
Done, I sent a renewal request, apparently they're in the process of automating this but it does not work yet for Gitlab for OSS projects
The renewal was signed today, so we should get no interruptions
@Matthieu Sozeau You said you handled the renewal this year again, right? GitLab still shows "Your Opensource subscription for coq will expire on 2022-09-29. After it expires, you can't use merge approvals, code quality, or many other features."
Yes, I signed the quote already. I’ll check with them
Have they replied to your inquiry? The message is still showing up and the expiration date is 2022-09-29, so in less than a week.
It's becoming urgent. I just got (and probably so did you) a GitLab email alarmingly stating:
coq namespace has 30% or less Shared Runner Pipeline minutes remaining. Once it runs out, no new jobs or pipelines in its projects will run.
I think they changed how they count CI usage
counted usage:
usage.png
actual time spent:
time.png
it used to be time = usage * 25
but this month it's time = usage * 2
so now we are running out
see also https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html#cost-factor
0.5 for public projects in the GitLab for Open Source program.
1 for other public projects, after October 1, 2022 (previously 0.04).
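To make the cost-factor change concrete, here's a small sketch of the arithmetic (the helper name is made up; the factors are the ones quoted from the docs above):

```python
# Cost factors from the GitLab docs quoted above (billed = real * factor):
OSS_FACTOR = 0.5         # public projects in the GitLab for Open Source program
PUBLIC_FACTOR = 1.0      # other public projects, after October 1, 2022
OLD_PUBLIC_FACTOR = 0.04 # previous factor for public projects

def billed_minutes(real_minutes: float, cost_factor: float) -> float:
    """Billed CI minutes consumed by a given amount of real runner time."""
    return real_minutes * cost_factor

# One billed minute used to buy 1 / 0.04 = 25 real minutes, matching the
# "time = usage * 25" observation above; at factor 0.5 it buys only 2,
# matching "time = usage * 2".
assert 1 / OLD_PUBLIC_FACTOR == 25
assert 1 / OSS_FACTOR == 2
```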
we could try to have core devs open forks and run their pipelines there (forks get a ×125 factor on 2k free minutes, instead of a ×2 factor on 50k free minutes)
the permission juggling could get complex though
we could buy more minutes (we seem to be using about $50/day); I don't know where the money would come from, though
we could try moving to gitlab.inria.fr
we could try getting our own machines or vms somewhere
anyway at the current rate CI will break in a couple days
cc @Maxime Dénès (you probably need to be aware, especially if the solution ends up being buying runners somewhere)
I feel stabbed in the back by GitLab because they have not announced this change of factors for public projects widely enough for us to notice it (they could have sent an email before it was almost too late!).
Maybe it's time to consider implementing https://github.com/coq/bot/issues/159 urgently.
I propose to urgently disable most CI jobs by default (depending on the value of a variable).
Then we can add the necessary support to coqbot to handle this variable automatically.
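For what it's worth, a minimal sketch of what that gating could look like in `.gitlab-ci.yml` (the job names here are made up; the `FULL_CI` variable name matches the one used elsewhere in this thread):

```yaml
# Illustrative sketch: gate the heavy jobs on a FULL_CI variable so that,
# by default, only a lightweight pipeline runs.
build:
  script: make world

# Heavy downstream job: only added to the pipeline when FULL_CI is set to
# "true" (e.g. by the bot, or manually when launching the pipeline).
library-ci:
  script: make ci-all
  rules:
    - if: '$FULL_CI == "true"'
```

With `rules`, a job whose rules all fail to match is simply not added to the pipeline, so nothing else needs to change for the lightweight default.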
Is migrating some of the jobs to GitHub Actions a potential solution for the short term?
Not if we don't get some GitHub sponsorship IMO, otherwise our CI will be quickly throttled by the 4 parallel job limit.
According to the new cost factor, we are allowed 100,000 shared runner minutes. Assuming that we do not change the way opam-coq-archive works and based on the September stats (26,000 minutes for opam-coq-archive, 782,000 minutes for Coq), we should divide our CI consumption by more than a factor 10 on the Coq repository to stay in the new limits.
Unfortunately, even if we could achieve this instantly, it would have to be even more (like a factor 30) for the month of October, since we've already spent close to 70% of the available minutes in the first three days.
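Spelling out the back-of-the-envelope numbers (all figures come from the September stats quoted just above; the rest is simple arithmetic):

```python
# Back-of-the-envelope check of the reduction factors quoted above.
MONTHLY_ALLOWANCE = 100_000  # shared runner minutes/month under the new factors
OPAM_ARCHIVE = 26_000        # September usage, opam-coq-archive
COQ = 782_000                # September usage, Coq repository

# If opam-coq-archive keeps its current consumption, Coq must fit in:
coq_budget = MONTHLY_ALLOWANCE - OPAM_ARCHIVE
reduction = COQ / coq_budget
print(f"steady-state reduction needed: {reduction:.1f}x")  # ~10.6x

# For October specifically, ~70% of the allowance was gone after 3 days,
# so only 30% remains for the rest of the month:
remaining = 0.30 * MONTHLY_ALLOWANCE
print(f"October reduction needed: {COQ / remaining:.1f}x")  # ~26x
```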
:bomb:
So this is no-CI October then?
I actually think it's a realistic target to achieve if we cut on the long-running jobs most of the time.
That's why I was saying:
I propose to urgently disable most CI jobs by default (depending on the value of a variable).
Urgently, as in now?
If we consumed 70% in 3 days, that's about 1% per hour.
Yes, as in today.
Who's doing that? I'm personally not CI-aware enough to write a patch, and this is a red alert :rotating_light:
I can
please!
I propose to merge your PR as soon as its CI finishes.
(Unironically, that is.)
Do we still use Inria VMs for our CI?
If not I can look into adding some runners there this week
we don't
the only remaining VM we have is coq-osx1, is it safe to delete?
it eats up a part of our memory quota
Let's discuss this in a call, but I think we can:
@Théo Zimmermann , instead of disabling jobs, why don't we tag them so they are only picked by our runners?
Maxime Dénès said:
the only remaining VM we have is coq-osx1, is it safe to delete?
It is not. AFAICT, it is still used to store the macOS signing key (Michael downloads it on his computer to sign packages, then deletes it).
There are some other choices to further optimize this as discussed in the https://github.com/coq/coq/issues/16201
Ok, noted. We still have enough quota to create some additional runners, though.
Emilio Jesús Gallego Arias said:
Théo Zimmermann , instead of disabling jobs, why don't we tag them so they are only picked by our runners?
If we don't disable most CI jobs now, we will have lost all of our remaining CI minutes. But of course, I'm in favor of deploying our own runners to pick up some of the load as soon as we can realistically do that.
Maxime Dénès said:
If not I can look into adding some runners there this week
It is not worth it to do so with the VM infra, it is better to use directly the runners implemented in gitlab.inria.fr
Emilio Jesús Gallego Arias said:
Let's discuss this in a call, but I think we can:
- use gitlab.inria.fr (it worked fine, see my branch and the experiments), only problem is that we need to ask for a bit larger artifact size limit
- deploy our runners
do you have a pointer to the branch?
Théo Zimmermann said:
If we don't disable most CI jobs now, we will have lost all of our remaining CI minutes. But of course, I'm in favor of deploying our own runners to pick up some of the load as soon as we can realistically do that.
Well, either by disabling most of CI or by running out of minutes, we still lose CI so we are in shit.
Maxime Dénès said:
Emilio Jesús Gallego Arias said:
Let's discuss this in a call, but I think we can:
- use gitlab.inria.fr (it worked fine, see my branch and the experiments), only problem is that we need to ask for a bit larger artifact size limit
- deploy our runners
do you have a pointer to the branch?
Yup, one sec
I think that the fastest is to migrate to gitlab.inria.fr
provided Inria grants us enough quota
Emilio Jesús Gallego Arias said:
It is not worth it to do so with the VM infra, it is better to use directly the runners implemented in gitlab.inria.fr
No, that is not possible. The shared runners available in the Inria Gitlab are just a handful. We will bring the whole infrastructure to its knees if we dump the whole Coq CI on them.
https://gitlab.inria.fr/egallego/coq/-/pipelines/545950
That's the latest run, I think
Guillaume Melquiond said:
Emilio Jesús Gallego Arias said:
It is not worth it to do so with the VM infra, it is better to use directly the runners implemented in gitlab.inria.fr
No, that is not possible. The shared runners available in the Inria Gitlab are just a handful. We will bring the whole infrastructure to its knees if we dump the whole Coq CI on them.
@Guillaume Melquiond do you have a pointer to the resource allocation? I have a good idea of the computing power we need; it is actually not so much.
The current state of the Inria runners is: "For now, there are 8 slots for the small runner, 4 for the medium one, and 11 for the large one".
But are these real or virtual slots?
I mean per-project
Emilio Jesús Gallego Arias said:
404
Permission error...
No, that is for the whole instance. No more than 11 jobs can run in parallel globally.
I store private branches there so I can't make ejgallego/coq public, let me see
I think that the GitLab CI fallout is anyway a reminder that we have been burning lots of energy without paying enough attention and I honestly believe that the only solution is to be more careful about how we use CI.
https://github.com/ejgallego/coq/tree/inria-ci that's the last version I tried
Note that what I proposed is not to lose our CI but to not run the full version by default at every push on any PR.
Yes
@Gaëtan Gilbert you can push that branch to your own fork on inria CI and it should produce a CI
I can't open my own repos right now
Théo Zimmermann said:
I think that the GitLab CI fallout is anyway a reminder that we have been burning lots of energy without paying enough attention and I honestly believe that the only solution is to be more careful about how we use CI.
Yes, there is a lot we can optimize actually
We can't really blame gitlab for restricting this
Free CI can't be unlimited, the economics just can't work
Note that even with the tweak to the gitlab files, all PRs currently open and not rebased on master will trigger a full build.
We can't really blame gitlab for restricting this
No, we can't. The only thing I blame them for is not communicating more clearly a bit ahead of time.
Pierre-Marie Pédrot said:
Note that even with the tweak to the gitlab files, all PRs currently open and not rebased on master will trigger a full build.
Hum, no. coqbot does a merge with master and pushes that commit to the GitLab mirror.
Ah, that's good news in an ocean of trouble then.
I'm surprised to hear that our load (which is not so big IMHO) would bring the Inria CI infra to its knees tho
I revised my estimates, but a single reasonably modern server should handle our load just fine
Maybe it's time to whine to the inventeurs du monde numérique?
Do we not have the budget to buy more minutes?
Also, moving to GitHub Actions could be possible, since there is an "enterprise plan" allowing 180 concurrent workers.
https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration
However we would lose the ability to inspect artifacts
Pierre-Marie Pédrot said:
Maybe it's time to whine to the inventeurs du monde numérique?
For sure if not for what Guillaume said, that would be a very good solution, as gitlab.inria.fr would be a drop-in replacement for us
so we would just need to push my branch to Coq, and update the bot to link to gitlab.inria.fr
but indeed we need to whine^W discuss with the people in charge
I understand this is way down the priority ladder, but this affects the Coq opam archive CI too, right?
@Gaëtan Gilbert do you mind pushing the inria-ci
branch to your own gitlab.inria.fr fork so we can get an Inria run
?
what's the point?
I think everyone believes that it works
Not @Guillaume Melquiond it seems. I'd be curious to see if the prophecy is true. 2022/10/03, aka "The day the Inria Gitlab stood still".
Karl Palmskog said:
I understand this is way down the priority ladder, but this affects the Coq opam archive CI too, right?
Yes, but the opam-coq-archive CI is not so big. If we were to remove or sufficiently scale down the Coq CI, then opam-coq-archive could continue its business-as-usual.
Gaëtan Gilbert said:
what's the point?
I think everyone believes that it works
I think everybody believes that it works at a small scale. The question is just whether it will handle the full scale.
But it is easy enough to test actually.
We can just create an Inria GitLab mirror for Coq now and double-push every PR to both the current repo and the new one (and set FULL_CI=true over there).
Then we can see if we bring Inria's infrastructure down to its knees or not.
The Inria admins will surely appreciate.
Gaëtan Gilbert said:
what's the point?
I think everyone believes that it works
No, it doesn't fully work, we hit some limits on artifact size
A single push will work modulo the artifact size problem, but of course we should talk to the responsible person before pushing dozens of jobs
The question is what is the least amount of work? Is talking with the Inria admins for the CI the quickest path?
I don't know
I think exploring multiple routes in parallel is probably the safest path.
also more costly tho
@Théo Zimmermann Don't you also want to disable the test-suite+async job? This one never terminates on time and takes incompressible hours for no good reason.
Have we done some math to understand if this disable CI work is gonna work?
No time for maths, this is the STOP blinking red button.
I am afraid that in the end it is just wasted work, if even after the saving it doesn't work out for us and we are left without CI
Unfortunately, I don't have precise enough data to do this math, but from some plots by @Jason Gross that I have seen, I think this should work.
This is the STOP blinking button, but imagine we are not pressing the right STOP button but the AUTO-DESTRUCT button
Emilio Jesús Gallego Arias said:
I am afraid that in the end it is just wasted work, if even after the saving it doesn't work out for us and we are left without CI
This was some work that I wanted to do anyway. I just didn't want to do it now.
so we rather read the labels on the buttons carefully before pushing
Recall that this issue https://github.com/coq/bot/issues/159 was opened in 2021 after a discussion that happened in 2020.
Any button is good to press when you're bound to fall into the void in a split second.
Merge first, talk later.
Théo Zimmermann said:
Emilio Jesús Gallego Arias said:
I am afraid that in the end it is just wasted work, if even after the saving it doesn't work out for us and we are left without CI
This was some work that I wanted to do anyway. I just didn't want to do it now.
Yes, trying to figure out what to build is a very interesting topic; we have been discussing it in Coq-Universe too.
Pierre-Marie Pédrot said:
Any button is good to press when you're bound to fall into the void in a split second.
Yes, but are we bound to fall into the void in a split second? I don't see such urgency here.
Do we all agree that we don't want to drop CI for merges to main, right?
If so, let's keep calm and try to assess what our best path is.
Using FULL_CI + the bot could be a good strategy, or maybe it doesn't work because we will still run out of minutes
gitlab.inria.fr seems way more promising but we need to understand if Inria can provide us the right capacity
Deploying our runners + tagging jobs could be a good solution, but I'm not sure if that's possible / what the status of the servers is
Acting in panic rarely leads to a good outcome
Emilio Jesús Gallego Arias said:
Using FULL_CI + the bot could be a good stragegy, or maybe it doesn't work because we will still run out of minutes
I don't know if it will be a good long-term strategy or not but it buys us some time to consider which strategy is best.
Honestly I don't see what could go wrong with deactivating everything. Thanks to the magic of VCS, worst case we can revert or tweak the result.
But really, the thing we don't have right now is time. If 1% goes off every hour we're really screwed.
It's not really 1% every hour BTW: if someone opens a PR now, it probably will be more than 1% spent.
in that case, maybe consider disallowing PRs via GitHub for a while rather than have people open them by "mistake"
BTW, I just computed how much it would cost paying for the minutes we need without doing anything. That would be about €3500/month. Technically, we could pay for one month until we figure this out. That's still a lot of money to throw away I think.
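For the record, that figure is consistent with GitLab's advertised price of $10 per 1,000 additional CI minutes (that price is an assumption on my part; check the pricing page):

```python
# Sanity check of the ~€3500/month figure, assuming $10 per 1,000
# additional billed minutes (an assumption; verify against GitLab pricing).
REAL_MINUTES = 782_000 + 26_000  # September: Coq + opam-coq-archive
COST_FACTOR = 0.5                # GitLab for Open Source, public project
FREE_BILLED = 50_000             # i.e. 100,000 real minutes at factor 0.5

billed = REAL_MINUTES * COST_FACTOR
extra = billed - FREE_BILLED
print(f"${extra / 1000 * 10:,.0f}/month")  # $3,540/month, roughly €3500
```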
Karl Palmskog said:
in that case, maybe consider disallowing PRs via GitHub for a while rather than have people open them by "mistake"
No need now: https://github.com/coq/coq/pull/16600 was merged.
I found this announcement from last year: https://about.gitlab.com/blog/2021/11/11/public-project-minute-limits/
Did something change on top of this?
I thought Coq would be exempt from the minute counting.
Pierre-Marie Pédrot said:
Not Guillaume Melquiond it seems. I'd be curious to see if the prophecy is true. 2022/10/03, aka "The day the Inria Gitlab stood still".
Let us be clear. There is currently a single 64-core server that is used for the whole Inria CI. It won't be long before we are forbidden to use it, if we are constantly hammering it with jobs from Coq's CI.
@Ali Caglayan See https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/Too.20many.20pipelines/near/302079565
This clearly changed on Oct. 1st
Although we were not aware of this change having been announced.
Can we contact GitLab since this is somewhat absurd?
It could even be that an engineer accidentally omitted the 5.
Last updated: Feb 06 2023 at 18:03 UTC