Hi @Maxime Dénès, we are seeing an abnormally high number of runner failures. coqbot is not robust to these situations because it automatically tries to restart the jobs that failed with such errors. I had to cancel all running pipelines.
Abnormally high = all jobs have been failing for the last ~hour, I think.
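coqbot's behaviour here amounts to retrying jobs whose failure reason is a runner system failure rather than a script failure. Below is a minimal sketch of that kind of logic over the GitLab REST API, assuming the jobs endpoint exposes a failure_reason field and using placeholder IDs and a placeholder token; coqbot's actual implementation may differ.
# Sketch only: list failed jobs of a pipeline and retry the ones that failed
# because of the runner, not because of the build script.
# GITLAB_TOKEN, PROJECT_ID and PIPELINE_ID are placeholders.
curl --silent --globoff --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs?scope[]=failed" |
  jq -r '.[] | select(.failure_reason == "runner_system_failure") | .id' |
  while read -r job_id; do
    # POST /projects/:id/jobs/:job_id/retry restarts a single job.
    curl --silent --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
      "https://gitlab.com/api/v4/projects/$PROJECT_ID/jobs/$job_id/retry" > /dev/null
  done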
Ouch. Do you have an example of a failure? I'll investigate
Look for any job here: https://gitlab.com/coq/coq/-/jobs
The only jobs that did not fail are the ones that ran on #18070127 (dQN1ZPjc) coq-docker-machine-driver.
Oh, but that's most of the runners that were OK, then.
It would have been better to just pause the two problematic runners IMHO (the ones which run on pendulum).
Yes, sorry, I had to react urgently.
Maxime Dénès said:
Oh, but that's most of the runners that were OK, then.
What? Not at all!
Didn't you say all jobs taken by coq-docker-machine-driver were succeeding?
That's the CloudStack orchestrator.
OK, maybe I am blinded by the fact that the list of jobs is flooded by the failing ones.
Running docker machine, as the name indicates.
So 7 / 9 runners were ok, IIUC.
But why do all the non-problematic jobs have exactly the same runner ID then?
Because it's one docker-machine runner
OK
Which orchestrates 7 docker runners
OK
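In other words, only one runner is registered in GitLab (with the docker+machine executor), and it provisions short-lived machines to actually run the jobs, which is why every job shows the same runner ID. On the orchestrator host this can be inspected roughly as follows; the sketch assumes standard docker-machine and gitlab-runner installations, and what the commands print depends on the actual setup.
gitlab-runner list    # the single registered docker+machine runner
docker-machine ls     # the machines it has provisioned to run jobs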
I paused the two pendulum runners I added yesterday; they must have a problem.
OK, I will restart the cancelled pipelines then.
You can do one and see if it looks ok
It is a bit unfortunate that I have to add workers on the side, but we are using all our CloudStack quotas and I didn't hear back about the order on external cloud providers (despite asking), so there's not much I can do to add capacity to my docker-machine setup.
Yesterday I fixed one of the bench machines that hadn't been working for a while; I'll try to add runners on it too.
A good thing about how GitLab works is that when I restart cancelled pipelines, it only restarts the failed or cancelled jobs.
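That matches GitLab's pipeline retry semantics: retrying a pipeline, from the UI or the API, re-runs only its failed or cancelled jobs and keeps the ones that already passed. A rough equivalent over the REST API, with placeholder IDs and token:
# Retry a pipeline: only failed/cancelled jobs are re-run.
# GITLAB_TOKEN, PROJECT_ID and PIPELINE_ID are placeholders.
curl --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/retry"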
So far, it seems that there are no longer any runner failures.
Pendulum's hard drive is dying, so I'm not surprised we see issues. I'll switch these runners to the bench machines instead.
(keeping only 2 machines dedicated to benches should be enough until we fix these CI capacity issues)
Are you sure we want coqbot to restart pipelines on system failures?
Not sure it is a good idea anymore.
When we started using GitLab CI, runner failures were common.
And this was a way of avoiding the hassle of manually restarting them.
But with our own runners, it doesn't seem such a good thing any longer.
So I'll push a commit disabling this feature, at least for now.
It could still be that we get such spurious failures, but it's hard to tell; we can try and see.
Is it me or are none of the runners running?
https://gitlab.com/coq/coq/-/jobs
All jobs are pending
Oh, never mind, I see that there are running pipelines: https://gitlab.com/coq/coq/-/pipelines?page=1&scope=all
However this is quite the backlog
A big backlog indeed.
I will try to add a bunch of runners today, but I have a few things to finish first.
The list of running jobs can be accessed like this BTW: https://gitlab.com/coq/coq/-/jobs?statuses=RUNNING
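The same data is available through the jobs API, which gives a quick way to gauge the backlog. A sketch with placeholder project ID and token; counts are capped by per_page, so this is only a rough estimate:
# Rough backlog check: how many jobs are running vs pending.
curl --silent --globoff --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID/jobs?scope[]=running&per_page=100" | jq length
curl --silent --globoff --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID/jobs?scope[]=pending&per_page=100" | jq length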
I should be able to add 12 runners, so it should make a difference.
Is it normal that there are only 7 concurrent jobs running, not 8?
@Maxime Dénès Are we able to see the runners from GitLab?
When they exist, yes, why?
Yes, I set the limit to 7 because there was a race condition when we were getting too close to our quota (which should, in principle, allow 8 in parallel).
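For reference, that kind of cap corresponds to the runner-level limit setting. A sketch of how a runner could be registered with it; the flags and values here are illustrative, not the actual orchestrator configuration:
# Illustrative registration of a docker+machine runner capped at 7 concurrent jobs.
# URL, token, image and description are placeholders.
gitlab-runner register --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "$REGISTRATION_TOKEN" \
  --executor "docker+machine" \
  --docker-image "debian:stable" \
  --limit 7 \
  --description "coq-docker-machine-driver"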
@Ali Caglayan Not sure what you meant, but we have our own runners listed here: https://gitlab.com/groups/coq/-/runners (not sure you can access that page)
Théo Zimmermann said:
Ali Caglayan Not sure what you meant, but we have our own runners listed here: https://gitlab.com/groups/coq/-/runners (not sure you can access that page)
Right, that was the page I was after, thanks.
@Théo Zimmermann You can see the orchestration, its settings and its hacks here: https://gitlab.inria.fr/mdenes/docker-machine-cloudstack
Obviously not a finished product. I believe it took more than a year to develop it for the Inria runners; I had 48 hours, so you can imagine :)
It says two of the runners are offline and there is a play button
Should I press the play button to start them up again?
no
well no harm in asking :)
it depends on whether you want to break the whole CI like this morning
I may be missing context, but what are you trying to do?
Why did the CI break this morning?
https://gitlab.com/groups/coq/-/runners
Two of the runners listed here say they are offline.
I know, I paused them this morning
don't start paused runners unless you know why they are paused
the reason is usually that they're broken
What are you trying to do? Hard to help you if we don't know :)
I'm not doing anything, just asking what's what.
The CI has been in a bit of a crisis mode since the start of the month, so we are prioritizing operations over documentation of the setup. It should be addressed once we stabilize the ship.
"ZO RELAXEN UND WATSCHEN DER BLINKENLICHTEN."
Maxime Dénès said:
I should be able to add 12 runners, so it should make a difference.
done
Nix job here ran out of space: https://gitlab.com/coq/coq/-/jobs/3179427160
The opam-coq-archive jobs also seem to be out of space https://github.com/coq/opam-coq-archive/pull/2353#issuecomment-1279992161
I have paused ci-coq-01-runner-02 and ci-coq-01-runner-03 since they are the runners on which space issues have been observed so far.
I have no idea if there is a shared underlying machine and if I should be pausing more runners linked to the same machine.
ci-coq-01:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 32G 0 32G 0% /dev
tmpfs 6.3G 8.8M 6.3G 1% /run
/dev/sda3 865G 177G 644G 22% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/sda1 240M 5.1M 235M 3% /boot/efi
tmpfs 6.3G 0 6.3G 0% /run/user/650364
tmpfs 6.3G 0 6.3G 0% /run/user/656069
doesn't seem full
That's the host AFAICT
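The host-level df does not necessarily show where the builds actually write. If the ci-coq-* runners are Docker-based on that machine (an assumption about the setup, not something confirmed above), the Docker-managed usage can be checked separately:
# Disk usage as tracked by Docker: images, containers, volumes, build cache.
docker system df
# Per-image / per-volume breakdown:
docker system df -v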
Indeed, some are full. I'll investigate why the clean-up job wasn't enough or wasn't working.
OK, there is a permission problem with this clean-up job.
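A clean-up job of this kind is typically a periodic Docker prune, and the usual permission problem is that it runs as a user that cannot reach the Docker daemon. A minimal sketch; the schedule, paths and retention are illustrative, not the actual job:
# Sketch of a clean-up script: drop stopped containers, unused images and
# networks older than 48h, then unused volumes. The retention is illustrative.
docker system prune --all --force --filter "until=48h"
docker volume prune --force
# Illustrative cron entry (/etc/cron.d style), run as root so it can reach
# the Docker socket (a wrong user here gives exactly this kind of permission error):
#   0 */6 * * * root /usr/local/bin/ci-cleanup.sh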
@Théo Zimmermann You can now unpause all ci-coq-* runners
I fixed the clean-up job.
@Maxime Dénès please do. I'm not in front of a computer.
Ok, done.