Hi @Maxime Dénès, we are seeing an abnormally high number of runner failures. coqbot is not robust to these situations because it automatically tries to restart the jobs that failed with such errors. I had to cancel all running pipelines.
Abnormally high = all jobs have been failing for the last ~hour, I think.
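coqbot's behaviour here amounts to retrying jobs whose failure reason is a runner system failure rather than a script failure. Below is a minimal sketch of that kind of logic over the GitLab REST API, assuming the jobs endpoint exposes a failure_reason field and using placeholder IDs and a placeholder token; coqbot's actual implementation may differ.
# Sketch only: list failed jobs of a pipeline and retry the ones that failed
# because of the runner, not because of the build script.
# GITLAB_TOKEN, PROJECT_ID and PIPELINE_ID are placeholders.
curl --silent --globoff --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/jobs?scope[]=failed" |
  jq -r '.[] | select(.failure_reason == "runner_system_failure") | .id' |
  while read -r job_id; do
    # POST /projects/:id/jobs/:job_id/retry restarts a single job.
    curl --silent --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
      "https://gitlab.com/api/v4/projects/$PROJECT_ID/jobs/$job_id/retry" > /dev/null
  done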
Ouch. Do you have an example of a failure? I'll investigate
Look for any job here: https://gitlab.com/coq/coq/-/jobs
The only jobs that did not fail are the ones that ran on #18070127 (dQN1ZPjc) coq-docker-machine-driver.
Oh, but that's most of the runners that were OK, then.
It would have been better to just pause the two problematic runners IMHO (the ones which run on pendulum).
Yes, sorry, I had to react urgently.
Maxime Dénès said:
Oh, but that's most of the runners that were OK, then.
What? Not at all!
Didn't you say all jobs taken by coq-docker-machine-driver were succeeding?
That's the CloudStack orchestrator.
OK, maybe I am blinded by the fact that the list of jobs is flooded by the failing ones.
Running docker machine, as the name indicates.
So 7 / 9 runners were ok, IIUC.
But why do all the non-problematic jobs have exactly the same runner ID then?
Because it's one docker-machine runner
OK
Which orchestrates 7 docker runners
OK
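In other words, only one runner is registered in GitLab (with the docker+machine executor), and it provisions short-lived machines to actually run the jobs, which is why every job shows the same runner ID. On the orchestrator host this can be inspected roughly as follows; the sketch assumes standard docker-machine and gitlab-runner installations, and what the commands print depends on the actual setup.
gitlab-runner list    # the single registered docker+machine runner
docker-machine ls     # the machines it has provisioned to run jobs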
I paused the two pendulum runners I added yesterday; they must have a problem.
OK, I will restart the cancelled pipelines then.
You can do one and see if it looks ok
It is a bit unfortunate that I have to add workers on the side, but we are using all our CloudStack quotas and I didn't hear back about the order on external cloud providers (despite asking), so there's not much I can do to add capacity to my docker-machine setup.
Yesterday I fixed one of the bench machines that hadn't been working for a while; I'll try to add runners on it too.
A good thing about how GitLab works is that when I restart cancelled pipelines, it only restarts the failed or cancelled jobs.
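That matches GitLab's pipeline retry semantics: retrying a pipeline, from the UI or the API, re-runs only its failed or cancelled jobs and keeps the ones that already passed. A rough equivalent over the REST API, with placeholder IDs and token:
# Retry a pipeline: only failed/cancelled jobs are re-run.
# GITLAB_TOKEN, PROJECT_ID and PIPELINE_ID are placeholders.
curl --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID/pipelines/$PIPELINE_ID/retry"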
So far, it seems that there are no longer any runner failures.
Pendulum's hard drive is dying, so I'm not surprised we see issues. I'll switch these runners to the bench machines instead.
(keeping only 2 machines dedicated to benches should be enough until we fix these CI capacity issues)
Are you sure we want coqbot to restart pipelines on system failures?
Not sure it is a good idea anymore.
When we started using GitLab CI, runner failures were common.
And this was a way of avoiding the hassle of manually restarting them.
But with our own runners, it doesn't seem such a good thing any longer.
So I'll push a commit disabling this feature, at least for now.
It could still be that we get such spurious failures, but it's hard to tell; we can try and see.
Is it me or are none of the runners running?
https://gitlab.com/coq/coq/-/jobs
All jobs are pending
Oh, never mind, I see that there are running pipelines: https://gitlab.com/coq/coq/-/pipelines?page=1&scope=all
However this is quite the backlog
A big backlog indeed.
I will try to add a bunch of runners today, but I have a few things to finish first.
The list of running jobs can be accessed like this BTW: https://gitlab.com/coq/coq/-/jobs?statuses=RUNNING
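The same data is available through the jobs API, which gives a quick way to gauge the backlog. A sketch with placeholder project ID and token; counts are capped by per_page, so this is only a rough estimate:
# Rough backlog check: how many jobs are running vs pending.
curl --silent --globoff --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID/jobs?scope[]=running&per_page=100" | jq length
curl --silent --globoff --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.com/api/v4/projects/$PROJECT_ID/jobs?scope[]=pending&per_page=100" | jq length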
I should be able to add 12 runners, so it should make a difference.
Is it normal that there are only 7 concurrent jobs running, not 8?
@Maxime Dénès Are we able to see the runners from GitLab?
When they exist, yes, why?
Yes, I set the limit to 7 because there was a race condition when we were getting too close to our quota (which should, in principle, allow 8 in parallel).
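For reference, that kind of cap corresponds to the runner-level limit setting. A sketch of how a runner could be registered with it; the flags and values here are illustrative, not the actual orchestrator configuration:
# Illustrative registration of a docker+machine runner capped at 7 concurrent jobs.
# URL, token, image and description are placeholders.
gitlab-runner register --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "$REGISTRATION_TOKEN" \
  --executor "docker+machine" \
  --docker-image "debian:stable" \
  --limit 7 \
  --description "coq-docker-machine-driver"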
@Ali Caglayan Not sure what you meant, but we have our own runners listed here: https://gitlab.com/groups/coq/-/runners (not sure you can access that page)
Théo Zimmermann said:
Ali Caglayan Not sure what you meant, but we have our own runners listed here: https://gitlab.com/groups/coq/-/runners (not sure you can access that page)
Right, that was the page I was after, thanks.
@Théo Zimmermann You can see the orchestration, its settings and its hacks here: https://gitlab.inria.fr/mdenes/docker-machine-cloudstack
Obviously not a finished product. I believe it took more than a year to develop it for the Inria runners; I had 48 hours, so you can imagine :)
It says two of the runners are offline and there is a play button
Should I press the play button to start them up again?
no
well no harm in asking :)
it depends on whether you want to break the whole CI like this morning
I may be missing context, but what are you trying to do?
Why did the CI break this morning?
https://gitlab.com/groups/coq/-/runners
Two of the runners listed here say they are offline.
I know, I paused them this morning
don't start paused runners unless you know why they are paused
the reason is usually that they're broken
What are you trying to do? Hard to help you if we don't know :)
I'm not doing anything, just asking what's what.
The CI has been in a bit of a crisis mode since the start of the month, so we are prioritizing operations over documentation of the setup. It should be addressed once we stabilize the ship.
"ZO RELAXEN UND WATSCHEN DER BLINKENLICHTEN."
Maxime Dénès said:
I should be able to add 12 runners, so it should make a difference.
done
Nix job here ran out of space: https://gitlab.com/coq/coq/-/jobs/3179427160
The opam-coq-archive jobs also seem to be out of space https://github.com/coq/opam-coq-archive/pull/2353#issuecomment-1279992161
I have paused ci-coq-01-runner-02 and ci-coq-01-runner-03 since they are the runners on which space issues have been observed so far.
I have no idea if there is a shared underlying machine and if I should be pausing more runners linked to the same machine.
ci-coq-01:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 32G 0 32G 0% /dev
tmpfs 6.3G 8.8M 6.3G 1% /run
/dev/sda3 865G 177G 644G 22% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/sda1 240M 5.1M 235M 3% /boot/efi
tmpfs 6.3G 0 6.3G 0% /run/user/650364
tmpfs 6.3G 0 6.3G 0% /run/user/656069
doesn't seem full
That's the host AFAICT
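The host-level df does not necessarily show where the builds actually write. If the ci-coq-* runners are Docker-based on that machine (an assumption about the setup, not something confirmed above), the Docker-managed usage can be checked separately:
# Disk usage as tracked by Docker: images, containers, volumes, build cache.
docker system df
# Per-image / per-volume breakdown:
docker system df -v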
Indeed, some are full. I'll investigate why the clean-up job wasn't enough or wasn't working.
OK, there is a permission problem with this clean-up job.
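A clean-up job of this kind is typically a periodic Docker prune, and the usual permission problem is that it runs as a user that cannot reach the Docker daemon. A minimal sketch; the schedule, paths and retention are illustrative, not the actual job:
# Sketch of a clean-up script: drop stopped containers, unused images and
# networks older than 48h, then unused volumes. The retention is illustrative.
docker system prune --all --force --filter "until=48h"
docker volume prune --force
# Illustrative cron entry (/etc/cron.d style), run as root so it can reach
# the Docker socket (a wrong user here gives exactly this kind of permission error):
#   0 */6 * * * root /usr/local/bin/ci-cleanup.sh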
@Théo Zimmermann You can now unpause all ci-coq-* runners
I fixed the clean-up job.
@Maxime Dénès please do. I'm not in front of a computer.
Ok, done.