Stream: Coq devs & plugin devs

Topic: Runner failures


view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:14):

Hi @Maxime Dénès, we are seeing an abnormally high number of runner failures. coqbot is not robust to these situations because it tries to automatically restart the jobs that failed with such errors. I had to cancel all running pipelines.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:16):

Abnormally high = all jobs have been failing for the last ~hour I think.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:16):

Ouch. Do you have an example of a failure? I'll investigate

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:19):

Look for any job here: https://gitlab.com/coq/coq/-/jobs

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:19):

The only jobs that did not fail are the ones that ran on #18070127 (dQN1ZPjc) coq-docker-machine-driver.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:23):

Oh but that's most of the runners which were ok, then.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:23):

It would have been better to just pause the two problematic runners IMHO (the ones which run on pendulum).

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:23):

Yes, sorry, I had to react urgently.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:24):

Maxime Dénès said:

Oh but that's most of the runners which were ok, then.

What? Not at all!

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:24):

Didn't you say all jobs taken by coq-docker-machine-driver were succeeding?

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:24):

That's the CloudStack orchestrator.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:24):

OK, maybe I am blinded by the fact that the list of jobs is flooded by the failing ones.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:25):

Running docker machine, as the name indicates.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:25):

So 7 / 9 runners were ok, IIUC.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:25):

But why do all the non-problematic jobs have exactly the same runner ID then?

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:25):

Because it's one docker-machine runner

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:26):

OK

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:26):

Which orchestrates 7 docker runners

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:26):

OK

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:26):

I paused the two pendulum runners I added yesterday, they must have a pb.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:27):

OK, I will restart the cancelled pipelines then.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:27):

You can do one and see if it looks ok

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:28):

It is a bit unfortunate that I have to add workers on the side, but we are using all our CloudStack quotas and I didn't hear back from the order on external cloud providers (despite asking), so there's not much I can do to add capacity to my docker-machine stuff.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:30):

Yesterday I fixed one of the bench machines which hadn't been working for a while, I'll try to add runners on it too.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:33):

A good thing about how GitLab works is that when I restart cancelled pipelines, it only does so for failed or cancelled jobs.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:33):

So far, it seems that there are no longer any runner failures.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:34):

Pendulum's hard drive is dying, so I'm not surprised we see issues. I'll switch these runners to the bench machines instead.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:34):

(keeping only 2 machines dedicated to benches should be enough until we fix these CI capacity issues)

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:36):

Are you sure we want coqbot to restart pipelines on system failures?

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:37):

Not sure it is a good idea anymore.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:37):

When we started using GitLab CI, runner failures were common.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:37):

And this was a way of avoiding the hassle of manually restarting them.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:37):

But with our own runners, it doesn't seem such a good thing any longer.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 07:38):

So I'll push a commit disabling this feature, at least for now.
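A minimal sketch of the concern discussed above, not coqbot's actual code: automatic restarts help when runner failures are rare and random, but turn into a restart loop when a runner is broken and every job fails. One way to keep the convenience is to cap attempts and stop retrying when the recent failure rate is high. The function name, thresholds, and the set of failure reasons are all assumptions made for illustration (the reason strings are modelled on GitLab's job failure reasons).

```python
# Hypothetical set of failure reasons treated as "system" failures.
SYSTEM_FAILURES = {"runner_system_failure", "stuck_or_timeout_failure"}


def should_retry(failure_reason: str,
                 attempts: int,
                 recent_failure_rate: float,
                 max_attempts: int = 2,
                 rate_ceiling: float = 0.5) -> bool:
    """Decide whether a failed job should be restarted automatically.

    failure_reason: reason reported for the failed job.
    attempts: how many times this job was already restarted.
    recent_failure_rate: fraction of recent jobs that failed (0.0 to 1.0).
    """
    # Only system failures are candidates; a failing script is a real failure.
    if failure_reason not in SYSTEM_FAILURES:
        return False
    # Cap restarts per job so a broken runner cannot cause an endless loop.
    if attempts >= max_attempts:
        return False
    # Circuit breaker: if most recent jobs fail, the infrastructure is
    # probably broken and restarting only adds load.
    return recent_failure_rate < rate_ceiling


print(should_retry("runner_system_failure", 0, 0.1))
print(should_retry("runner_system_failure", 0, 1.0))
```

With these assumed thresholds, an isolated system failure is retried, while a situation where all jobs are failing is left for a human to look at.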

view this post on Zulip Maxime Dénès (Oct 14 2022 at 07:38):

It could still be that we have such spurious failures, but it's hard to tell, we can try and see.

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:38):

Is it me or are none of the runners running?

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:38):

https://gitlab.com/coq/coq/-/jobs

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:38):

All jobs are pending

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:39):

Oh never mind, I see that there are running pipelines: https://gitlab.com/coq/coq/-/pipelines?page=1&scope=all

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:39):

However this is quite the backlog

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 13:42):

A big backlog indeed.

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:42):

I will try to add a bunch of runners today, but I have a few things to finish first.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 13:42):

The list of running jobs can be accessed like this BTW: https://gitlab.com/coq/coq/-/jobs?statuses=RUNNING
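The same filter is available through GitLab's REST API, which lists a project's jobs and accepts a `scope[]` parameter. The sketch below only builds the request URL; the helper name `jobs_api_url` is made up for illustration, and actually fetching the URL may require an access token depending on the project's visibility settings.

```python
from urllib.parse import quote, urlencode


def jobs_api_url(project_path, scopes, host="https://gitlab.com", per_page=100):
    """Build the URL listing a project's jobs filtered by status scope."""
    # The project is identified by its URL-encoded path, so "/" becomes "%2F".
    project_id = quote(project_path, safe="")
    # Repeating "scope[]" selects several statuses at once.
    params = [("scope[]", scope) for scope in scopes]
    params.append(("per_page", per_page))
    return f"{host}/api/v4/projects/{project_id}/jobs?{urlencode(params)}"


print(jobs_api_url("coq/coq", ["running"]))
```

Passing `["running", "pending"]` instead would give a rough view of the backlog as well as the jobs currently executing.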

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:43):

I should be able to add 12 runners, so it should make a difference.

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 13:48):

Is it normal that there are only 7 concurrent jobs running, not 8?

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:48):

@Maxime Dénès Are we able to see the runners from GitLab?

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:51):

When they exist, yes, why?

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:51):

Yes I set the limit to 7, because there was a race condition when we were getting too close to our quota (which should allow 8 in parallel, in principle)

view this post on Zulip Théo Zimmermann (Oct 14 2022 at 13:52):

@Ali Caglayan Not sure what you meant but we have our own runners listed here: https://gitlab.com/groups/coq/-/runners (not sure you can access that page)

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:53):

Théo Zimmermann said:

Ali Caglayan Not sure what you meant but we have our own runners listed here: https://gitlab.com/groups/coq/-/runners (not sure you can access that page)

Right, that was the page I was after, thanks

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:53):

@Théo Zimmermann You can see the orchestration, its settings and its hacks here: https://gitlab.inria.fr/mdenes/docker-machine-cloudstack

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:55):

Obviously not a finished product. I believe it took more than a year to develop it for the Inria runners, I had 48h, so you can imagine :)

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:55):

It says two of the runners are offline and there is a play button

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:55):

Should I press the play button to start it up again?

view this post on Zulip Gaëtan Gilbert (Oct 14 2022 at 13:55):

no

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:56):

well no harm in asking :)

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:56):

it depends if you want to break the whole CI like this morning

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:56):

I may be missing context, but what are you trying to do?

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:56):

Why did the CI break this morning?

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:56):

https://gitlab.com/groups/coq/-/runners

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:57):

The runners listed here, two of them say they are offline

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:57):

I know, I paused them this morning

view this post on Zulip Gaëtan Gilbert (Oct 14 2022 at 13:57):

don't start paused runners unless you know why they are paused
the reason is usually that they're broken

view this post on Zulip Maxime Dénès (Oct 14 2022 at 13:57):

What are you trying to do? Hard to help you if we don't know :)

view this post on Zulip Ali Caglayan (Oct 14 2022 at 13:58):

I'm not doing anything, just asking what's what

view this post on Zulip Maxime Dénès (Oct 14 2022 at 14:01):

The CI is in a bit of crisis mode since the start of the month, so we are prioritizing operations over documentation of the set-up. It should be addressed once we stabilize the ship.

view this post on Zulip Pierre-Marie Pédrot (Oct 14 2022 at 14:03):

"ZO RELAXEN UND WATSCHEN DER BLINKENLICHTEN."

view this post on Zulip Maxime Dénès (Oct 14 2022 at 22:40):

Maxime Dénès said:

I should be able to add 12 runners, so it should make a difference.

done

view this post on Zulip Ali Caglayan (Oct 16 2022 at 10:41):

Nix job here ran out of space: https://gitlab.com/coq/coq/-/jobs/3179427160

view this post on Zulip Jason Gross (Oct 16 2022 at 15:28):

The opam-coq-archive jobs also seem to be out of space https://github.com/coq/opam-coq-archive/pull/2353#issuecomment-1279992161

view this post on Zulip Théo Zimmermann (Oct 16 2022 at 15:59):

I have paused ci-coq-01-runner-02 and ci-coq-01-runner-03 since they are the runners on which space issues have been observed so far.

view this post on Zulip Théo Zimmermann (Oct 16 2022 at 16:00):

I have no idea if there is a shared underlying machine and if I should be pausing more runners linked to the same machine.

view this post on Zulip Gaëtan Gilbert (Oct 16 2022 at 16:36):

ci-coq-01:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             32G     0   32G   0% /dev
tmpfs           6.3G  8.8M  6.3G   1% /run
/dev/sda3       865G  177G  644G  22% /
tmpfs            32G     0   32G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/sda1       240M  5.1M  235M   3% /boot/efi
tmpfs           6.3G     0  6.3G   0% /run/user/650364
tmpfs           6.3G     0  6.3G   0% /run/user/656069

doesn't seem full

view this post on Zulip Maxime Dénès (Oct 17 2022 at 06:39):

That's the host AFAICT

view this post on Zulip Maxime Dénès (Oct 17 2022 at 06:42):

Indeed, some are full, I'll investigate why the clean-up job wasn't enough or wasn't working
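As the `df -h` output above shows, the host's root filesystem can look healthy (22% used) while jobs still run out of space somewhere else. A minimal sketch of checking several locations rather than only `/`, assuming the storage used by the runners is reachable as ordinary paths; the function name and threshold are hypothetical, and the path in the example is a placeholder.

```python
import shutil


def nearly_full(paths, threshold=0.9):
    """Return (path, used_fraction) for each path at or above the threshold."""
    flagged = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        # Guard against pseudo-filesystems that report a total size of zero.
        fraction = usage.used / usage.total if usage.total else 0.0
        if fraction >= threshold:
            flagged.append((path, round(fraction, 2)))
    return flagged


# Placeholder path; a real check would list the runners' storage locations.
print(nearly_full(["."], threshold=0.9))
```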

view this post on Zulip Maxime Dénès (Oct 17 2022 at 06:43):

Ok, there is a permission problem for this clean-up job

view this post on Zulip Maxime Dénès (Oct 17 2022 at 06:49):

@Théo Zimmermann You can now unpause all ci-coq-* runners

view this post on Zulip Maxime Dénès (Oct 17 2022 at 06:49):

I fixed the clean-up job

view this post on Zulip Théo Zimmermann (Oct 17 2022 at 07:03):

@Maxime Dénès please do. I'm not in front of a computer.

view this post on Zulip Maxime Dénès (Oct 17 2022 at 07:19):

Ok, done.


Last updated: Dec 07 2023 at 17:01 UTC