Stream: Coq devs & plugin devs

Topic: Runner failures on docker-machine runner


Théo Zimmermann (Oct 15 2022 at 10:07):

Maxime Dénès said:

Are you sure we want coqbot to restart pipelines on system failures?

FWIW, I disabled this feature on coqbot yesterday, and today I notice that this pipeline had many runner failures:

https://gitlab.com/coq/coq/-/pipelines/667237493

Théo Zimmermann (Oct 15 2022 at 10:08):

(with the docker runner)

Théo Zimmermann (Oct 15 2022 at 10:08):

I don't know if it was some exceptional case or if it is a common occurrence that we are just not used to seeing thanks to the coqbot feature.

Théo Zimmermann (Oct 15 2022 at 10:13):

This one also had three such runner failures on the docker machine: https://gitlab.com/coq/coq/-/pipelines/667321297

Théo Zimmermann (Oct 15 2022 at 10:22):

Yet another similar failure here: https://gitlab.com/coq/coq/-/jobs/3175950466

Théo Zimmermann (Oct 15 2022 at 10:25):

Yet another pipeline with lots of such failures: https://gitlab.com/coq/coq/-/pipelines/667133094

Théo Zimmermann (Oct 15 2022 at 10:27):

And yet another one: https://gitlab.com/coq/coq/-/pipelines/667059832

Théo Zimmermann (Oct 15 2022 at 10:28):

Given the alarming rate of these runner failures, I am tempted to just reactivate the coqbot feature until these are investigated / solved.

Théo Zimmermann (Oct 15 2022 at 10:28):

I just wish there was a simple way of detecting when coqbot is going into an infinite loop of job retries.

Gaëtan Gilbert (Oct 15 2022 at 10:35):

do reactivate imo
why was it deactivated?

Théo Zimmermann (Oct 15 2022 at 10:55):

Done. It was deactivated following Maxime's question quoted above and my response quoted below:

Not sure it is a good idea anymore.
When we started using GitLab CI, runner failures were common.
And this was a way of avoiding the hassle of manually restarting them.
But with our own runners, it doesn't seem such a good thing any longer.
So I'll push a commit disabling this feature, at least for now.

Théo Zimmermann (Oct 15 2022 at 10:57):

FWIW, the main issue with this feature is when runners are consistently failing and coqbot is retrying jobs like crazy. This is what happened yesterday morning. In this case, the only solution is to cancel pipelines or jobs to stop the loop.

Théo Zimmermann (Oct 15 2022 at 10:58):

I haven't yet come up with a good idea of how we could detect and prevent this from happening.

Ali Caglayan (Oct 15 2022 at 14:16):

What about retrying once and then giving up?

Théo Zimmermann (Oct 15 2022 at 19:21):

Yes, that's what I would like to do, but how do you know if you have retried once or many times?

Gaëtan Gilbert (Oct 15 2022 at 19:26):

do we have access to the list of failed jobs from the bot?

Gaëtan Gilbert (Oct 15 2022 at 19:27):

I mean the list for the current pipeline not the global list

Théo Zimmermann (Oct 16 2022 at 09:44):

Indeed, thanks for the tip. I figured out how to do this with the GitLab GraphQL API: https://github.com/coq/bot/issues/239

Théo Zimmermann (Oct 19 2022 at 07:56):

There were again lots and lots of runner failures with the docker-machine runners yesterday evening. See the retried jobs in https://gitlab.com/coq/coq/-/jobs/3192171830.

Théo Zimmermann (Oct 19 2022 at 16:58):

It actually looks like the docker-machine runners are systematically failing currently. So I'll pause them until this can be investigated (cc @Maxime Dénès).

Théo Zimmermann (Oct 19 2022 at 16:59):

It's fortunate that we have two kinds of runners: both seem to fail repeatedly, but not at the same time!

Théo Zimmermann (Oct 19 2022 at 17:00):

And FTR, now coqbot does try to restart jobs that failed with runner failures but it stops after three retries.
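
For readers following along, here is a minimal sketch of the "retry, but give up after three retries" idea discussed above, written in Python for illustration only. It is not coqbot's actual code: the `Job` type, the `should_retry` function, and the assumption that every earlier attempt of a job shows up as its own entry in the pipeline's job list are all hypothetical. The `runner_system_failure` string is assumed to be the failure reason GitLab reports for runner-side failures.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumption: failure reason reported by GitLab for runner-side failures.
RUNNER_FAILURE = "runner_system_failure"
# Cap mentioned in the thread: stop after three retries.
MAX_RETRIES = 3


@dataclass(frozen=True)
class Job:
    """Hypothetical, simplified view of one job attempt in a pipeline."""
    name: str
    status: str               # e.g. "success" or "failed"
    failure_reason: str = ""  # empty when the job did not fail


def should_retry(failed: Job, pipeline_jobs: Sequence[Job],
                 max_retries: int = MAX_RETRIES) -> bool:
    """Decide whether a failed job should be retried automatically.

    `pipeline_jobs` is assumed to list every attempt of every job in the
    current pipeline, including `failed` itself.
    """
    # Only runner failures are retried; genuine job failures are left alone.
    if failed.status != "failed" or failed.failure_reason != RUNNER_FAILURE:
        return False
    # Count all attempts of this job that died with a runner failure.
    attempts = sum(
        1
        for job in pipeline_jobs
        if job.name == failed.name
        and job.status == "failed"
        and job.failure_reason == RUNNER_FAILURE
    )
    # The first attempt is not a retry, so retries so far = attempts - 1.
    return attempts - 1 < max_retries
```

Counting within the current pipeline only (rather than globally) keeps the check stateless: the bot needs no memory of what it retried before, which is what breaks the infinite retry loop.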

Gaëtan Gilbert (Oct 19 2022 at 17:01):

we used to get tons of these failures on the shared runners, did that stop before we moved to our runners?

Théo Zimmermann (Oct 19 2022 at 18:44):

Yes, I think this was less frequent, but I don't have numbers. In any case, my problem here is not that we are seeing some failures, but that they seem to be repeatable.

Gaëtan Gilbert (Oct 19 2022 at 19:09):

I definitely remember seeing significant strings of failed jobs a while back
but for a long time I haven't paid attention to the job list where they can be seen

Théo Zimmermann (Oct 19 2022 at 20:03):

I have observed occurrences of similar runner failures on docker-mathcomp (e.g., https://gitlab.com/math-comp/docker-mathcomp/-/jobs/3198410447) but there coqbot is forbidden to retry the failed job (code 403). Maybe it doesn't have an elevated enough permission level? cc @Erik Martin-Dorel

Maxime Dénès (Oct 20 2022 at 12:10):

I had a day off yesterday, will check what happens today

Théo Zimmermann (Oct 20 2022 at 16:20):

Note that coq-docker-machine-driver was still paused today. I leave it to you to unpause it and monitor what happens.

Erik Martin-Dorel (Oct 20 2022 at 23:17):

Hi, thanks for the ping, and sorry for not replying faster.
I was precisely planning to double-check the mathcomp CI config this week; so I'll let you know, @Théo Zimmermann


Last updated: Feb 02 2023 at 13:03 UTC