Maxime Dénès said:
Are you sure we want coqbot to restart pipelines on system failures?
FWIW, I disabled this feature on coqbot yesterday, and today I notice that this pipeline had many runner failures:
(with the docker runner)
I don't know if it was some exceptional case or if this is some common occurrence that we are just not used to seeing thanks to the coqbot feature.
This one also had three such runner failures on the docker machine: https://gitlab.com/coq/coq/-/pipelines/667321297
Yet another similar failure here: https://gitlab.com/coq/coq/-/jobs/3175950466
Yet another pipeline with lots of such failures: https://gitlab.com/coq/coq/-/pipelines/667133094
And yet another one: https://gitlab.com/coq/coq/-/pipelines/667059832
Given the alarming rate of these runner failures, I am tempted to just reactivate the coqbot feature until these are investigated / solved.
I just wish there was a simple way of detecting when coqbot is going into an infinite loop of job retries.
do reactivate imo
why was it deactivated?
Done. It was deactivated following Maxime's question quoted above and my response quoted below:
Not sure it is a good idea anymore.
When we started using GitLab CI, runner failures were common.
And this was a way of avoiding the hassle of manually restarting them.
But with our own runners, it doesn't seem such a good thing any longer.
So I'll push a commit disabling this feature, at least for now.
FWIW, the main issue with this feature is when runners are consistently failing and coqbot is retrying jobs like crazy. This is what happened yesterday morning. In this case, the only solution is to cancel pipelines or jobs to stop the loop.
I haven't yet come up with a good idea of how we could detect and prevent this from happening.
What about retrying once and then giving up?
Yes, that's what I would like to do, but how do you know if you have retried once or many times?
do we have access to the list of failed jobs from the bot?
I mean the list for the current pipeline not the global list
Indeed, thanks for the tip. I figured out how to do this with the GitLab GraphQL API: https://github.com/coq/bot/issues/239
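For reference, a query along these lines can list the jobs of a single pipeline together with their status (the exact field and argument names here are my best recollection of the GitLab GraphQL schema, not the actual coqbot code, and the sample response is invented for illustration):

```python
from collections import Counter

# Hypothetical GraphQL query; field/argument names assumed from the
# GitLab GraphQL schema. The real implementation is tracked in
# https://github.com/coq/bot/issues/239.
PIPELINE_JOBS_QUERY = """
query ($fullPath: ID!, $pipelineIid: ID!) {
  project(fullPath: $fullPath) {
    pipeline(iid: $pipelineIid) {
      jobs {
        nodes { name status retried }
      }
    }
  }
}
"""

def failed_job_counts(payload: dict) -> Counter:
    """Count, per job name, how many (possibly retried) jobs ended FAILED."""
    nodes = payload["data"]["project"]["pipeline"]["jobs"]["nodes"]
    return Counter(j["name"] for j in nodes if j["status"] == "FAILED")

# Invented example response, shaped like a GraphQL reply:
sample = {"data": {"project": {"pipeline": {"jobs": {"nodes": [
    {"name": "build", "status": "FAILED", "retried": True},
    {"name": "build", "status": "FAILED", "retried": True},
    {"name": "build", "status": "RUNNING", "retried": False},
    {"name": "doc", "status": "SUCCESS", "retried": False},
]}}}}}

print(failed_job_counts(sample)["build"])  # → 2
```

Counting failed jobs with the same name in the current pipeline is exactly what answers "how do you know if you have retried once or many times?" above.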
There were again lots and lots of runner failures with the docker-machine runners yesterday evening. See the retried jobs in https://gitlab.com/coq/coq/-/jobs/3192171830.
It actually looks like the docker-machine runners are systematically failing currently. So I'll pause them until this can be investigated (cc @Maxime Dénès).
It's fortunate that we have two kinds of runners: both seem to fail repeatedly, but luckily not at the same time!
And FTR, now coqbot does try to restart jobs that failed with runner failures but it stops after three retries.
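A sketch of that stop-after-three policy (a simplification of whatever coqbot actually does; the job-record format and the `failure_reason` value are assumptions for illustration):

```python
MAX_RETRIES = 3  # coqbot gives up after three retries of the same job

def should_retry(job_name: str, pipeline_jobs: list) -> bool:
    """Retry a runner-failed job only if it has failed fewer than
    MAX_RETRIES times in this pipeline. Each retry leaves a failed job
    with the same name behind, so counting failures counts retries."""
    failures = sum(
        1 for j in pipeline_jobs
        if j["name"] == job_name
        and j["status"] == "failed"
        and j.get("failure_reason") == "runner_system_failure"
    )
    return failures < MAX_RETRIES

# Two runner failures so far: one more retry is still allowed.
jobs = [
    {"name": "build", "status": "failed",
     "failure_reason": "runner_system_failure"},
    {"name": "build", "status": "failed",
     "failure_reason": "runner_system_failure"},
]
print(should_retry("build", jobs))  # → True
```

The key point is that the retry budget is recovered from the pipeline's own job list rather than stored anywhere, so consistently failing runners can no longer send the bot into an unbounded retry loop.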
we used to get tons of these failures on the shared runners, did that stop before we moved to our runners?
Yes, I think this was less frequent, but I don't have numbers. Anyway, my problem here is not that we are seeing some failures, but that they seem to be repeatable.
I definitely remember seeing significant strings of failed jobs a while back
but I haven't looked at the job list where they can be seen in a long time
I have observed similar runner failures on docker-mathcomp (e.g., https://gitlab.com/math-comp/docker-mathcomp/-/jobs/3198410447), but there coqbot is forbidden to retry the failed job (code 403). Maybe it doesn't have a high enough permission level? cc @Erik Martin-Dorel
I had a day off yesterday, will check what happens today
Note that coq-docker-machine-driver was still paused today. I leave it to you to unpause it and monitor what happens.
Hi, thanks for the ping, and sorry for not replying faster.
I was precisely planning to double-check the mathcomp CI config this week; so I'll let you know, @Théo Zimmermann
Last updated: Feb 02 2023 at 13:03 UTC