I'm going to remove the coq-windows runners from gitlab as
I paused coq-docker-machine-driver as we're getting tons of "runner system failure" on it
It could actually be a gitlab-runner regression: they made a bug-fix release a few days ago that seems related.
I'm getting a lot of space-related issues on runners, which makes preparing backports for Coq 8.17.1 difficult.
It looks like it happens specifically with ci-coq-01-runner-03. Should we disable this runner?
You can disable it. If needed, I think I can still add more ephemeral runners.
I wanted to do so and then discovered that it exists twice:
Is it equivalent if I disable it in one place or the other?
are you sure it is the same?
No, but it has the same name, and I cannot tell which of the two it was that keeps failing.
Are you sure it has the same name?
I have limited internet access, but I see only ci-coq-03 on the repo (vs ci-coq-01-runner-03 in the group).
Oh indeed, you are correct.
So it's the latter that I should disable (the one in the group).
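If it helps to disambiguate in the future, group and project runners can be compared by their numeric ids via the GitLab REST API. A sketch, assuming a token with sufficient scope; the host and group/project paths (`gitlab.example.com`, `coq`, `coq%2Fcoq`) are placeholders, not the real instance:

```shell
# List runners registered to the group vs. the project; two entries
# with the same description but different ids are different runners.
curl --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.example.com/api/v4/groups/coq/runners"
curl --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.example.com/api/v4/projects/coq%2Fcoq/runners"

# Pause a runner by id (equivalent to disabling it in the UI,
# on recent GitLab versions; older ones use active=false instead):
curl --request PUT --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  --data "paused=true" \
  "https://gitlab.example.com/api/v4/runners/<runner_id>"
```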
If you want the background behind these names: we have 4 physical machines, ci-coq-01, ..., ci-coq-04.
They are meant to be used for the bench, but because of the runner shortage, we turned ci-coq-01 and ci-coq-02 into regular runners, with VMs on top.
I have disabled the coq-docker-machine-driver runner after seeing many systematic runner failures with this runner today.
The errors are always the same:
```
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Job failed (system failure): exit status 1
```
I have also retried lots of pipelines in which many jobs had failed for this reason.
I also had to turn off the test-docker-machine-driver for the same reason.
And FTR this means that jobs modifying the Docker image will be stuck and will require temporarily activating shared runners for them to be unstuck.
I have also disabled coq-01-runner-04, which kept failing with "no space left on device".
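For future incidents of this kind, a quick way to confirm that a runner host really is out of space before disabling it is Python's `shutil.disk_usage`; a minimal sketch, where the 5 GiB threshold is an arbitrary example, not a GitLab default:

```python
import shutil


def free_space_gib(path="/"):
    """Return free disk space at `path` in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def is_low_on_space(path="/", min_free_gib=5.0):
    """True when the filesystem holding `path` has less than
    `min_free_gib` GiB free (threshold is an arbitrary example)."""
    return free_space_gib(path) < min_free_gib


print(f"free: {free_space_gib('/'):.1f} GiB, low: {is_low_on_space('/')}")
```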
we have only 2 active non-bench runners
@Maxime Dénès are you on holidays?
I'm not, I'll have a look
Ok, I restored the Cloud-based runners. The root cause seems to be some CloudStack API timeouts on Friday. I'm trying to understand better.
Could be related to the GitLab maintenance?
I don't think so, but I requested more info from the team in charge (in particular their logs around the event)
Last updated: Dec 07 2023 at 06:38 UTC