What is going on with the build:base job? It keeps getting restarted; there are already 10 instances of it in https://gitlab.com/coq/coq/-/jobs/1359488020
Looks like it is timing out. I've cancelled the last running instance to avoid it being restarted again.
Argh, I misread.
That was not the issue.
The job is restarted because the last line of the log is "ERROR: Job failed: exit code 137", which usually denotes an issue with the runner.
Anyway, manually cancelling is the thing to do in cases of looping restarts.
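By the usual POSIX shell convention, an exit status above 128 means the process was killed by signal number (status minus 128). So 137 corresponds to signal 9, SIGKILL, which on CI runners is often the kernel's OOM killer or the runner itself terminating the job rather than a failure in the job's own commands. A minimal Python sketch of that decoding (the helper name describe_exit_code is hypothetical, not part of GitLab or any CI tooling):

```python
import signal

def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    Assumes the POSIX shell convention: a status above 128 means the
    process was terminated by signal (status - 128).
    """
    if code > 128:
        sig = code - 128
        try:
            name = signal.Signals(sig).name  # e.g. 9 -> "SIGKILL" on POSIX
        except ValueError:
            name = f"unknown signal {sig}"
        return f"killed by {name}"
    return f"exited with status {code}"

print(describe_exit_code(137))  # killed by SIGKILL
print(describe_exit_code(1))    # exited with status 1
```

Note that SIGKILL cannot be caught, so the job gets no chance to log why it died; that is consistent with the log ending abruptly on the runner's own error line.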
Sure, but then what? Should I wait a few days and then restart the job?
There is something fishy: several pipelines are experiencing the same recurring failure in the same build:base job. Could a recent change have broken this job when it runs on a subset of the runners?
Five pipelines have been affected so far (323379491, 323309344, 323449544, 323400481, 323379222), all of them within the last couple of hours. Unless this keeps happening tomorrow, I guess we can consider it a transient, spurious issue with the GitLab shared runners.
Given the last command that was run, could it be that we are hitting the command-line length limit?
(At a quick glance, this command line is a few bytes over 32 KB.)
I think this is also responsible for coqbot posting about minimization jobs every 15 minutes...
This is no better after waiting a few days: I am still unable to get past the build:base job. And my theory about hitting the command-line limit is wrong, as the runner machines have a 2 MB limit.
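To test the command-line theory directly, one can compare the byte length of the arguments with the system's ARG_MAX, which Python exposes as os.sysconf("SC_ARG_MAX") on POSIX systems. A rough sketch, assuming the helper names below (they are hypothetical); it counts each argument plus its NUL terminator and ignores the environment strings and pointer array that execve() also charges against the limit:

```python
import os
from typing import Optional, Sequence

def command_line_bytes(argv: Sequence[str]) -> int:
    """Approximate the bytes execve() needs for argv: each argument's
    UTF-8 encoding plus its terminating NUL byte. The environment and
    the pointer array also count toward ARG_MAX but are ignored here."""
    return sum(len(arg.encode("utf-8")) + 1 for arg in argv)

def fits_arg_max(argv: Sequence[str], limit: Optional[int] = None) -> bool:
    """Return True if argv fits under `limit` bytes (defaults to ARG_MAX)."""
    if limit is None:
        limit = os.sysconf("SC_ARG_MAX")  # POSIX-only; typically 2 MiB on Linux
    return command_line_bytes(argv) <= limit

# Hypothetical command line slightly over 32 KiB: "cmd" plus 330 args of 99 bytes.
argv = ["cmd"] + ["x" * 99] * 330
print(command_line_bytes(argv))                   # 33004
print(fits_arg_max(argv, limit=32 * 1024))        # False
print(fits_arg_max(argv, limit=2 * 1024 * 1024))  # True
```

A command line a few bytes over 32 KB would only be a problem on a system with a limit that low; with a 2 MB ARG_MAX it fits comfortably, which is why the theory does not hold here.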
I've cancelled the build:base jobs currently running, since they seem to be starving other queued jobs, at least a little.
Maybe we should disable the shared runners? Has the error happened with non-shared runners?
Maybe https://github.com/coq/coq/pull/14533 will help.
Last updated: Oct 13 2024 at 01:02 UTC