Stream: Coq devs & plugin devs

Topic: CI runners


view this post on Zulip Gaëtan Gilbert (Feb 08 2023 at 15:57):

Are we supposed to have this many offline runners? https://gitlab.com/groups/coq/-/runners
also we should do those recommended updates at some point

view this post on Zulip Gaëtan Gilbert (Apr 04 2024 at 07:48):

modified

view this post on Zulip Notification Bot (Apr 04 2024 at 07:53):

4 messages were moved here from #Coq devs & plugin devs > coq call by Théo Zimmermann.

view this post on Zulip Notification Bot (Apr 04 2024 at 07:54):

2 messages were moved here from #Coq devs & plugin devs > coq call by Théo Zimmermann.

view this post on Zulip Notification Bot (Apr 04 2024 at 07:55):

A message was moved here from #Coq devs & plugin devs > coq call by Théo Zimmermann.

view this post on Zulip Notification Bot (Apr 04 2024 at 07:56):

32 messages were moved here from #Coq devs & plugin devs > coq call by Théo Zimmermann.

view this post on Zulip Andres Erbsen (Apr 09 2024 at 03:06):

My test runner now shows up on https://gitlab.inria.fr/coq/coq/-/settings/ci_cd; I paused it just in case but it might be good to go -- feel free to unpause and see what happens. Here are notes on how I set it up.

view this post on Zulip Gaëtan Gilbert (Apr 09 2024 at 06:45):

what are the specs of the machine like?
is the runner allowing parallel jobs or is it 1 / machine?

view this post on Zulip Andres Erbsen (Apr 09 2024 at 12:00):

This is a beefy desktop (>2x faster than GitHub Actions), but I am using it for testing just because I have an easier time messing around with it -- actual runners will be nice too but not that nice. I didn't do anything to configure the number of jobs, but I think we should try multiple, maybe 16 (dedicating >4GB and a physical core to each).
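
(A rough sketch of the sizing arithmetic behind "a physical core and >4GB per job", written in Python. The function name and the machine figures in the example are hypothetical; the actual specs of the desktop are not given in this thread.)

```python
def max_parallel_jobs(physical_cores: int, ram_gb: int, ram_per_job_gb: int = 4) -> int:
    """Return how many jobs fit if each job gets one physical core
    and at least `ram_per_job_gb` GB of RAM."""
    if physical_cores < 0 or ram_gb < 0 or ram_per_job_gb <= 0:
        raise ValueError("expected non-negative resources and a positive per-job budget")
    jobs_by_ram = ram_gb // ram_per_job_gb
    # The scarcer resource decides the job count.
    return min(physical_cores, jobs_by_ram)


if __name__ == "__main__":
    # Hypothetical machine: 16 physical cores, 96 GB of RAM.
    print(max_parallel_jobs(physical_cores=16, ram_gb=96))
```

Under these assumed numbers the core count is the binding constraint, so the result would be usable as a starting value for the runner's job limit.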

view this post on Zulip Gaëtan Gilbert (Apr 09 2024 at 19:36):

(I'm on holiday this week so if you want to test do it yourself)

view this post on Zulip Andres Erbsen (Apr 09 2024 at 20:03):

Ok, enjoy! (I may just wait until next week in that case; I am not sure I'd be equipped to react appropriately if something goes wrong and it's not like I have a shortage of tasks for this week.)

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:07):

I unpaused it for testing

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:07):

Seems to be doing something

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:08):

https://gitlab.inria.fr/coq/coq/-/jobs/4239889
https://gitlab.inria.fr/coq/coq/-/jobs/4239888

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:10):

Looks like the two jobs had lock contention between them and one of them failed as a result?

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:12):

Log here; I am clicking retry.

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:17):

Another failure with a concurrency smell: https://gitlab.inria.fr/coq/coq/-/jobs/4240661

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:23):

back to paused then

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:25):

what does /etc/gitlab-runner/config.toml look like?

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:25):

:thumbs_up:
https://gitlab.inria.fr/coq/coq/-/jobs/4240664#L1321 was another failure right before pause, I restarted, but it doesn't look like concurrency

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:26):

looks like concurrency to me, the binary got deleted just before getting execve'd or something

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:26):

Is there a way to limit the runner to 1 job at a time? (to see whether everything goes well in that case)

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:27):

in the toml, the concurrent setting in the global section: https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-global-section

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:29):

...it already says concurrent = 1 on the first line :confounded:

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:29):

we should put the runner on another repo for further testing so we can avoid disrupting regular jobs

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:32):

https://docs.gitlab.com/runner/configuration/autoscale.html#limit-the-number-of-vms-created-by-the-docker-machine-executor does this read to you like concurrent does not apply to what we are doing?

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:33):

Are all the current runners doing one job at a time (so we'd be unsurprised if the scripts don't work concurrently)?

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:33):

are you using docker executor or docker machine executor?

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:34):

I don't know, script here says --executor docker.

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:34):

Gaëtan Gilbert said:

what does /etc/gitlab-runner/config.toml look like?

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:35):

for instance I have

concurrent = 6
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "ci-coq-02"
  url = "https://gitlab.inria.fr"
  token = "glrt-SECRET"
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "alpine"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:37):

concurrent = 1
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "vm2"
  url = "https://gitlab.inria.fr"
  id = 7401
  token = "glrt-4-xsrAsQexY6DZLZSkus" # now removed from project
  token_obtained_at = 2024-04-09T02:51:10Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.docker]
    tls_verify = false
    image = "70b31504fd78"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
    network_mtu = 0

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:46):

for some reason both jobs said

Running on runner-4-xsrasqe-project-4504-concurrent-0 via vm...

on the other runners we get separate concurrent-$X numbers for parallel jobs
not sure if this matters
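
(A small Python sketch for spotting this in job logs. It parses the "Running on runner-...-concurrent-N" line and groups jobs that report the same slot name. The regex is inferred from the single log line quoted above, not from a documented format, and the job labels and helper names are hypothetical.)

```python
import re
from collections import defaultdict

# Pattern inferred from the one log line quoted in this thread (an assumption).
SLOT_RE = re.compile(
    r"Running on (?P<host>runner-(?P<token>.+)-project-(?P<project>\d+)"
    r"-concurrent-(?P<slot>\d+))"
)


def slot_of(line: str):
    """Return (slot host name, slot index), or None if the line does not match."""
    match = SLOT_RE.search(line)
    if match is None:
        return None
    return match["host"], int(match["slot"])


def shared_slots(job_lines: dict) -> dict:
    """Map each slot host name to the jobs reporting it, keeping only slots
    that were reported by more than one job."""
    by_host = defaultdict(list)
    for job_id, line in job_lines.items():
        parsed = slot_of(line)
        if parsed is not None:
            by_host[parsed[0]].append(job_id)
    return {host: jobs for host, jobs in by_host.items() if len(jobs) > 1}


if __name__ == "__main__":
    line = "Running on runner-4-xsrasqe-project-4504-concurrent-0 via vm..."
    # "job-a" and "job-b" are placeholder labels for two overlapping jobs.
    print(shared_slots({"job-a": line, "job-b": line}))
```

Two jobs that overlap in time and land in the same slot would show up in the returned mapping; jobs with distinct concurrent-$X numbers would not.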

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:48):

This looks wrong

image = "70b31504fd78"

we should try image = "alpine"

do you have a repo set up?

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:49):

I do not but I can probably create one.

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:50):

Or not. "Limit reached You cannot create projects in your personal namespace. Contact your GitLab administrator."

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:50):

I think you can fork but not create new

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:51):

This was a fork attempt

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:53):

I guess you can make a repo on gitlab.com instead (I recommend forking the old coq repo instead of making a fresh repo)

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:53):

Re image, yes, the one I use is genuinely different from Alpine -- it's from the Ubuntu-based Dockerfile in the repo. I don't quite see how this would lead to the issues, but definitely worth a try.

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:54):

it might be sharing the image because it's local
IDK much about docker

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:55):

Yeah neither do I.

view this post on Zulip Andres Erbsen (Apr 15 2024 at 14:57):

I am hesitant about testing with a parallel account and so on because I am not sure how easy it is to set up & how well potential success would carry over to the target environment. With what we started with I could see all failed jobs in the CI and retry them, which doesn't seem all that disruptive in the end. Perhaps we could continue with that?

view this post on Zulip Gaëtan Gilbert (Apr 15 2024 at 14:58):

I don't see what would be different between a test coq repo vs the real one

view this post on Zulip Andres Erbsen (Apr 15 2024 at 15:20):

Well looks like I got the same concurrency error (with --docker-image alpine) and some additional permissions errors.

view this post on Zulip Andres Erbsen (Apr 15 2024 at 15:36):

when I run with --log-level debug I see that the concurrency limit is read from the configuration file and printed back correctly:

concurrent: 1
checkinterval: 0
loglevel: null
logformat: null
user: ""
runners:
- name: vm
  limit: 0
  outputlimit: 0

view this post on Zulip Andres Erbsen (Apr 15 2024 at 15:36):

I gotta go now; let me know if you have any ideas for either issue here.


Last updated: Jun 23 2024 at 01:02 UTC