@Michael Soegtrop I recently got the following error on Windows: https://coq.gitlab.io/-/coq/-/jobs/727083266/artifacts/artifacts/buildlogs/coq-local-make_err.txt
Does this mean that the disk on the VM is full? I'm going to try restarting the build to see how it goes.
Does this mean that the disk on the VM is full? I'm going to try restarting the build to see how it goes.
Likely yes - I will do a cleanup run tomorrow morning.
Thanks! Let me know when this is done. FTR the second run did go much further but also eventually failed with a similar "There is not enough space on the disk." issue.
@Théo Zimmermann : I did a cleanup of all runners and they should be in a reasonable state. But something weird is going on - sometimes when deleting cygwin a number of files (about 300) with 0 size remain. I can't delete these in any way - even after taking ownership of the files and resetting the ACLs. This is on all servers. Disk size wise it doesn't hurt but meanwhile most servers have 10 extra cygwin folders.
At some point I want to reboot the servers and see if this is a sort of zombie file. When would be a good time? Weekend? Or can we have an announced server maintenance time?
Probably toward the end of a weekend yes. In general, this would be the time with the fewest pipelines running.
And it's not a big deal if we have to restart some failing jobs anyway.
@Michael Soegtrop Perhaps there's some process lying around which has a lock on those files? Windows often forbids me from removing files that I own, unless I close whatever program they're open in. Sometimes I have to use a windows file unlocker program; I can dig it up for you if you want and perhaps there's a command line version?
@Jason Gross , yes something to check - I was in a bit of a hurry this morning. SysInternals ProcExplorer has a function to search for processes which have a file open. I did check that no cygwin services are installed.
@Michael Soegtrop Looks like we're having a "not enough space on disk" error again (https://gitlab.com/coq/coq/-/jobs/831601792).
I will do a cleanup run now.
Same issue as last time - cygwin creates some funny stuff in the file system which is almost impossible to delete (even with dedicated sysinternals tools). Essentially they create system reparse points with cygwin syntax path which is invalid for Windows. Theoretically one could have a dedicated handler for file system reparse points and possibly cygwin registers one but a hex dump of the binary contents of the reparse point doesn't look like it (it only has a few bytes besides the bogus path). I see what I can do to get rid of this crap.
Interestingly locally I don't have issues with this - it might be that the OS of the runners is too old for such fancy stuff.
OK, I now have a way to remove these files - it takes a TAKEOWN, ICACLS, FSUTIL REPARSEPOINT DELETE and a DEL to get rid of these files. I put new maintanance scripts on all runners which remove the folders reliably. We should use the same script in the CI jobs as well to cleanup.
I will also think about running the maintenance script as a regular job - it automatically deletes all folders older than 8 hours.
P.S.: All runners should be fine now
Thanks a lot!
@Michael Soegtrop It is surprising because your last maintenance round was not so long ago, but it seems that our Windows CI jobs are failing again due to disks being full!
@Théo Zimmermann : sorry, just saw this now - will do a cleanup soon. This is fairly random and not time proportional - debris remains if the network communication between the CI servers and gitlab is shaky. Usually such a situation is for a day or so and then it works fine for a few weeks.
I will do a cleanup later today. Please send me an email next time.
Thanks! Will do.
All clean again. There was a heavy hit on Nov 12th and one on Nov 14th. The new scripts work well - should be good to run onece per day (they clean up everything which is older than 8 hours).
This morning I see many windows workers stuck: "This job is stuck because you don't have any active runners online or available with any of these tags assigned to them: windows-inria "
Any ideas? CC @Maxime Dénès
This page claims that all our Windows VMs are up and running: https://ci.inria.fr/project/coq/slaves
But this page claims that they are all offline: https://gitlab.com/coq/coq/-/settings/ci_cd#js-runners-settings
where is the reboot button? ;-)
There is a stop button. Should we try to click on it?
I'll try with coq-windows-untrusted5
"Stop request sent. Please Wait, this action can take couple of minutes."
"Start request sent. Please Wait, this action can take couple of minutes."
The gitlab CI shows info about last time the worker was seen online, many of them did not respond for days!
just hover on the gray circle
anyway, untrusted 5 is back
I'm rebooting them all
thanks for the pointers!
There are still 2 workers offline, not windows one, the VM-3227fa4a-7673-4b6a-9e0e-eae19cf1fa69 and VM-07e3a7b9-6845-48d2-a942-2899dd2e5f73 . I restarted them too, but without success. They seem linux workers, no idea what they are there for.
No idea either. It's not really useful for us to have additional Linux workers (we do well with pyrolyse and roquableu). We could probably delete them.
can we replace them by windows workers? would it be useful?
I mean, we cap the max CPUs and RAM there. If we remove 2 VMs we have some spare resources. Maybe we can make the other workers "larger"?
Probably yes, though @Maxime Dénès would know better.
Just wanted to ask if this is all up and running again.
Yes, everything looks good right now.
Actually, no we have three runners that are shown as being offline again !
Did you find out what the root cause was last time? Network issues? In general the runners lose connection to Gitlab from time to time for a few hours - it has always been like this. That's the reason why debris cleanup is required because the runners kill the jobs when they loose connection. One could create a log file with a ping to gitlab every minute or so and analyze this.
Interesting! No, we didn't investigate
I still have a disk full https://gitlab.com/coq/coq/-/jobs/869917293
What should I do to clean it up? Is there a script?
Michael said he has a script but I don't recall if it's anywhere publicly accessible. Last time, he said we should e-mail him in case of problems.
The scripts are installed on all servers.
It will show a list of what to delete and ask for confirmation and then do it. It deletes everything which is older than 8 hours. Don't get confused by the "can't remove messages" it first does a quick run with fast tools and then a second run to clean up the remainign debris with more elaborate methods.
Btw.: I meanwhile think that the reason for the "after network issue debris removal issues" is that the maintenance cygwin and the runner cygwins have a different version and use the filesystem in different ways. An update might fix this - will look into it these days. Anyway, the scripts which are there are safe.
P.S.: I am running the scripts now
@Enrico Tassi : when doing heavy testing on Windows, it might make sense to run the scripts every other day - takes 5 minutes. Are you able to remote desktop easily? I have a script for that as well which also creates the (for me) required ssh tunnels.
Auf cause I still have it on my plate to run the scripts automatically once per day ...
OK all machines clean again.
One more note: some of the machines are with the Windows and the gitlab runner, .., alone already at 80% disk. Windows seems to grow substantially over time cause of the updates. I tried to run Windows disk cleanup, but this didn't help much (50 MB?). I can try some 3rd party tools although this might be a bit dangerous. A few years back a cleaned up machine was at 60%.
From the web interface I see
Primary storage : 400 / 500 GiB Secondary storage : 80.28516 / 500 GiB
I don't know how, but maybe we can just use these extra 100 + 420 GiB ?
(cleaning periodically is also good, but this could give us more margin)
If you want to get more actions / information on slaves (add a disk, create a template, etc.), you can access CloudStack using the same credentials as the CI portal. The Domain must be ci/coq
I just had to restart coq-windows-untrusted, coq-windows-untrusted2 and coq-windows-untrusted-i01 since gitlab did not see as online.
Did you recall doing anything special on these 3?
@Enrico Tassi : no, I run the same procedures on all machines. But as I said before it is "normal" that the connection is lost from time to time - usually such a situation lasts for on hour or two and goes back to normal automatically. It has always been like this. Why this is I can't tell. One could record wireshark logs of the communication with the gitlab runner to see what is going on, but it would be a lot of work to analyze these. It is btw usually sufficient to restart the local gitlab runner. Another option would be to ask every 10 minutes if the gitlab runner is healthy and if not restart it. This would kill jobs, so one probably would have to leave a note somewhere or restart the jobs automatically. Other suggestions are welcome.
I am running a bit of maintenance on the Windows CI runners (waiting on some builds). I try to not kill anything but can't guarantee it.
I've paused coq-windows-untrusted and I'm logged in. I'm having problems with linking pango (and compiling gtk from sources is broken) so I'll see what I can do.
OK - I was also logging into it - I installed an auto cleanup task yesterday and wanted to see what it did ...
One more note: as far as I remember the runners appear offline when they run a job which takes longer than an hour or two - I don't remember where the limit is. So when you run Windows with all plugins, it is normal that runners appear offline until they are finished.
Yesterday I made a cleanup of Windows cruft (the Windows Disk Cleanup did a good job after I installed a Microsoft patch for it). So the runners are in good shape again (without jobs disk usage is 61..65%, before it was 69%..80%).
I will now install the Scheduled cleanup on all runners.
Ops, "the session was replaced". I guess you did steal mine. No problem the job was not running.
BTW, how am I supposed to do? Which cygwin instance to use and where to put my temp stuff and which script to run? I did start the cygwin from ci and clone Coq there, but I guess it is not the intended way
(also gitlab.bat creates random directory and such... I guess I should call the makecoq_wingw.bat directly, no?)
@Enrico Tassi : the intended way is to use the administrative cygwin to e.g. git clone a repo. Then you should use the windows installer batch file to create a new cygwin for the purpose of compiling Coq. You can't run the bash script directly from the administrative cygwin - it is missing a lot of packages. The issue is that the cygwin package selection changes and it is sometimes sensitive to the version of packages. This is the reason we always create a new cygwin for building Coq. If you want, I can create a batch file, which compiles coq from a folder "coq" in the home directory of user CI in the administrative cygwin.
I mean a batch file which gives suitabke parameters to dev/build/windows/MakeCoq_MinGW.bat
Let me quickly long in to this machine and start a compile for you there OK?
Note: all Windows CI runners now have a scheduled task to clean up debris. It runs every day at 3 AM and cleans everything which is older than 8 hours. Together with the general cleanup I did yesterday, this should bring us smoothly through the 8.13 releases.
The log file for the cleanup is in the home directory of user CI in the administrative cygwin.
OK, thanks. I don't think I'll have time to work on this today. I just wanted to know how to proceed on Monday.
I created a file "MakeCoqMinGW_local.bat" cloned Coq and checked out v8.13 and started it. You can restart the batch file for package level incremental builds (e.g. from cygwin with "./MakeCoqMinGW_local.bat"). Running "makecoq_mingw.sh" from the created cygwin is not recommended because it needs some env vars, but you can start the created cygwin, cd to /build/<package> and run make or whatever there for file level incremental builds of packages.
One more note: you did something strange to the keyboard. Usually you set remote desktop so that it translates from your local keyboard language to English and keep the VMs on English keyboard. In case you change the VM local keyboard settings, please log off before you end your screen session.
One more thing: when you close a remote desktop session on Windows it is like "lock screen" not like "log off". When you connect later again, you should find the screen as you (or I) left it. Also stuff continues to run. You should explicitly log off when you don't want this.
@Enrico Tassi : the v8.13 branch doesn't get very far. I guess I should try my luck with your PR #13450 branch?
https://github.com/coq/coq/pull/13511 contains the subset of commits necessary for CoqIDE alone. Then it still fails as per:
/usr/lib/gcc/x86_64-w64-mingw32/9.2.0/../../../../x86_64-w64-mingw32/bin/ld: C:/ci/cygwin64_24910_13381/usr/x86_64-w64-mingw32/sys-root/mingw/libocaml/site-lib/lablgtk3\liblablgtk3_stubs.a(cairo_pango_stubs.o): in function `caml_pango_cairo_font_map_get_default': /build/lablgtk-3.1.1/_build/default/src/cairo_pango_stubs.c:63: undefined reference to `pango_cairo_font_map_get_default'
(and the commit replacing $CC with something else is also partially buggy, but not the cause I thing.
I meant, #13450 contains a fix for the $CC problem which does not work, while 13511 does not.
I think $(CC) should work - $CC is just a typo (I think I made). Not sure why it anyway did work.
Ah ok, I did not know CC was sensible on cygwin+mingw, so I tried to fetch the value out of ocamlc -config but there is some \r and some ' that one needs to cut away
I checked and the pango symbol that causes the problem is part of the version we (don't) build from sources. So I guess it's also in the version part of cygwin (which is surely newer)
@Michael Soegtrop It seems like the Windows runners need maintenance again :frown:
@Théo Zimmermann : shouldn't be since there are now cron jobs, but let me check ...
@Théo Zimmermann : the runners where all fine - nothing to do. The cron jobs seem to do their job (one runner had a failed build from this morning, but this is not an issue). When there are issues this should be temporary network issues - which are the issues which did lead to the disk issues in the past. Btw.: it is normal that a runner appears offline when it is working. When they are working they don't communicate with gitlab and gitlab reports them as offline after 30min or so.
@Michael Soegtrop There's still something off because Windows jobs are failing on all PRs currently, including on the
v8.13 branches, proving that this is not a consequence of a change in
The failure always happen during the OCaml build (4.08.1 in
master but 4.07.1 in
The error is the following:
/usr/lib/gcc/x86_64-w64-mingw32/10/../../../../x86_64-w64-mingw32/bin/ld: libcamlrun.a(backtrace.o):/build/ocaml-4.07.1/byterun/backtrace.c:31: multiple definition of `caml_debug_info'; libcamlrun.a(backtrace_prim.o):/build/ocaml-4.07.1/byterun/backtrace_prim.c:47: first defined here collect2: error: ld returned 1 exit status ** Fatal error: Error during linking
/usr/lib/gcc/x86_64-w64-mingw32/10/../../../../x86_64-w64-mingw32/bin/ld: libcamlrun.a(backtrace_b.o):/build/ocaml-4.08.1/runtime/backtrace.c:31: multiple definition of `caml_debug_info'; libcamlrun.a(backtrace_byt_b.o):/build/ocaml-4.08.1/runtime/backtrace_byt.c:47: first defined here collect2: error: ld returned 1 exit status
Could it be due to a Cygwin update?
Anyway, that's not a big deal (not something worth spending time investigating) if the job is soon going to be replaced by one based on the platform that works.
isn’t that the
-Wcommon change? IIRC due to gcc 10 (as hinted in the path), and worked around in opam packages for ocaml compilers (by changing compiler options)
-fcommon: https://github.com/ocaml/opam-repository/pull/16722 and https://github.com/ocaml/opam-repository/pull/16583
Thanks! That might be it. It's not the first time I get hit by this and yet, I forgot about it.
Anyway, it would be too much work to patch these jobs now, so I'll let them fail. The platform job shouldn't have this problem since it's opam-based.
Yes. The platform based build should be in early next week.
Windows CI is failing both in the GitHub Action and in GitLab CI with:
/usr/bin/install: cannot stat 'opam-putenv.exe': No such file or directory
Is this a bug with the opam mingw installer or is this an incompatibility with a new cygwin update?
@Théo Zimmermann sorry, I saw this just now. Coq platform master looks fine (https://github.com/coq/platform/actions?query=event%3Aschedule). If this is still there, I can look into it tomorrow.
It looks like things are back to normal.
# cp: error writing '/home/SYSTEM/.opam/_coq-platform_.master/lib/ocaml/weak.ml': No space left on device
isn't there supposed to be some automatic cleanup?
And yet, such errors still happen from time to time!
Sometimes the created cygwin folder has strange permissions and is hard to delete even as admin - I didn't research as yet what extra steps would help to delete the folders in these cases or what causes these deviations.
(I will do a cleanup run later today).
Just wanted to mention that there is a new Windows issue - as it looks some issue around pkg-config, either in the cygwin pkg-config package or in the opam MinGW conf package for it. I am working on a fix.
Regarding the runners: I had a look at the logs and investigated it. As it looks I made as mistake when copying the job description XML file from one machine to the other. This had the effect that the scheduled job did not run as Admin on all machines. This should work now.
Regarding the current failure of Windows CI: this is caused by the change of the conf-pkg-config opam package. This was recently changed - inspired by some hacks I did for conf-bison - to create an opam switch local pkg-condig folder and for a reason I don't understand as yet this breaks the MinGW pkg-config . On a MinGW cross cygwin we have two pkg-config, one for cygwin itself and one for MinGW. Somehow these conflict now why the nicely coexisted before. Needs further investigation ...
@Théo Zimmermann @Enrico Tassi : I just had a look at the coq platform based Coq CI on master and see that you are using the Coq platform v8.13 branch, rather than the master branch. I would think as long as one only builds Coq and CoqIDE and these are tied to the master versions in some way, it wouldn't make much of a difference, but I put fixes usually first in the master branch, because this has a nightly CI, and then merge them over to the release branches. Temporary quick fixes (local opam pateches) I usually don't merge into the release branches at all, because I expect these to be fixed upstream within a week or so.
I think it was an overlook, or maybe master was too far behind whn I added the job. Anyway, I'll for tracking master. Please do a PR, or I'll do it in the next days.
OK, I will do a PR tomorrow morning. The opam issue which leads to CI failures has a temporary fix in platform master already.
Indeed I think master was not yet ready when you moved from the legacy build to the platform build.
.gitlab-ci.yml does link to Coq platform master - just the comment at the top of
platform-windows.bat says that one could link to v8.13, but I see this as description of the script parameters. So things ard fine.
We should stop writing comments, they are just misleading ;-)
I just wanted to ask if anybody has seen disk full related Windows failures since my last update of the runner auto cleanup. Does it work now?
I have seen one or two Windows failures but I haven't looked closely into them. But it looks like things are generally working fine now.
windows is failing today
last success https://github.com/coq/coq/runs/4853644967?check_suite_focus=true
I haven't looked into this for a while - does this still use INRIA runners? If so, I can have a look.
No, this no longer uses the Inria runners. Instead, it just runs on GitHub Actions.
Last updated: Jun 04 2023 at 19:30 UTC