As a follow-up on the discussions about threading options for VsCoq, I would appreciate a short summary of the latest state of:
I guess the long-term goal is to have something like a global worker pool and some heuristics to distribute workers to processes which have work to do. How far away are we from this?
A side note: I used par: quite a bit in the past, but moved away from it for two reasons:
There is a tool, probably bitrotted today, called coqworkmgr, which acts as a token server (a la make) and should prevent too many workers (tactic or proof) from being spawned in the same unix shell (it tells you an env variable to export, and then all coq processes ask for a token there).
The huge memory footprint is due to the fact that we use processes and that we don't fork (even on unix) but rather marshal the state of Coq to an unrelated process, duplicating all memory pages. In the new LSP thingy for vscoq I've been working on with Maxime I do, at least, use fork on unix. That should reduce the memory footprint somewhat. I don't see real threads as an option, even in OCaml 5, since Coq is still a bit too imperative to be easily parallelized at a finer grain.
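A minimal OCaml sketch of the two spawning strategies just described, with made-up function names (this is not the actual Coq/vscoq code): marshalling serializes the whole state to an unrelated process, while fork shares the parent's pages copy-on-write.

```ocaml
(* Illustrative sketch only; the ['a] state and the function names are
   hypothetical placeholders, not the real Coq/vscoq implementation. *)

(* Strategy 1: hand the state to an unrelated process by marshalling it.
   The whole state is serialized and rebuilt on the other side, so the
   corresponding memory pages end up duplicated. *)
let spawn_by_marshalling (state : 'a) (worker_cmd : string) : int =
  let r, w = Unix.pipe () in
  let pid =
    Unix.create_process worker_cmd [| worker_cmd |] r Unix.stdout Unix.stderr
  in
  Unix.close r;
  let oc = Unix.out_channel_of_descr w in
  Marshal.to_channel oc state [ Marshal.Closures ];  (* full copy of the state *)
  close_out oc;
  pid

(* Strategy 2 (Unix/macOS only): fork.  The child starts with the parent's
   pages shared copy-on-write, so memory is duplicated only for pages that
   are written to afterwards. *)
let spawn_by_fork (work : 'a -> unit) (state : 'a) : int =
  match Unix.fork () with
  | 0 -> work state; exit 0   (* child: the state is already in memory *)
  | pid -> pid                (* parent: returns the child's pid *)
```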
The implementation of par: I'm talking about is not really related to the UI, so we could use it in today's Coq as well if the current code has indeed broken (I'm not sure I've seen the issues, but I've also been very busy for family reasons; pointers welcome).
@Michael Soegtrop in my opinion the workers design has never worked well, and we will hopefully phase it out in favor of OCaml multicore; this is to be discussed internally in the next few weeks.
And, more generally, we have an issue with the memory footprint of Coq. IIRC @Pierre-Marie Pédrot discovered a quadratic factor, but I can't recall what it was (and it was not trivial to fix).
@gares: I hope "Unix" includes "macOS" here (which afaik is a true descendant of SystemV).
So in short the answer is that this is work in progress. That's fine - I can live with the current worker architecture. It sometimes requires some tweaking of CoqIDE parameters to develop certain proofs efficiently. I just fear that, as is, it requires quite a bit of experience to use effectively, and I hope that the future will be more automated.
Is coqworkmgr still operational?
I hope so, but we don't have tests in the test suite... here's the doc: https://coq.inria.fr/refman/addendum/parallel-proof-processing.html#limiting-the-number-of-parallel-workers
Yes, OSX has a fork syscall whose implementation copies pages on write, as far as I know. OCaml considers it a Unix.
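A tiny illustration of why the fork-based path applies on macOS as well: the OCaml runtime reports the generic Unix backend there.

```ocaml
(* Sys.os_type is "Unix" on both Linux and macOS ("Win32"/"Cygwin" on
   Windows), so a fork-based code path guarded like this is also taken
   on macOS. *)
let can_fork = Sys.os_type = "Unix"
```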
Regarding coqworkmgr: what happens if I restrict it to, say, 12, set the worker job count to 6 in my IDE, and open 3 windows? Will the last of the 3 windows ever get workers?
@Enrico Tassi having to start coqworkmgr
in the parent shell might be part of the problem — it's pretty invasive. dune 2 started its daemon automatically...
@Michael Soegtrop IIRC (I should have documented it more precisely) the allocation is not that static: when one needs to start a worker (for a proof), it waits for a token. It is not that the process hierarchy is built once and for all.
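A rough in-process sketch of that behaviour, with made-up names (the real coqworkmgr is a separate daemon handing out tokens over a socket): whoever needs a worker blocks until a token is free, so tokens are handed out on demand rather than allocated once and for all.

```ocaml
(* Illustrative token pool shared by all would-be workers; names are
   invented, this only models the waiting behaviour described above. *)
type pool = {
  mutable available : int;
  mutex : Mutex.t;
  cond : Condition.t;
}

let make_pool n =
  { available = n; mutex = Mutex.create (); cond = Condition.create () }

(* Whoever needs to start a worker blocks here until a token is free;
   tokens are claimed on demand, not assigned statically per window. *)
let acquire p =
  Mutex.lock p.mutex;
  while p.available = 0 do Condition.wait p.cond p.mutex done;
  p.available <- p.available - 1;
  Mutex.unlock p.mutex

(* Returning a token wakes up one waiter. *)
let release p =
  Mutex.lock p.mutex;
  p.available <- p.available + 1;
  Condition.signal p.cond;
  Mutex.unlock p.mutex
```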
@Paolo Giarrusso, yes, a global jobserver would be nice; I've nothing against using that thing, but I don't have time to port Coq (not hard, but well...).
I don't doubt you're busy, and I'm not asking for myself right now (if I cared enough, I might be able to wrangle something together for my setup).
I am suggesting that (in the long term) changes here have lots of leverage: if coqworkmgr
works transparently, we could just enable 1 worker per CPU automatically, and people would get a 4x-10x-20x speedup depending on their machine.
sorry: they'd get whatever the benefits of 4-10-20 workers are, probably not a linear speedup, but it should still help.
The speedup is actually quite linear as long as the proof times are similar. VsCoq with this option is actually quite fast.
@Michael Soegtrop it would be nice to have some timings, because the last time I measured, it was slower to use 4 workers than just the regular coqc compiler
I guess we are talking about interactive mode, here.
I was using the command line, I think; it was when we added the async mode to the test suite
and it took 4 hours to finish vs 20 mins for the non-async one
or something like that, I don't remember
of course this is very sensitive to the particular proof setup
so I'm interested in Michael's numbers; mine were likely worst case
@Emilio Jesús Gallego Arias : did you run make sequentially then? It is not a fair model of interactive mode to compare parallel make with workers vs parallel make without workers. If the parallel make can use the available CPU parallelism, workers are obviously pointless. But in interactive mode there is no make-level parallelism.
At least in my stuff (which I cannot share) the difference between 12 workers and 1 worker is a few seconds to a few tens of seconds - as expected. I would think that one can see this in any file with, say, 100 lemmas in it.
I was comparing the compilation time for a single file. I guess each of your proofs take a long time to build/check, right?
Guess the slowdown is due to serializing tasks to separate processes?
I did measure up to 0.5 seconds to spawn a worker (send the state, etc.) with large states. If your proofs take less than that, well, it is a waste. If the proofs are 10 seconds each, then the story is different. Also, the STM records how long it took, so next time it won't delegate a fast proof.
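A hedged sketch of that heuristic; the table, the identifiers, and the cutoff are illustrative guesses, not the actual STM code.

```ocaml
(* Illustrative delegation decision: delegate a proof to a worker only when
   its last observed checking time exceeds the worker spawn overhead. *)
let spawn_overhead = 0.5  (* seconds; the measured worst case mentioned above *)

(* [last_durations] maps a (hypothetical) proof identifier to how long that
   proof took the last time it was checked. *)
let should_delegate (last_durations : (string, float) Hashtbl.t) (proof_id : string) =
  match Hashtbl.find_opt last_durations proof_id with
  | Some t -> t > spawn_overhead   (* known fast proofs stay in the master *)
  | None -> true                   (* no record yet: try a worker *)
```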
But in any case we are measuring the wrong thing. Async proofs help to quickly jump back and forth in a file with heavy proofs, not to check a file more quickly. Indeed, the default number of workers is one, just to do some work in the background on proofs left aside while quickly reaching the point of interest. I'm amazed that modern beefy machines can work with 10 workers (there is still only one master that has to orchestrate and deal with the user's interactive requests).
In the new vscoqtop I did use fork, and believe me, no fork takes 0.5 seconds.
Emilio Jesús Gallego Arias said:
I was comparing the compilation time for a single file. I guess each of your proofs take a long time to build/check, right?
I would say my proofs in the file I am currently working on each check in about 1 s on average - in this area I get a speedup of about 3-4 for 6 threads - more threads don't give me a lot here. If the proofs are faster, the speedup is less (I would say 2x). If the proofs take longer, the speedup approaches linear and more threads also make sense. Where it doesn't help a lot is if one has many fast proofs and one big proof - something I try to avoid by breaking up lemmas.
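A back-of-the-envelope illustration of the "one big proof" limit, with made-up numbers: the wall-clock time with n workers can never drop below the longest single proof, which caps the speedup regardless of how many workers are available.

```ocaml
(* Illustrative lower bound: with n workers, checking the file cannot be
   faster than max (longest proof) (total work / n).  Numbers are made up. *)
let lower_bound ~workers durations =
  let total = List.fold_left ( +. ) 0. durations in
  let longest = List.fold_left max 0. durations in
  Float.max longest (total /. float_of_int workers)

(* One 30 s proof among a hundred 1 s proofs: 130 s of total work, but the
   bound never drops below 30 s, i.e. the speedup is capped at about 4.3x
   no matter how many workers are added. *)
let () =
  let durations = 30. :: List.init 100 (fun _ -> 1.) in
  Printf.printf "bound with 12 workers: %.1f s\n" (lower_bound ~workers:12 durations)
```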