As a follow-up on the discussions about threading options for VsCoq, I would appreciate a short summary of the latest state of:
I guess the long-term goal is to have something like a global worker pool and some heuristics to distribute workers to processes which have work to do. How far away are we from this?
A side note: I used par: quite a bit in the past, but moved away from it for two reasons:
There is a tool, probably bitrotted today, called coqworkmgr, which acts as a token server (a la make) and should prevent too many workers (tactic or proof) from being spawned in the same unix shell (it tells you an env variable to export, and then all coq processes ask for a token there).
The huge memory footprint is due to the fact that we use processes and that we don't fork (even on unix) but rather marshal the state of Coq to an unrelated process, duplicating all memory pages. In the new LSP thingy for vscoq I've been working on with Maxime I do, at least, use fork on unix. That should reduce the memory footprint somewhat. I don't see real threads as an option, even in OCaml 5, since Coq is still a bit too imperative to be easily parallelized at a finer grain.
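A minimal OCaml sketch of the two spawning strategies just described, with made-up function names (this is not the actual Coq/vscoq code): marshalling serializes the whole state to an unrelated process, while fork shares the parent's pages copy-on-write.

```ocaml
(* Illustrative sketch only; the ['a] state and the function names are
   hypothetical placeholders, not the real Coq/vscoq implementation. *)

(* Strategy 1: hand the state to an unrelated process by marshalling it.
   The whole state is serialized and rebuilt on the other side, so the
   corresponding memory pages end up duplicated. *)
let spawn_by_marshalling (state : 'a) (worker_cmd : string) : int =
  let r, w = Unix.pipe () in
  let pid =
    Unix.create_process worker_cmd [| worker_cmd |] r Unix.stdout Unix.stderr
  in
  Unix.close r;
  let oc = Unix.out_channel_of_descr w in
  Marshal.to_channel oc state [ Marshal.Closures ];  (* full copy of the state *)
  close_out oc;
  pid

(* Strategy 2 (Unix/macOS only): fork.  The child starts with the parent's
   pages shared copy-on-write, so memory is duplicated only for pages that
   are written to afterwards. *)
let spawn_by_fork (work : 'a -> unit) (state : 'a) : int =
  match Unix.fork () with
  | 0 -> work state; exit 0   (* child: the state is already in memory *)
  | pid -> pid                (* parent: returns the child's pid *)
```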
The implementation of par: I'm talking about is not really related to the UI, so we could use it in today's Coq as well if the current code has indeed broken (I'm not sure I've seen the issues, but I've also been very busy for family reasons; pointers welcome).
@Michael Soegtrop in my opinion the workers design has never worked well, and we will hopefully phase it out in favor of OCaml multicore; this is to be discussed internally in the next few weeks.
And, more generally, we have an issue with the memory footprint of Coq. IIRC @Pierre-Marie Pédrot discovered a quadratic factor, but I can't recall what it was (and it was not trivial to fix).
@gares: I hope "Unix" includes "macOS" here (which afaik is a true descendant of SystemV).
So in short the answer is that this is work in progress. That's fine - I can live with the current worker architecture. It sometimes requires some tweaking of CoqIDE parameters to develop certain proofs efficiently. I just fear that, as is, it requires quite a bit of experience to use effectively, and I hope that the future will be more automated.
Is coqworkmgr still operational?
I hope so, but we don't have tests in the test suite... here's the doc: https://coq.inria.fr/refman/addendum/parallel-proof-processing.html#limiting-the-number-of-parallel-workers
Yes, OSX has a fork syscall whose implementation copies pages on write, as far as I know. OCaml considers it a Unix.
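A tiny illustration of why the fork-based path applies on macOS as well: the OCaml runtime reports the generic Unix backend there.

```ocaml
(* Sys.os_type is "Unix" on both Linux and macOS ("Win32"/"Cygwin" on
   Windows), so a fork-based code path guarded like this is also taken
   on macOS. *)
let can_fork = Sys.os_type = "Unix"
```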
Regarding coqworkmgr: what happens if I restrict it to, say, 12, set the worker job count to 6 in my IDE, and open 3 windows? Will the last of the 3 windows ever get workers?
@Enrico Tassi having to start coqworkmgr
in the parent shell might be part of the problem — it's pretty invasive. dune 2 started its daemon automatically...
@Michael Soegtrop IIRC (I should have documented it more precisely) the allocation is not that static: when one needs to start a worker (for a proof), it waits for a token. It is not that the process hierarchy is built once and for all.
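A rough in-process sketch of that behaviour, with made-up names (the real coqworkmgr is a separate daemon handing out tokens over a socket): whoever needs a worker blocks until a token is free, so tokens are handed out on demand rather than allocated once and for all.

```ocaml
(* Illustrative token pool shared by all would-be workers; names are
   invented, this only models the waiting behaviour described above. *)
type pool = {
  mutable available : int;
  mutex : Mutex.t;
  cond : Condition.t;
}

let make_pool n =
  { available = n; mutex = Mutex.create (); cond = Condition.create () }

(* Whoever needs to start a worker blocks here until a token is free;
   tokens are claimed on demand, not assigned statically per window. *)
let acquire p =
  Mutex.lock p.mutex;
  while p.available = 0 do Condition.wait p.cond p.mutex done;
  p.available <- p.available - 1;
  Mutex.unlock p.mutex

(* Returning a token wakes up one waiter. *)
let release p =
  Mutex.lock p.mutex;
  p.available <- p.available + 1;
  Condition.signal p.cond;
  Mutex.unlock p.mutex
```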
@Paolo Giarrusso, yes, a global jobserver would be nice; I've nothing against using that thing, but I don't have time to port Coq (not hard, but well...).
I don't doubt you're busy, and I'm not asking for myself right now (if I cared enough, I might be able to wrangle something together for my setup).
I am suggesting that (in the long term) changes here have lots of leverage: if coqworkmgr
works transparently, we could just enable 1 worker per CPU automatically, and people would get a 4x-10x-20x speedup depending on their machine.
sorry: they'd get whatever the benefits of 4-10-20 workers are, probably not a linear speedup, but it should still help.
The speedup is actually quite linear as long as the proof times are similar. VsCoq with this option is actually quite fast.
@Michael Soegtrop it would be nice to have some timings, because the last time I measured, it was slower to use 4 workers than just the regular coqc compiler
I guess we are talking about interactive mode, here.
I was using the command line, I think; it was when we added the async mode to the test suite
and it took 4 hours to finish vs 20 mins for the non-async one
or something like that, I don't remember
of course this is very sensitive to the particular proof setup
so I'm interested in Michael's numbers; mine were likely worst case
@Emilio Jesús Gallego Arias : did you run make sequentially then? It is not a fair model of interactive mode to compare parallel make with workers vs parallel make without workers. If the parallel make can use the available CPU parallelism, workers are obviously pointless. But in interactive mode there is no make-level parallelism.
At least in my stuff (which I cannot share) the difference between 12 workers and 1 worker is a few seconds to a few tens of seconds - as expected. I would think that one can see this in any file with, say, 100 lemmas in it.
I was comparing the compilation time for a single file. I guess each of your proofs take a long time to build/check, right?
Guess the slowdown is due to serializing tasks to separate processes?
I did measure up to 0.5 seconds to spawn a worker (send the state, etc.) with large states. If your proofs take less than that, well, it is a waste. If the proofs are 10 seconds each, then the story is different. Also, the STM records how long it took, so next time it won't delegate a fast proof.
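A hedged sketch of that heuristic; the table, the identifiers, and the cutoff are illustrative guesses, not the actual STM code.

```ocaml
(* Illustrative delegation decision: delegate a proof to a worker only when
   its last observed checking time exceeds the worker spawn overhead. *)
let spawn_overhead = 0.5  (* seconds; the measured worst case mentioned above *)

(* [last_durations] maps a (hypothetical) proof identifier to how long that
   proof took the last time it was checked. *)
let should_delegate (last_durations : (string, float) Hashtbl.t) (proof_id : string) =
  match Hashtbl.find_opt last_durations proof_id with
  | Some t -> t > spawn_overhead   (* known fast proofs stay in the master *)
  | None -> true                   (* no record yet: try a worker *)
```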
But in any case we are measuring the wrong thing. Async proofs help to quickly jump back and forth in a file with heavy proofs, not to check a file more quickly. Indeed, the default number of workers is one, just to do some work in the background on proofs left aside while quickly reaching the point of interest. I'm amazed that modern beefy machines can work with 10 workers (there is still only one master that has to orchestrate and deal with the user's interactive requests).
In the new vscoqtop I did use fork, and believe me, no fork takes 0.5 seconds.
Emilio Jesús Gallego Arias said:
I was comparing the compilation time for a single file. I guess each of your proofs take a long time to build/check, right?
I would say my proofs in the file I am currently working on each check in about 1 s on average - in this area I get a speedup of about 3-4 for 6 threads - more threads don't give me a lot here. If the proofs are faster, the speedup is less (I would say 2x). If the proofs take longer, the speedup approaches linear and more threads also make sense. Where it doesn't help a lot is if one has many fast proofs and one big proof - something I try to avoid by breaking up lemmas.
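A back-of-the-envelope illustration of the "one big proof" limit, with made-up numbers: the wall-clock time with n workers can never drop below the longest single proof, which caps the speedup regardless of how many workers are available.

```ocaml
(* Illustrative lower bound: with n workers, checking the file cannot be
   faster than max (longest proof) (total work / n).  Numbers are made up. *)
let lower_bound ~workers durations =
  let total = List.fold_left ( +. ) 0. durations in
  let longest = List.fold_left max 0. durations in
  Float.max longest (total /. float_of_int workers)

(* One 30 s proof among a hundred 1 s proofs: 130 s of total work, but the
   bound never drops below 30 s, i.e. the speedup is capped at about 4.3x
   no matter how many workers are added. *)
let () =
  let durations = 30. :: List.init 100 (fun _ -> 1.) in
  Printf.printf "bound with 12 workers: %.1f s\n" (lower_bound ~workers:12 durations)
```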