Is it known how much time Coq spends doing garbage collection when checking typical projects, e.g., those in CI?
Context: I talked to a co-author of a recent OOPSLA paper showing that, with better tuning of heap sizes, one can reduce garbage collection time by up to 30% in applications like browsers. He has used Coq in the past, so they might be interested in experimenting with applying their technique to Coq.
I don't have an answer but I have a handful of data points. The x axis is the value of `o` in `OCAMLRUNPARAM`. This is with OCaml 4.14 and `a=2`: image.png
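A sweep like that can be scripted by overriding `OCAMLRUNPARAM` per run. This is only a sketch: `proof.v` is a placeholder file name, and the `coqc` invocation is merely echoed rather than executed.

```shell
# Sketch of sweeping o (space_overhead); "proof.v" is a hypothetical file
# and the coqc invocation is only echoed here. Replace the echo with e.g.
# `time coqc proof.v` to take real measurements.
for o in 100 200 400 800 1600; do
  export OCAMLRUNPARAM="a=2,o=$o"
  echo "OCAMLRUNPARAM=$OCAMLRUNPARAM coqc proof.v"
done
```

Each iteration prints the environment the run would use, e.g. `OCAMLRUNPARAM=a=2,o=400 coqc proof.v`.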
This profile was taken from some random proof in our development. I doubt our proofs are representative of developments in CI but maybe the graph is still useful.
@Janno so you get a modest decrease in time with increased GC speed? What is RSS here?
Resident set size. Basically the actual memory used, not just the virtual memory reserved.
I actually forgot what `o` tunes exactly. The tradeoff to me is that we can increase memory consumption drastically for a modest but noticeable speed bump.
Did you also pass `s=32M` to `OCAMLRUNPARAM`? Otherwise, you will be in a completely different setting than vanilla Coq.
(This was necessary because Coq on 4.14 is just way slower on average than Coq on 4.07)
Guillaume Melquiond said:
> Did you also pass `s=32M` to `OCAMLRUNPARAM`? Otherwise, you will be in a completely different setting than vanilla Coq.
I haven't heard of that before. I suppose I should run some more experiments. Thanks for the pointer!
By default, Coq runs under the equivalent of `OCAMLRUNPARAM=a=2,o=200,s=32M` for OCaml >= 4.10.
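So reproducing the vanilla Coq setting while experimenting with a single knob means spelling out the whole string. A minimal sketch (the `o=400` value is just an arbitrary example):

```shell
# Coq's implicit defaults on OCaml >= 4.10, spelled out explicitly.
export OCAMLRUNPARAM="a=2,o=200,s=32M"
# To experiment with one knob, keep the other two at Coq's values,
# e.g. doubling the space overhead:
export OCAMLRUNPARAM="a=2,o=400,s=32M"
echo "$OCAMLRUNPARAM"
```

Omitting `s=32M` from the string, as discussed above, silently drops you back to OCaml's much smaller default minor heap.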
Wow. My computer is a bit too noisy to report anything with certainty, but it seems like `s=32M` speeds up a random proof I used for the experiment by 16% at `o=200`.
Do you happen to know when this setting was selected for OCaml >= 4.10?
You can look at the history of https://github.com/coq/coq/blob/master/sysinit/coqinit.ml
Only the `a=2,o=200` part was selected for OCaml >= 4.10. The `s=32M` part has been there for ages.
Interesting. I suppose when I saw the regression in 4.14 I immediately started playing around with `OCAMLRUNPARAM` without `s` in it and never realized that I introduced an additional regression that way.
After checking, it has been there since 2012.
So far it seems that not having `s=32M` has been a blessing in disguise, at least on our CI servers.
I am still testing different values, but so far any value above `s=256k` (the default when it is not overwritten by Coq) has resulted in unacceptable to catastrophic wall-clock time increases.
This is very weird; increasing the minor heap is not a Coq-specific trick, and the default is really too small. Is it the only parameter you are tweaking?
Yes
The result locally on my laptop is very, very different. Even at `s=256M` I still see a decrease in wall-clock time, although the number of cache misses leads to diminishing returns very quickly after `s=16M` or so.
yeah, we were looking at the data and it's... complicated, but `s` is in words, so `s=256k` means 2MB, and with enough parallel jobs the minor heaps together can overflow the cache. I'm no expert here, but memory bandwidth benchmarks exhibit performance cliffs each time you overflow a layer in the memory hierarchy, so "overflowing L3 hits a cliff on this architecture" wouldn't be shocking.
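The words-to-bytes arithmetic is easy to sanity-check (assuming a 64-bit platform, i.e. 8 bytes per word):

```shell
# s counts words, not bytes; on a 64-bit machine a word is 8 bytes.
bytes_per_word=8
s_default=$((256 * 1024))    # s=256k, OCaml's built-in default
s_coq=$((32 * 1024 * 1024))  # s=32M, the value Coq sets
echo "s=256k -> $((s_default * bytes_per_word / 1024 / 1024)) MiB"
echo "s=32M  -> $((s_coq * bytes_per_word / 1024 / 1024)) MiB"
```

So Coq's `s=32M` is a 256 MiB minor heap per process, versus 2 MiB for OCaml's default.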
Hum, I'm not very familiar with modern server hardware, but a while back more cores came paired with more cache. What you say seems to suggest you have many cores but little cache... :puzzle:
the above was somewhat off, but this processor has 64MB of cache and 12 cores. I guess it's not enough to just fit the minor heap?
BTW, is it true that data moves from the minor heap to the major heap after _1_ minor GC cycle? That's what https://v2.ocaml.org/api/Gc.html claims, but I'm used to semi-space collectors that move data between two semi-spaces
In OCaml 4.x that is the case AFAIK
Last updated: May 31 2023 at 16:01 UTC