Given that Coq is memory bound, could one expect a big boost to coqc
from having near-100 MB L3 cache, like in AMD's 5800x3d? (For example, compared to the 5800 with only 32 MB L3 cache)
The minor heap tends to trash any kind of cache in a memory-bound application. So, I doubt it would make a difference as is. However, if Coq's minor heap was reduced to something like 50MB, then the minor heap would presumably always stay in cache and never be streamed to/from the main memory, possibly improving performance.
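(As a concrete illustration of what "reducing the minor heap" would mean in practice, here is a minimal sketch using OCaml's standard Gc module; this is not how Coq configures its runtime, just the generic mechanism, and the same knob can also be set externally through the OCAMLRUNPARAM environment variable.)

```ocaml
(* Sketch: shrink the minor heap to roughly 50 MB on a 64-bit machine.
   minor_heap_size is measured in words (8 bytes each on 64-bit),
   so ~6.5M words is about 52 MB. *)
let () =
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = 6_500_000 }
```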
ah, unfortunate that it's not an obvious win, but maybe someone will try it as a parameter tweaking and Coq benchmarking project...
By reducing the minor heap by a factor of 5, we are causing OCaml's runtime to perform 5 times as many garbage collections. Most of that overhead should be absorbed by the fact that these collections now occur in the L3 cache rather than the main memory. But whether that is enough to actually bring a noticeable speedup is not obvious to me.
5 times as many collections, but each collection has to scan a 5 times smaller minor heap. I'm surprised the minor heap is that large. Back some years ago when I was interested in the topic it seemed accepted wisdom that your generations should be sized more or less according to your cache level(s), tho the nursery was usually made to fit the L2 rather than L1 (an L1-sized nursery ended up not giving enough time for objects to die before the next collection, IIRC). Of course, it depends on lots of other factors and I'm sure the OCaml&Coq guys have good reasons to use different sizes.
To make it short, the larger the minor heap the faster Coq was (and noticeably so). The size 256MB was a tradeoff: as large as possible without saturating our users' computer memory. (The idea was that we wanted four instances of Coq to be able to run in parallel without swapping.)
How long ago was this tuning done? OCaml's GC algorithm changed in 2020 (4.10), and "1GB RAM" seems much older.
BTW, does the nursery use a semispace copying collector? (EDIT: yes, according to https://dev.realworldocaml.org/garbage-collector.html ) With those, minor GCs are _not_ O(nursery size) but O(live data size). And the basic idea of generational GC is the assumption that live data size << nursery size.
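(A rough way to check that assumption empirically, sketched here with OCaml's standard Gc counters rather than anything Coq-specific: the ratio of promoted words to minor words approximates how much of the nursery is still live when a minor collection runs.)

```ocaml
(* Sketch: estimate the nursery survival rate from GC counters.
   A low ratio means live data << nursery size, so minor collections
   stay cheap even with a large minor heap. *)
let report_minor_survival () =
  let s = Gc.quick_stat () in
  Printf.printf "minor collections: %d, survival rate: %.2f%%\n"
    s.Gc.minor_collections
    (100. *. s.Gc.promoted_words /. s.Gc.minor_words)
```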
Paolo Giarrusso said:
How long ago was this tuning done? OCaml's GC algorithm changed in 2020 (4.10), and "1GB RAM" seems much older.
This tuning was performed in 2012 and the assumption was more like "2GB RAM". (You cannot just stop at the minor heaps, you also need to account for the major heaps, for the operating system, for the graphical environment, and so on.) Also, the way the minor heap is handled in OCaml has not changed in 2020; it has always been a bump allocator, and presumably always will be.
One should check what cache policy the L3 cache uses. LRU is not uncommon in the x64 world (it is uncommon in ARM) but I am not sure if this applies to L3. If the L3 strategy is LRU one has to be very careful to stay absolutely below the cache limits with code + minor heap. With a regular GC and LRU it is likely all or nothing.
looks like the policy is PLRU in AMD Ryzen:
the replacement policy of the implemented LLC is pseudo-LRU as consecutive accesses executed by a single core to congruent cache lines will quickly cause an eviction.
I have read many times that Coq is memory bound, but my own experiments never really showed that to be entirely true. On the few systems I did any benchmarks on, Coq seems to scale linearly with clock frequency (in reasonable ranges on modern hardware), and I don't think that meshes well with the claim that it is memory bound. Has anyone done rigorous experiments in this space?
how else do you explain the big boost on 32-bit OCaml?
see discussion here for example.
Coq with 32-bit pointers really can be twice as fast as with 64-bit pointers
I am not sure where exactly the "twice as fast" bit is documented but I found https://coq-club.inria.narkive.com/AjYlt8UL/speed-of-coq-8-8-vs-8-7-32-bit-vs-64-bit which documents a 17% difference. The thread has replies of other people that found smaller differences. Maybe it depends a lot on the exact kind of development? Or maybe my understanding of what it means to be memory bound is wrong.
I meant memory bound as: the memory architecture including cache pipelines will determine performance more than CPU speed, generally
if Andrew's conjecture in that post is true, then I think it would imply that 100MB L3 cache would help regardless of minor heap size.
you can fit a lot more 32-bit pointers into the 6 megabyte cache than you can fit 64-bit pointers.
vs. you can fit a lot more pointers into the 100 MB cache than you can fit into the 32 MB cache
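(Back-of-the-envelope, ignoring cache-line granularity and the fact that heap words are not all pointers:)

```ocaml
(* Rough pointer counts for the cache sizes discussed above. *)
let pointers ~cache_mb ~ptr_bytes = cache_mb * 1024 * 1024 / ptr_bytes
let () =
  Printf.printf "6 MB: %d (32-bit) vs %d (64-bit); 100 MB: %d (64-bit)\n"
    (pointers ~cache_mb:6 ~ptr_bytes:4)
    (pointers ~cache_mb:6 ~ptr_bytes:8)
    (pointers ~cache_mb:100 ~ptr_bytes:8)
```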
Karl Palmskog said:
I meant memory bound as: the memory architecture including cache pipelines will determine performance more than CPU speed, generally
That is also my understanding of the term. Can it be true that the performance is both CPU bound and memory bound? Or could it be the case that higher frequency gives linear scaling but more cache/faster memory/... gives even better scaling?
sure, that could be the case, the only thing I've seriously looked at is performance boosts in Coq from parallelism on a fixed system with large amounts of RAM.
I'm not even sure what the implementation is doing most of the time (presumably doing GC?)
if you're talking about the vio mode, it's probably spending most of its time marshalling / demarshalling state
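(For context, "marshalling state" here means OCaml's generic serialization of heap values; a minimal sketch of the mechanism, not Coq's actual .vio code:)

```ocaml
(* Sketch of OCaml marshalling: serialize an arbitrary value to a file
   and read it back; .vio files rely on the same mechanism. *)
let save_state path state =
  let oc = open_out_bin path in
  Marshal.to_channel oc state [];
  close_out oc

let load_state path =
  let ic = open_in_bin path in
  let state = Marshal.from_channel ic in
  close_in ic;
  state
```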
right, yes, but the "classic" make -j was also studied there, which is presumably the one that is GC-prone
but anyway, the main motivation for asking about this is that we found diminishing returns around the 4-parallel-jobs mark, both for vio and make -j. So for most Coq projects, 16 or 32 cores will not help much. But my guess is that 100 MB cache will help, and 8 physical cores is a kind of sweet spot for Coq proof dev, which the 5800x3d is limited to [I guess due to cache taking so much die space]
it's all speculation at this point, but there might be an ideal "Coq machine" that could be built at the 1200-1400 EUR mark (given that the 5800x3d is currently around 450 EUR)
... also beware that it comes with a significantly slower core frequency
@Janno to some extent Coq's also an interpreter, so maybe we should ask whether specific .v files are memory-bound or CPU-bound?
for sure, any analysis must pick "representative" Coq files. But I think we have a bunch of those in the Coq Platform.
Sure but actual workloads need not be representative.
OK, then we need to pick our RQs: how does coqc perform, on a given architecture and CPU frequency and memory size, for common Coq workloads?
How much RAM should a Coq machine have? For Chlipala-style proofs, probably 128 GB+. But more realistically, 64 GB might do the trick on 8 physical cores (we found virtual cores essentially useless for performance)
See also https://discuss.ocaml.org/t/ocaml-benchmarks-on-different-processor-architectures/8802
good point, but OCaml benchmarking is so huge, I wouldn't be able to make much sense of it
for example, I think people are still arguing about compilation to continuations vs. regular heap allocation? Makarius abandoned SMLNJ because he believed the latter was better for Isabelle