Stream: Coq users

Topic: Coq boost from 100MB L3 cache?


view this post on Zulip Karl Palmskog (Apr 20 2022 at 19:01):

Given that Coq is memory bound, could one expect a big boost to coqc from having near-100 MB L3 cache, like in AMD's 5800x3d? (For example, compared to the 5800 with only 32 MB L3 cache)

view this post on Zulip Guillaume Melquiond (Apr 20 2022 at 19:16):

The minor heap tends to trash any kind of cache in a memory-bound application. So, I doubt it would make a difference as is. However, if Coq's minor heap was reduced to something like 50MB, then the minor heap would presumably always stay in cache and never be streamed to/from the main memory, possibly improving performance.

view this post on Zulip Karl Palmskog (Apr 20 2022 at 19:29):

ah, unfortunate that it's not an obvious win, but maybe someone will try it as a parameter tweaking and Coq benchmarking project...

view this post on Zulip Guillaume Melquiond (Apr 20 2022 at 19:36):

By reducing the minor heap by a factor 5, we are causing OCaml's runtime to perform 5 times as many garbage collections. Most of that overhead should be absorbed by the fact that these collections now occur in the L3 cache rather than the main memory. But whether that is enough to actually bring a noticeable speedup is not obvious to me.

view this post on Zulip Stefan Monnier (Apr 20 2022 at 19:48):

5 times as many collections, but each collection has to scan a 5 times smaller minor heap. I'm surprised the minor heap is that large. Back some years ago when I was interested in the topic it seemed accepted wisdom that your generations should be sized more or less according to your cache level(s), tho the nursery was usually made to fit the L2 rather than L1 (an L1-sized nursery ended up not giving enough time for objects to die before the next collection, IIRC). Of course, it depends on lots of other factors and I'm sure the OCaml&Coq guys have good reasons to use different sizes.

view this post on Zulip Guillaume Melquiond (Apr 20 2022 at 19:55):

To make it short, the larger the minor heap the faster Coq was (and noticeably so). The size 256MB was a tradeoff: as large as possible without saturating our users' computer memory. (The idea was that we wanted four instances of Coq to be able to run in parallel without swapping.)

view this post on Zulip Paolo Giarrusso (Apr 20 2022 at 21:43):

How long ago was this tuning done? OCaml's GC algorithm changed in 2020 (4.10), and "1GB RAM" seems much older.

view this post on Zulip Paolo Giarrusso (Apr 20 2022 at 21:45):

BTW, does the nursery use a semispace copying collector? (EDIT: yes, according to https://dev.realworldocaml.org/garbage-collector.html ) With those, minor GC are _not_ O(nursery size) but O(live data size). And the basic idea of generational GC is the assumption that live data size << nursery size

view this post on Zulip Guillaume Melquiond (Apr 21 2022 at 01:34):

Paolo Giarrusso said:

How long ago was this tuning done? OCaml's GC algorithm changed in 2020 (4.10), and "1GB RAM" seems much older.

This tuning was performed in 2012 and the assumption was more like "2GB RAM". (You cannot just stop at the minor heaps, you also need to account for the major heaps, for the operating system, for the graphical environment, and so on.) Also, the way the minor heap is handled in OCaml has not changed in 2020; it has always been a bump allocator, and presumably always will be.

view this post on Zulip Michael Soegtrop (Apr 21 2022 at 06:33):

One should check what cache policy the L3 cache uses. LRU is not uncommon in the x64 world (it is uncommon in ARM) but I am not sure if this applies to L3. If the L3 strategy is LRU one has to be very careful to stay absolutely below the cache limits with code + minor heap. With a regular GC and LRU it is likely all or nothing.

view this post on Zulip Karl Palmskog (Apr 21 2022 at 07:08):

looks like the policy is PLRU in AMD Ryzen:

the replacement policy of the implemented LLC is pseudo-LRU as consecutive accesses executed by a single core to congruent cache lines will quickly cause an eviction.

view this post on Zulip Janno (Apr 21 2022 at 07:25):

I have read many times that Coq is memory bound but my own experiments never really showed that to be entirely true. On the few systems that I did any benchmarks on Coq seems to scale linearly with clock frequency (in reasonable ranges on modern hardware) and I don't think that meshes well with the claim that it is memory bound. Has anyone done rigorous experiments in this space?

view this post on Zulip Karl Palmskog (Apr 21 2022 at 07:26):

how else do you explain the big boost on 32-bit OCaml?

view this post on Zulip Karl Palmskog (Apr 21 2022 at 07:27):

see discussion here for example.

Coq with 32-bit pointers really can be twice as fast as with 64-bit pointers

view this post on Zulip Janno (Apr 21 2022 at 07:38):

I am not sure where exactly the "twice as fast" bit is documented but I found https://coq-club.inria.narkive.com/AjYlt8UL/speed-of-coq-8-8-vs-8-7-32-bit-vs-64-bit which documents a 17% difference. The thread has replies of other people that found smaller differences. Maybe it depends a lot on the exact kind of development? Or maybe my understanding of what it means to be memory bound is wrong.

view this post on Zulip Karl Palmskog (Apr 21 2022 at 07:42):

I meant memory bound as: the memory architecture including cache pipelines will determine performance more than CPU speed, generally

view this post on Zulip Karl Palmskog (Apr 21 2022 at 07:48):

if Andrew's conjecture in that post is true, then I think it would imply that 100MB L3 cache would help regardless of minor heap size.

you can fit a lot more 32-bit pointers into the 6 megabyte cache than you can fit 64-bit pointers.

vs. you can fit a lot more pointers into the 100 MB cache than you can fit into the 32 MB cache

view this post on Zulip Janno (Apr 21 2022 at 07:50):

Karl Palmskog said:

I meant memory bound as: the memory architecture including cache pipelines will determine performance more than CPU speed, generally

That is also my understanding of the term. Can it be true that the performance is both cpu bound and memory bound? Or could it be the case that higher frequency gives linear scaling but more cache/faster memory/.. gives even better scaling?

view this post on Zulip Karl Palmskog (Apr 21 2022 at 07:54):

sure, that could be the case, the only thing I've seriously looked at is performance boosts in Coq from parallelism on a fixed system with large amounts of RAM.

I'm not even sure what the implementation is doing most of the time (presumably doing GC?)

view this post on Zulip Pierre-Marie Pédrot (Apr 21 2022 at 07:56):

if you're talking about the vio mode, it's probably passing most of its time marshalling / demarshalling state

view this post on Zulip Karl Palmskog (Apr 21 2022 at 07:56):

right, yes, but the "classic" make -j was also studied there, which is presumably the one being GC prone

view this post on Zulip Karl Palmskog (Apr 21 2022 at 07:59):

but anyway, the main motivation for asking about this is that we found diminishing returns around 4 parallel jobs mark, both for vio and make -j. So for most Coq projects, 16 or 32 cores will not help much. But my guess is that 100 MB cache will help, and 8 physical cores is a kind of sweet spot for Coq proof dev, which the 5800x3d is limited to [I guess due to cache taking so much die space]

view this post on Zulip Karl Palmskog (Apr 21 2022 at 08:05):

it's all speculation at this point, but there might be an ideal "Coq machine" that could be built at the 1200-1400 EUR mark (given that that the 5800x3d is currently around 450 EUR)

view this post on Zulip Paolo Giarrusso (Apr 21 2022 at 10:00):

... also beware with a significantly slower core frequency

view this post on Zulip Paolo Giarrusso (Apr 21 2022 at 10:02):

@Janno to some extent Coq's also an interpreter, so maybe we should ask if specific .v files are memory- or CPU- bound?

view this post on Zulip Karl Palmskog (Apr 21 2022 at 10:03):

for sure, any analysis must pick "representative" Coq files. But I think we have a bunch of those in the Coq Platform.

view this post on Zulip Paolo Giarrusso (Apr 21 2022 at 10:04):

Sure but actual workloads need not be representative.

view this post on Zulip Karl Palmskog (Apr 21 2022 at 10:16):

OK, then we need to pick our RQs:

view this post on Zulip Karl Palmskog (Apr 21 2022 at 10:43):

how much RAM should a Coq machine have? For Chlipala style proofs, probably 128 GB+. But more realistically, 64 GB might do the trick on 8 physical cores (we found virtual cores essentially useless for performance)

view this post on Zulip Emilio Jesús Gallego Arias (Apr 21 2022 at 13:35):

See also https://discuss.ocaml.org/t/ocaml-benchmarks-on-different-processor-architectures/8802

view this post on Zulip Karl Palmskog (Apr 21 2022 at 13:36):

good point, but OCaml benchmarking is so huge, I wouldn't be able to make much sense of it

view this post on Zulip Karl Palmskog (Apr 21 2022 at 13:38):

for example, I think people are still arguing about compilation to continuations vs. regular heap allocation? Makarius abandoned SMLNJ because he believed the latter was better for Isabelle


Last updated: Feb 06 2023 at 12:04 UTC