Since I seem to spend most of my time perf
ing Coq I thought I'd start a topic where we can collect observations, hunches, and mistakes.
Here's my first contribution to this topic: @Joshua Yanovski recently helped me figure our why main
(or, rather, the wrappers around it that are the actual entry points to Coq) never seem to account for 100% of the stack samples that perf
collects. After a lot of digging though the linux kernel sources we figured out that perf stack dumps are limited to exactly 65528 bytes. Coq and other OCaml programs regularly exceed this limit due to their use of recursive functions without tail calls.
Sadly, perf
is not very smart about dealing with bigger-than-expected stacks and simply cuts off the bottom without telling the user about that. If you are unlucky, this will produce superficially OK-looking samples where some stack dumps simply start with non-entry-point functions like CArray.map
or other recursive—but not tail recursive—functions. If you are lucky, you'll at least get the dreaded 0xffffffffff function which, I think, really just means that the random bytes perf
found at the artificial "bottom" of the stack do not happen to correspond to a valid function address. At least that way it is somewhat more obvious that something went wrong.
This situation is really infuriating since it can skew measurements quite badly. I often find about 20% of stack dumps to be broken in this way. (It could be way worse, though. Coq is already helping quite a lot here since it replaces some of OCaml stdlib functions by tail-recursive ones, I think. Other OCaml programs easily produce 50% garbage stack dumps.)
So this is something to be aware of when using perf
on Coq. And I don't think there is a general way of repairing the samples. It is not good enough to simple throw away the broken stack dumps as that would be like distributing the time they took uniformly over the remaining stack dumps; which is not at all accurate. One could try to perform surgery on the samples, extending stacks that were cut with "reasonable" base stacks (also taken from the samples). This has several problems: a) there might not be a unique solution, b) depending on the sampling frequency, there might be no solution (this could happen when the stack grows too quickly between samples, leaving no overlap that could be used for matching up samples), and c) even a unique solution could fit only coincidentally.
The real solution, of course, is to patch the linux kernel. After spending a few days on that I decided to leave that to somebody who likes C and has the ability to perform changes on code that consists of mostly type casts..
I should mention that this should not skew "total time spent in function X" measurements. But it will skew anything that relates functions (time spent in children, total time minus time in children, etc.).
On a related note: profiler.firefox.com is actually amazing for exploring perf
data. (It consumes the output of perf script
.) It can even share profiles, which is something that is usually annoying to do. Here's an example coqc
invocation. (Look at the flame graph to see all the broken stack dumps that don't start with _start
!)
You make a good point @Janno about non tail- recursive standard functions; at some point I wouldn't mind we hide OCaml's stdlib in Coq and instead use ours, but that is not an easy task and kinda needs a lot of energy to perform.
@Emilio Jesús Gallego Arias Would there performance benefits to doing that?
Hard to tell, see discussion at https://discuss.ocaml.org/t/a-new-list-map-that-is-both-stack-safe-and-fast/865
Very interesting! Thanks for the link!
Sadly, I think that even if you got rid of all non-tail recursive functions from stdlib you'd still have many Coq functions that would grow the stack beyond perf
's arbitrary limit :(
We can at least document that, and hope that upstream does allow a setting in /sys
to tweak this at some point
@Emilio Jesús Gallego Arias we already do that somehow, most of the calls to List.map
are actually referring to CList.map
which is tailrec.
Usually, the large stack traces that I observe with perf are caused by functions that can't be made tailrec in a reasonable way, e.g. typically Constr.iter
and friends
@Janno I am a big user of perf, but I was never really hurt by this limitation since I usually care about the end of the traces, not their beginning. With a bit of flair you can quickly spot the root of a performance issue just by looking at the tail of trace, not having to care for what happens before.
@Pierre-Marie Pédrot I understand. This limitation also hasn't stopped me from deriving (sometimes) useful hypotheses from perf
. Still, it makes certain things awkward. For example, comparing function overhead between profiles with different stack heights can lead to artificial differences because the function overhead depends on how many stack dumps were cut off.
Oh, I should mention: the default stack size limit of perf
is 8k. To actually get the maximum 65528 you have to specify it via --call-graph dwarf,65528
. (Warning: this will blow up the size of perf.data!)
Also, I just learned about sysctl -w kernel.perf_event_max_stack=1024
which sets the maximum number of stack entries. This seems to be a factor that comes into play once the stack size limit is raised above 8k. Note: perf script
apparently needs to be called with --max-stack=1024
(but not perf record
).
Apparently the combination of 1024 stack entries and 65528 stack size helps a lot. Here's the same coqc invocation using both parameters. I don't see any more cut-off stacks. Sadly, I don't know what's up with the remaining weird stacks that still account for ~15% of the time. Still, it's an improvement.
Yes, indeed I do often have to raise the stack limit via sysctl
.
Last updated: Jun 05 2023 at 10:01 UTC