Stream: Coq devs & plugin devs

Topic: Coq & `perf`


view this post on Zulip Janno (May 13 2020 at 08:34):

Since I seem to spend most of my time perfing Coq I thought I'd start a topic where we can collect observations, hunches, and mistakes.
Here's my first contribution to this topic: @Joshua Yanovski recently helped me figure our why main (or, rather, the wrappers around it that are the actual entry points to Coq) never seem to account for 100% of the stack samples that perf collects. After a lot of digging though the linux kernel sources we figured out that perf stack dumps are limited to exactly 65528 bytes. Coq and other OCaml programs regularly exceed this limit due to their use of recursive functions without tail calls.
Sadly, perf is not very smart about dealing with bigger-than-expected stacks and simply cuts off the bottom without telling the user about that. If you are unlucky, this will produce superficially OK-looking samples where some stack dumps simply start with non-entry-point functions like CArray.map or other recursive—but not tail recursive—functions. If you are lucky, you'll at least get the dreaded 0xffffffffff function which, I think, really just means that the random bytes perf found at the artificial "bottom" of the stack do not happen to correspond to a valid function address. At least that way it is somewhat more obvious that something went wrong.
This situation is really infuriating since it can skew measurements quite badly. I often find about 20% of stack dumps to be broken in this way. (It could be way worse, though. Coq is already helping quite a lot here since it replaces some of OCaml stdlib functions by tail-recursive ones, I think. Other OCaml programs easily produce 50% garbage stack dumps.)
So this is something to be aware of when using perf on Coq. And I don't think there is a general way of repairing the samples. It is not good enough to simple throw away the broken stack dumps as that would be like distributing the time they took uniformly over the remaining stack dumps; which is not at all accurate. One could try to perform surgery on the samples, extending stacks that were cut with "reasonable" base stacks (also taken from the samples). This has several problems: a) there might not be a unique solution, b) depending on the sampling frequency, there might be no solution (this could happen when the stack grows too quickly between samples, leaving no overlap that could be used for matching up samples), and c) even a unique solution could fit only coincidentally.
The real solution, of course, is to patch the linux kernel. After spending a few days on that I decided to leave that to somebody who likes C and has the ability to perform changes on code that consists of mostly type casts..

view this post on Zulip Janno (May 13 2020 at 08:42):

I should mention that this should not skew "total time spent in function X" measurements. But it will skew anything that relates functions (time spent in children, total time minus time in children, etc.).

view this post on Zulip Janno (May 13 2020 at 08:48):

On a related note: profiler.firefox.com is actually amazing for exploring perf data. (It consumes the output of perf script.) It can even share profiles, which is something that is usually annoying to do. Here's an example coqc invocation. (Look at the flame graph to see all the broken stack dumps that don't start with _start!)

view this post on Zulip Emilio Jesús Gallego Arias (May 13 2020 at 09:29):

You make a good point @Janno about non tail- recursive standard functions; at some point I wouldn't mind we hide OCaml's stdlib in Coq and instead use ours, but that is not an easy task and kinda needs a lot of energy to perform.

view this post on Zulip Janno (May 13 2020 at 09:33):

@Emilio Jesús Gallego Arias Would there performance benefits to doing that?

view this post on Zulip Emilio Jesús Gallego Arias (May 13 2020 at 09:34):

Hard to tell, see discussion at https://discuss.ocaml.org/t/a-new-list-map-that-is-both-stack-safe-and-fast/865

view this post on Zulip Janno (May 13 2020 at 09:43):

Very interesting! Thanks for the link!

view this post on Zulip Janno (May 13 2020 at 09:44):

Sadly, I think that even if you got rid of all non-tail recursive functions from stdlib you'd still have many Coq functions that would grow the stack beyond perf's arbitrary limit :(

view this post on Zulip Emilio Jesús Gallego Arias (May 13 2020 at 09:47):

We can at least document that, and hope that upstream does allow a setting in /sys to tweak this at some point

view this post on Zulip Pierre-Marie Pédrot (May 13 2020 at 10:02):

@Emilio Jesús Gallego Arias we already do that somehow, most of the calls to List.map are actually referring to CList.map which is tailrec.

view this post on Zulip Pierre-Marie Pédrot (May 13 2020 at 10:03):

Usually, the large stack traces that I observe with perf are caused by functions that can't be made tailrec in a reasonable way, e.g. typically Constr.iter and friends

view this post on Zulip Pierre-Marie Pédrot (May 13 2020 at 10:05):

@Janno I am a big user of perf, but I was never really hurt by this limitation since I usually care about the end of the traces, not their beginning. With a bit of flair you can quickly spot the root of a performance issue just by looking at the tail of trace, not having to care for what happens before.

view this post on Zulip Janno (May 13 2020 at 10:10):

@Pierre-Marie Pédrot I understand. This limitation also hasn't stopped me from deriving (sometimes) useful hypotheses from perf. Still, it makes certain things awkward. For example, comparing function overhead between profiles with different stack heights can lead to artificial differences because the function overhead depends on how many stack dumps were cut off.

view this post on Zulip Janno (May 13 2020 at 10:13):

Oh, I should mention: the default stack size limit of perf is 8k. To actually get the maximum 65528 you have to specify it via --call-graph dwarf,65528. (Warning: this will blow up the size of perf.data!)
Also, I just learned about sysctl -w kernel.perf_event_max_stack=1024 which sets the maximum number of stack entries. This seems to be a factor that comes into play once the stack size limit is raised above 8k. Note: perf script apparently needs to be called with --max-stack=1024 (but not perf record).

view this post on Zulip Janno (May 13 2020 at 10:27):

Apparently the combination of 1024 stack entries and 65528 stack size helps a lot. Here's the same coqc invocation using both parameters. I don't see any more cut-off stacks. Sadly, I don't know what's up with the remaining weird stacks that still account for ~15% of the time. Still, it's an improvement.

view this post on Zulip Pierre-Marie Pédrot (May 13 2020 at 10:35):

Yes, indeed I do often have to raise the stack limit via sysctl.


Last updated: Oct 16 2021 at 02:03 UTC