Stream: Coq devs & plugin devs

Topic: GC performance on OCaml 4.14


Janno (May 02 2022 at 11:45):

Does Coq use different settings for GC in coqtop versus coqc? If so, could that explain why I am seeing a reproducible speedup from OCaml 4.07 to 4.14 in coqtop but a (much, much larger) reproducible slowdown in coqc?

Janno (May 02 2022 at 11:58):

I am asking about GC specifically because the perf profile of coqc on 4.14 shows a huge amount of extra GC time. These functions also trigger GC on 4.07 but to a much lesser extent, according to perf. The file I am looking at suffers a 10x performance degradation from this phenomenon in one specific (custom plugin) OCaml function.

Gaëtan Gilbert (May 02 2022 at 12:02):

relevant code is around https://github.com/coq/coq/blob/fdc8569cf9a24bcd90e3db218f56999b78557808/sysinit/coqinit.ml#L52-L76
it seems to be called by both coqtop and coqc
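The code at that link configures the OCaml runtime's GC at startup via the Gc module. A minimal sketch of that pattern (the values below are illustrative, not Coq's actual settings, which live at the URL above):

```ocaml
(* Sketch of startup GC tuning in the style of sysinit/coqinit.ml.
   minor_heap_size is in words, not bytes; space_overhead controls how
   eagerly the major GC runs (higher = lazier, more memory). *)
let tune_gc () =
  let c = Gc.get () in
  Gc.set { c with
           Gc.minor_heap_size = 32 * 1024 * 1024;  (* illustrative value *)
           Gc.space_overhead = 120 }               (* OCaml's default *)

let () = tune_gc ()
```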

Gaëtan Gilbert (May 02 2022 at 12:06):

is the slowdown everywhere or for specific operations? eg it could be slower at writing the vo

Janno (May 02 2022 at 12:07):

The slowdown in the file I am looking at is almost entirely in one of our own custom tactics implemented in OCaml. 4.07 takes about 100ms to execute it in coqc but 4.14 takes 1.2s due to GC.

Gaëtan Gilbert (May 02 2022 at 12:09):

what about coqtop times?

Janno (May 02 2022 at 12:10):

For some reason perf does not want to capture the tactic invocation there. But coqtop on 4.14 is about 15% faster than coqtop on 4.07.

Gaëtan Gilbert (May 02 2022 at 12:12):

what does Time say in all 4 cases?

Janno (May 02 2022 at 12:14):

Here are two traces from coqc, the first on 4.07 and the second on 4.14. The second Tactic call is the custom OCaml tactic; the first is a pure Ltac tactic. The second-to-last item is the Qed. The rest is Requires/Imports/Section/End, etc.

Finished transaction in 1.847 secs (1.51u,0.336s) (successful)
Finished transaction in 0. secs (0.u,0.s) (successful)
Finished transaction in 0.142 secs (0.138u,0.003s) (successful)
Finished transaction in 0.001 secs (0.001u,0.s) (successful)
Finished transaction in 0.001 secs (0.001u,0.s) (successful)
Tactic call ran for 0.681 secs (0.676u,0.005s) (success)
Tactic call ran for 0.461 secs (0.455u,0.005s) (success)
Finished transaction in 0.164 secs (0.163u,0.s) (successful)
Finished transaction in 0. secs (0.u,0.s) (successful)

Finished transaction in 1.749 secs (1.409u,0.339s) (successful)
Finished transaction in 0. secs (0.u,0.s) (successful)
Finished transaction in 0.402 secs (0.398u,0.003s) (successful)
Finished transaction in 0.001 secs (0.001u,0.s) (successful)
Finished transaction in 0.002 secs (0.002u,0.s) (successful)
Tactic call ran for 1.098 secs (1.094u,0.003s) (success)
Tactic call ran for 1.309 secs (1.306u,0.003s) (success)
Finished transaction in 0.197 secs (0.197u,0.s) (successful)
Finished transaction in 0. secs (0.u,0.s) (successful)

Guillaume Melquiond (May 02 2022 at 12:18):

Could you add Optimize Heap before the tactic? Just to make sure that you are not measuring some spurious GC.
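Coq's Optimize Heap vernacular boils down, as far as I know, to a heap compaction in the OCaml runtime; a standalone sketch of the equivalent, assuming Gc.compact is what is behind it:

```ocaml
(* A full major collection followed by heap compaction, which defragments
   the major heap and returns free chunks to the OS where possible. *)
let optimize_heap () = Gc.compact ()

let () = optimize_heap ()
```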

Janno (May 02 2022 at 12:18):

Hm, I am not sure that I am measuring the right thing with perf. The profile clearly shows that coqtop on 4.14 ran for less time than coqtop on 4.07, but the Time and time output I get in coqtop is the same as in coqc, which would indicate that the slowdown goes the same way in both coqtop and coqc.

Janno (May 02 2022 at 12:21):

Let's ignore the oddity about coqtop seemingly becoming faster for now. The slowdown in general is definitely real as shown by the timings.

Janno (May 02 2022 at 12:21):

Optimize Heap doesn't change anything

Janno (May 02 2022 at 13:07):

Actually, it does make a difference. If I duplicate the proof on 4.14 and run Optimize Heap in between the two proofs, I get the same timings on the second proof as on 4.07.

Janno (May 02 2022 at 13:08):

This could also explain something else that bothers me: across our code base the slowdown is basically equal in absolute terms, despite the fact that the files vary in total execution time (and especially time spent in our custom tactics) between a few seconds and a handful of minutes.

Janno (May 02 2022 at 13:10):

Oooh, it's not even because of Optimize Heap. Even without it, the second copy of the proof performs the same as on 4.07.

Janno (May 02 2022 at 13:10):

It seems the first run of our custom tactic does all the GC work, and then the rest of the file can be fast again.

Janno (May 02 2022 at 13:12):

Going through the proof once and then backtracking also works (which makes sense, I guess)

Janno (May 02 2022 at 13:17):

I don't understand any of this. Here's a trace of running the same pure Ltac tactic (not our custom OCaml code) four times in a row on 4.14, using Succeed to undo any effects:

Tactic call ran for 1.14 secs (1.136u,0.003s) (success)
Tactic call ran for 0.922 secs (0.92u,0.001s) (success)
Tactic call ran for 1.146 secs (1.144u,0.002s) (success)
Tactic call ran for 0.589 secs (0.588u,0.001s) (success)

Janno (May 02 2022 at 13:17):

It stabilizes around 0.58s after that

Gaëtan Gilbert (May 02 2022 at 13:18):

does Print Debug GC before and after the command reveal anything?

Janno (May 02 2022 at 13:24):


minor_words   promoted_words   major_words   minor_collections   major_collections   heap_words   heap_chunks   live_words   live_blocks   free_words   free_blocks   largest_free   fragments   compactions   top_heap_words   stack_size
166775298   20578668    120996621   14  2   122009088   37  120794960   32342804    1213061 1   1213061 1067    0   122009088   143
356783916   21608756    122726284   22  3   122009088   37  121168740   32321028    812131  11487   156853  28217   0   122009088   143
545908748   22144549    123270121   28  4   122009088   37  119591030   32182380    2381313 63833   425412  36745   0   122009088   143
735032603   22887186    124020248   33  4   122009088   37  119702200   32209630    2264512 71712   425412  42376   0   122009088   143
924165624   23499121    124644131   39  4   122009088   37  120325853   32386566    1619440 12091   425412  63795   0   122009088   143
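The columns above are the fields of OCaml's Gc.stat record, which is presumably what Print Debug GC reads out. A small sketch for printing the fields most relevant here from plain OCaml:

```ocaml
(* Print a few Gc.stat fields: cumulative minor allocation, major
   collection count, and current/peak major heap size (all in words). *)
let print_gc_stats () =
  let s = Gc.stat () in
  Printf.printf
    "minor_words=%.0f major_collections=%d heap_words=%d top_heap_words=%d\n"
    s.Gc.minor_words s.Gc.major_collections s.Gc.heap_words s.Gc.top_heap_words

let () = print_gc_stats ()
```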

Janno (May 02 2022 at 13:24):

This time it took 5 repetitions of the tactic to make it go fast.

Janno (May 02 2022 at 13:24):

Oh, no, it's still 4

Janno (May 02 2022 at 13:24):

The first line is from before running the tactic for the first time.

Janno (May 02 2022 at 13:27):

AFK for a while. Let me know if there is any other data I can collect. I'll gladly do so!

Janno (May 02 2022 at 14:18):

Back again. I can't say that these numbers mean anything to me. It doesn't look like anything dramatic happens between the last two rows.

Janno (May 02 2022 at 14:33):

I can recreate the same weird effect by posing a much simpler goal before the original lemma, letting our automation tactic Succeed on it 50 times, Aborting that proof, and continuing with the original lemma, which then runs very fast. (Even faster than on 4.07.)

Janno (May 02 2022 at 14:35):

These 50 runs on the simpler goal vary a lot in execution time:
0.081,0.008,0.008,0.008,0.051,0.009,0.047,0.009,0.259,0.009,0.159,0.009,0.127,0.008,0.008,0.008,0.053,0.009,0.047,0.008,0.257,0.009,0.157,0.008,0.134,0.008,0.008,0.008,0.047,0.009,0.009,0.008,0.01,0.009,0.009,0.008,0.009,0.008,0.009,0.008,0.008,0.008,0.008,0.008,0.008,0.017,0.018,0.009,0.011,0.008

Janno (May 02 2022 at 14:35):

It seems I need to waste about 1.5s in GC before the rest of the file becomes fast. :D

Janno (May 02 2022 at 14:37):

(I should mention that this effect was also observed on other machines. It's not just my computer.)

Janno (May 02 2022 at 14:48):

I wonder if the GC in 4.14 is just much more aggressive. coqc's max resident set size on 4.14 is about 10% below that on 4.07. If I tune the settings with OCAMLRUNPARAM='a=2,o=12000' I get comparable max RSS and performance, even slightly better performance on 4.14.
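For reference, those two OCAMLRUNPARAM letters map onto Gc.control fields: a sets allocation_policy (2 = best-fit, the default in 4.14) and o sets space_overhead. The same tuning can be done programmatically:

```ocaml
(* Programmatic equivalent of OCAMLRUNPARAM='a=2,o=12000':
   best-fit allocation policy plus a much lazier major GC. *)
let () =
  let c = Gc.get () in
  Gc.set { c with
           Gc.allocation_policy = 2;   (* a=2: best-fit *)
           Gc.space_overhead = 12000 } (* o=12000: trade memory for speed *)
```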

Janno (May 02 2022 at 14:50):

(12000 was picked more or less randomly after 120 didn't yield performance as nice as 4.07's.)

Janno (May 02 2022 at 15:46):

Wow! I ran our CI on our entire development with 4.14 and OCAMLRUNPARAM='a=2,o=800' (which seems to be the highest value that still makes a difference in execution time on my local machine) and the total time went from ~25 minutes (similar to the total on 4.07) to 16 minutes. CPU instructions increased in all proofs, sometimes by as much as +25%, but despite that some of those same proofs saw their total runtime decrease by several minutes, or up to 40% in some cases.

Janno (May 02 2022 at 15:46):

This is the best argument I have seen in a long time for not just looking at CPU instruction counts.

Pierre-Marie Pédrot (May 02 2022 at 15:50):

You're trading memory for speed; wasn't that somewhat expected, though?

Janno (May 02 2022 at 15:51):

You assume I know what I am doing with these parameters. As far as I can tell from my limited testing, the memory consumption is now similar to that on 4.07, so I wasn't expecting such a speedup.

Janno (May 02 2022 at 15:51):

Also, why are there so many more cpu instructions with these parameters?

Janno (May 02 2022 at 16:02):

Okay, the memory consumption is not all that comparable. Some files take twice as much memory (despite being no faster with these GC settings on 4.14).

Janno (May 02 2022 at 16:03):

Still, it's nice to know that we can cut our CI time by 30% by using all the unused memory on our CI servers.

Janno (May 02 2022 at 16:04):

Sadly that means that some files are still slower on 4.14 :(


Last updated: Feb 05 2023 at 19:29 UTC