Does Coq use different settings for GC in coqtop versus coqc? If so, could that epxlain why I am seeing a reproducible speedup from OCaml 4.07 to 4.14 in coqtop but a (much much larger) reproducible slowdown in coqc?
I am asking about GC specifically because the perf profile of coqc on 4.14 shows a huge amount of extra GC time. These functions also trigger GC on 4.07 but to much a lesser extent, according to perf
. The file I am looking at suffers a 10x performance degradation from this phenomenon in one specific (custom plugin) OCaml function.
relevant code is around https://github.com/coq/coq/blob/fdc8569cf9a24bcd90e3db218f56999b78557808/sysinit/coqinit.ml#L52-L76
it seems to be called by both coqtop and coqc
is the slowdown everywhere or for specific operations? eg it could be slower at writing the vo
The slowdown in this file I am looking is almost entirely in one of our own custom tactics implement in OCaml. 4.07 takes about 100ms to execute it in coqc but 4.14 takes 1.2s due to GC.
what about coqtop times?
For some reason perf
does not want to capture the tactic invocation there. But coqtop on 4.14 is about 15% faster than coqtop on 4.07.
what does Time say in all 4 cases?
Here's two traces from coqc, the first on 4.07 and the second on 4.14. The second Tactic call
is the custom Ocaml tactic, the first is a pure ltac
tactic. The second to last item is Qed
. The rest is Requires/Imports/Section/End, etc.
Finished transaction in 1.847 secs (1.51u,0.336s) (successful)
Finished transaction in 0. secs (0.u,0.s) (successful)
Finished transaction in 0.142 secs (0.138u,0.003s) (successful)
Finished transaction in 0.001 secs (0.001u,0.s) (successful)
Finished transaction in 0.001 secs (0.001u,0.s) (successful)
Tactic call ran for 0.681 secs (0.676u,0.005s) (success)
Tactic call ran for 0.461 secs (0.455u,0.005s) (success)
Finished transaction in 0.164 secs (0.163u,0.s) (successful)
Finished transaction in 0. secs (0.u,0.s) (successful)
Finished transaction in 1.749 secs (1.409u,0.339s) (successful)
Finished transaction in 0. secs (0.u,0.s) (successful)
Finished transaction in 0.402 secs (0.398u,0.003s) (successful)
Finished transaction in 0.001 secs (0.001u,0.s) (successful)
Finished transaction in 0.002 secs (0.002u,0.s) (successful)
Tactic call ran for 1.098 secs (1.094u,0.003s) (success)
Tactic call ran for 1.309 secs (1.306u,0.003s) (success)
Finished transaction in 0.197 secs (0.197u,0.s) (successful)
Finished transaction in 0. secs (0.u,0.s) (successful)
Could you add Optimize Heap
before the tactic? Just to make sure that you are not measuring some spurious GC.
Hm, I am not sure that I am measuring the right thing with perf
. The profile clearly shows that coqtop on 4.14 ran for less time thant coqtop on 4.07 but the Time
and time
output I get in coqtop are the same as in coqc which would indicate that the slowdown goes the same way in both coqtop and coqc
Let's ignore the oddity about coqtop
seemingly becoming faster for now. The slowdown in general is definitely real as shown by the timings.
Optimize Heap
doesn't change anything
Actually, it does make a difference if I duplicate the proof on 4.14 and run Optimize Heap
in between the two proofs I get the same timings as on 4.07 on the second proof.
This could also explain something else that bothers me: across our code base the slowdown is basically equal in absolute terms despite the fact that the files vary in total execution time (and especially time spent in our custom tactics) between a few seconds and a handful of minutes
Oooh, it's not even because of Optimize Heap
. Even without that the second copy of the proof has the same performance as that on 4.07
It seems the first run of our custom tactic does all the GC and then the rest of the file can be fast again
Going through the proof once and then backtracking also works (which makes sense, I guess)
I don't understand any of this. Here's a trace of running the same pure ltac tactic (not our custom OCaml code) four times in a row on 4.14: (using Succeed
to undo any effects)
Tactic call ran for 1.14 secs (1.136u,0.003s) (success)
Tactic call ran for 0.922 secs (0.92u,0.001s) (success)
Tactic call ran for 1.146 secs (1.144u,0.002s) (success)
Tactic call ran for 0.589 secs (0.588u,0.001s) (success)
It stabilizes around 0.58s after that
does Print Debug GC before and after the command reveal anything?
minor words promoted words major words minor_collections major_collections heap_words heap_chunks live_words live_blocks free_words free_blocks largest_free fragments compactions top_heap_words stack_size
166775298 20578668 120996621 14 2 122009088 37 120794960 32342804 1213061 1 1213061 1067 0 122009088 143
356783916 21608756 122726284 22 3 122009088 37 121168740 32321028 812131 11487 156853 28217 0 122009088 143
545908748 22144549 123270121 28 4 122009088 37 119591030 32182380 2381313 63833 425412 36745 0 122009088 143
735032603 22887186 124020248 33 4 122009088 37 119702200 32209630 2264512 71712 425412 42376 0 122009088 143
924165624 23499121 124644131 39 4 122009088 37 120325853 32386566 1619440 12091 425412 63795 0 122009088 143
This time it took 5 repetitions of the tactic to make it go fast.
Oh, no, it's still 4
The first line is from before running the tactic the first time
AFK for a while. Let me know if there is any other data I can collect. I'll gladly do so!
Back again. I can't say that these numbers mean anything to me. It doesn't look like anything dramatic happens between the last two rows.
I can recreate the same weird effect by posing a much simpler goal before the original lemma, letting our automation tactic Succeed
on it 50 times, Abort
ing that proof and continuing with the original lemma, which now runs very fast. (Even faster than 4.07.)
These 50 runs on the simpler goal vary a lot in execution time:
0.081,0.008,0.008,0.008,0.051,0.009,0.047,0.009,0.259,0.009,0.159,0.009,0.127,0.008,0.008,0.008,0.053,0.009,0.047,0.008,0.257,0.009,0.157,0.008,0.134,0.008,0.008,0.008,0.047,0.009,0.009,0.008,0.01,0.009,0.009,0.008,0.009,0.008,0.009,0.008,0.008,0.008,0.008,0.008,0.008,0.017,0.018,0.009,0.011,0.008
It seems I need to waste about 1.5s in GC before the rest of the file becomes fast. :D
(I should mention that this effect was also observed on other machines. It's not just my computer.)
I wonder if the GC in 4.14 is just much more aggressive. coqc max resident set size on 4.14 is about 10% below that on 4.07. If I tune the settings with OCAMLRUNPARAM='a=2,o=12000'
I get comparable max rss and performance, even slightly better performance on 4.14.
(12000 was picked more or less randomly after 120 didn't yield perform as nice as 4.07)
Wow! I ran our CI on our entire development with 4.14 and OCAMLRUNPARAM='a=2,o=800'
(which seems to be the highest value that still makes a difference in execution time on my local machine) and the total time went from ~25 minutes (similar to the total on 4.07) to 16 minutes. CPU instructions increased in all proofs, sometimes as much as +25%, but despite that some of these same proofs saw a decrease of total runtime of several minutes or up to 40% in some cases.
This is the best reason for not just looking at cpu instructions I have seen in a long time.
You're trading memory for speed, wasn't it somehow expected though?
You assume I know what I doing with these parameters. As far as I can tell from my limited testing the memory consumption is now similar to that on 4.07 so I wasn't expecting such a speedup.
Also, why are there so many more cpu instructions with these parameters?
Okay, the memory consumption is not all that comparable. Some files take twice as much memory (despite being no faster with these GC settings on 4.14).
Still, it's nice to know that we can cut our CI time by 30% by using all the unused memory on our CI servers
Sadly that means that some files are still slower on 4.14 :(
Last updated: Sep 15 2024 at 13:02 UTC