I find myself once again suffering acutely from the problem described in https://github.com/coq/coq/issues/8721. I can reliably reduce the time spent in
CClosure.ref_value_cache by 70% if I carefully unfold all my universe polymorphic constants before I give the terms to
CClosure.whd_stack. Can anyone think of a solution to this problem? Josh and I tried to delay the universe substitution for polymorphic constants (https://github.com/coq/coq/pull/8747) but it turned out to do more work in general than the current eager substitution. Are there other ways of attacking this?
I don't understand/remember what happened to the associated PR. The bench shows ~1.5% slowdown on mathcomp stuff. The comments say
I think we identified the reason for such large slowdowns (it is because we can actually do significantly more work [repeatedly reapplying the same substitutions] when we defer universe substitutions in this way)
Did I forget some off-bench slowdown?
No, I think all the comments were about the bench slowdowns.
Also, in the PR you said
No need to run the benchmarks. The improvements I saw locally come from other commits that I included via a rebase.
Did you try the branch (or a rebase of it) with your new problem?
Or is it still the old problem?
I think that comment about not rerunning was about a potential new speedup from changes we had introduced (together with commits from master, by accident). The original patch definitely improved Mtac2 performance.
It's a new example and I haven't tried the patch again. It would probably be quite hairy to get all the dependencies of the current example compiled with a patched Coq version.
If we can't try them there's no point in hypothesizing solutions.
Right, I can certainly try to come up with a self-contained version of this that is easier to benchmark
But I know the underlying problem is still the same. As soon as I define my terms with an
Eval unfold <all polymorphic functions> in .. preamble I get very measurable (20% overall time in this random example) speed-ups.
And the time spent in
Constr.map (transitively called from
ref_value_cache, so probably substitution?) goes down from ~90% to 20% (relative to the time spent in
ref_value_cache). I.e. the remaining constant unfolding work now mostly doesn't need to traverse terms anymore. At least that's my naive interpretation.
Is it the same if instead of
unfold you use
lazy delta [list of constants] or
cbv delta [...]?
lazy delta works even better than
unfold (probably because I am bad at putting constants in the right order)
unfold looks like it just calls
lazy to me so not sure what's going on
I thought unfold x1 .. xN would unfold in order and if you get the order wrong then xN might unfold to xN-1 and that then remains folded
In any case the difference is immense. Let me see what
perf says regarding the time spent in reduction & substitution.
must have misread the code then
No, you are probably right. The Mtac2 program simply doesn't work anymore but since my way of bencharmking is
do 100!(my_code; fail) I didn't notice :(
what about cbv btw
same as lazy in that the program breaks. they both seem to reduce the term too much?
indeed the order matters
Definition one := 0. Definition two := one. Definition three := Eval unfold one, two in two. Print three. (* three = one *) Definition four := Eval unfold two, one in two. Print four. (* four = 0 *)
lazy may be reducing too little (?)
Definition fo := Eval unfold plus in let x := 0 in x. Print fo. (* fo = 0 *) Definition fo' := Eval lazy delta [plus] in let x := 0 in x. Print fo'. (* fo' = let x := 0 in x *)
unfold foo should be equivalent to
lazy beta iota zeta delta [foo], does that work? what about cbv with same flags?
Too little should never be a problem for Mtac2. CClosure will have to chew some more but that's it.
I am actually getting a ReductionFailure in Mtac2 which happens when arguments to the reduction primitive are not in normal form. This is so weird.
Let me try!
lazy produces a term that runs as fast as than the one generated by
unfold (and it runs successfully, this time).
cbv seems to generate term that takes longer to reduce but not by much.
The cbv step is same speed but the mtac running is slower?
Oh, I am not yet measuring the time it takes to unfold things, only how fast the resulting code runs.
Can you try to find a difference between the produced terms for lazy vs cbv?
(we will need to finish some other time, sleep approaches)
I'll try. Timing wise the (pre)reduction steps are very similar: 0.125ms for cbv, 0.135ms for lazy (seem to be pretty stable)
Sure! Thanks for your help and have a good night (and weekend)!
It seems the terms are actually identical and I just got unlucky when running the timings further down in the file. The difference was small (6.2s for cbv versus of 5.9s for lazy) but they were the same over 3 iterations with each version. Now when I run them I get 5.9 for both versions.
And when I cleverly re-arrange the terms given to
unfold I can get all three methods to produce the same term. This has been a fun excursion! :)
Last updated: Dec 05 2023 at 06:01 UTC