Stream: Coq devs & plugin devs

Topic: CClosure Cache & Universes


Janno (Jun 05 2020 at 20:27):

I find myself once again suffering acutely from the problem described in https://github.com/coq/coq/issues/8721. I can reliably reduce the time spent in CClosure.ref_value_cache by 70% if I carefully unfold all my universe polymorphic constants before I give the terms to CClosure.whd_stack. Can anyone think of a solution to this problem? Josh and I tried to delay the universe substitution for polymorphic constants (https://github.com/coq/coq/pull/8747) but it turned out to do more work in general than the current eager substitution. Are there other ways of attacking this?
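A minimal sketch of the workaround, with hypothetical names (poly_id and the nested application stand in for the actual Mtac2 constants): pre-unfolding the universe polymorphic constants up front means reduction no longer re-substitutes universe instances into their bodies each time they are unfolded.

Polymorphic Definition poly_id {A : Type} (a : A) := a.

(* Handing this term to reduction directly makes CClosure substitute
   a universe instance into poly_id's body at each unfolding. *)
Definition slow_term := poly_id (poly_id 0).

(* Pre-unfolding the polymorphic constant removes it before reduction. *)
Definition fast_term := Eval unfold poly_id in poly_id (poly_id 0).
Print fast_term. (* fast_term = 0 *)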

Gaëtan Gilbert (Jun 05 2020 at 20:52):

I don't understand/remember what happened to the associated PR. The bench shows ~1.5% slowdown on mathcomp stuff. The comments say

I think we identified the reason for such large slowdowns (it is because we can actually do significantly more work [repeatedly reapplying the same substitutions] when we defer universe substitutions in this way)

but

Did I forget some off-bench slowdown?

Janno (Jun 05 2020 at 20:54):

No, I think all the comments were about the bench slowdowns.

Gaëtan Gilbert (Jun 05 2020 at 20:55):

Also, in the PR you said

No need to run the benchmarks. The improvements I saw locally come from other commits that I included via a rebase.

Did you try the branch (or a rebase of it) with your new problem?

Gaëtan Gilbert (Jun 05 2020 at 20:56):

Or is it still the old problem?

Janno (Jun 05 2020 at 20:57):

I think that comment about not rerunning was about a potential new speedup from changes we had introduced (together with commits from master, by accident). The original patch definitely improved Mtac2 performance.

Janno (Jun 05 2020 at 20:57):

It's a new example and I haven't tried the patch again. It would probably be quite hairy to get all the dependencies of the current example compiled with a patched Coq version.

Gaëtan Gilbert (Jun 05 2020 at 20:58):

If we can't try them there's no point in hypothesizing solutions.

Janno (Jun 05 2020 at 20:58):

Right, I can certainly try to come up with a self-contained version of this that is easier to benchmark.

Janno (Jun 05 2020 at 21:00):

But I know the underlying problem is still the same. As soon as I define my terms with an Eval unfold <all polymorphic functions> in .. preamble I get very measurable speed-ups (20% of overall time in this random example).

Janno (Jun 05 2020 at 21:03):

And the time spent in Constr.map (transitively called from ref_value_cache, so probably substitution?) goes down from ~90% to 20% (relative to the time spent in ref_value_cache). I.e. the remaining constant unfolding work now mostly doesn't need to traverse terms anymore. At least that's my naive interpretation.

Gaëtan Gilbert (Jun 05 2020 at 21:04):

Is it the same if instead of unfold you use lazy delta [list of constants] or cbv delta [...]?

Janno (Jun 05 2020 at 21:05):

No, lazy delta works even better than unfold (probably because I am bad at putting constants in the right order)

Gaëtan Gilbert (Jun 05 2020 at 21:07):

unfold looks like it just calls lazy to me so not sure what's going on

Janno (Jun 05 2020 at 21:08):

I thought unfold x1 .. xN would unfold in order, and if you get the order wrong then xN might unfold to xN-1, which then remains folded

Janno (Jun 05 2020 at 21:08):

In any case the difference is immense. Let me see what perf says regarding the time spent in reduction & substitution.

Gaëtan Gilbert (Jun 05 2020 at 21:09):

must have misread the code then

Janno (Jun 05 2020 at 21:11):

No, you are probably right. The Mtac2 program simply doesn't work anymore, but since my way of benchmarking is do 100!(my_code; fail) I didn't notice :(

Gaëtan Gilbert (Jun 05 2020 at 21:11):

what about cbv btw

Janno (Jun 05 2020 at 21:12):

Same as lazy in that the program breaks. They both seem to reduce the term too much?

Gaëtan Gilbert (Jun 05 2020 at 21:12):

indeed the order matters

Definition one := 0.
Definition two := one.

Definition three := Eval unfold one, two in two.
Print three. (* three = one *)

Definition four := Eval unfold two, one in two.
Print four. (* four = 0 *)

Gaëtan Gilbert (Jun 05 2020 at 21:16):

lazy may be reducing too little (?)

Definition fo := Eval unfold plus in let x := 0 in x.
Print fo. (* fo = 0 *)

Definition fo' := Eval lazy delta [plus] in let x := 0 in x.
Print fo'. (* fo' = let x := 0 in x *)

Gaëtan Gilbert (Jun 05 2020 at 21:18):

unfold foo should be equivalent to lazy beta iota zeta delta [foo], does that work? what about cbv with same flags?
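A quick check of that equivalence on the earlier let example (fo'' is a new name for illustration): with zeta enabled, the let-binding now reduces as it does under unfold.

Definition fo'' := Eval lazy beta iota zeta delta [plus] in let x := 0 in x.
Print fo''. (* fo'' = 0 *)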

Janno (Jun 05 2020 at 21:19):

Too little should never be a problem for Mtac2. CClosure will have to chew some more but that's it.

Janno (Jun 05 2020 at 21:19):

I am actually getting a ReductionFailure in Mtac2 which happens when arguments to the reduction primitive are not in normal form. This is so weird.

Janno (Jun 05 2020 at 21:19):

Let me try!

Janno (Jun 05 2020 at 21:22):

lazy produces a term that runs as fast as the one generated by unfold (and it runs successfully, this time). cbv seems to generate a term that takes longer to reduce, but not by much.

Gaëtan Gilbert (Jun 05 2020 at 21:24):

The cbv step is the same speed but running the Mtac2 program is slower?

Janno (Jun 05 2020 at 21:25):

Oh, I am not yet measuring the time it takes to unfold things, only how fast the resulting code runs.

Gaëtan Gilbert (Jun 05 2020 at 21:26):

Can you try to find a difference between the produced terms for lazy vs cbv?

Gaëtan Gilbert (Jun 05 2020 at 21:28):

(we will need to finish some other time, sleep approaches)

Janno (Jun 05 2020 at 21:28):

I'll try. Timing-wise the (pre)reduction steps are very similar: 0.125ms for cbv, 0.135ms for lazy (they seem to be pretty stable).

Janno (Jun 05 2020 at 21:28):

Sure! Thanks for your help and have a good night (and weekend)!

Janno (Jun 05 2020 at 21:35):

It seems the terms are actually identical and I just got unlucky when running the timings further down in the file. The difference was small (6.2s for cbv versus 5.9s for lazy) but it was the same over 3 iterations with each version. Now when I run them I get 5.9s for both versions.

Janno (Jun 05 2020 at 21:37):

And when I cleverly re-arrange the terms given to unfold I can get all three methods to produce the same term. This has been a fun excursion! :)


Last updated: Oct 16 2021 at 02:03 UTC