I am seeing a pretty significant performance regression in ssreflect's apply: tactic, with some files spending 300% as much time in the tactic as they do in 8.15. The maximum time out of ~3300 calls to apply: in the example I am looking at only goes from 9ms to 11ms (according to Ltac Profile, so take that with a grain of salt). So it seems the tactic just got uniformly slower somehow. I am preparing flame graphs and all that, but I wanted to ask first if anybody had a hunch about what might have changed. I couldn't find relevant changes to the tactic itself between 8.15 and 8.16.
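For reference, a minimal sketch (mine, not from the original report) of how such per-tactic numbers can be gathered with Ltac profiling; the actual development is of course not shown here:
(* Enable Ltac profiling, replay the proofs of interest, then print the report. *)
Set Ltac Profiling.
(* ... proofs using apply: go here ... *)
Show Ltac Profile.
(* The report has per-tactic total time, call count and max time columns,
   which is the kind of data quoted above. *)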
Initial perf profile results show that the profile of apply: on 8.16 is dominated by evarsolve functions related to aliases: image.png
For comparison, on 8.15 it's mostly kernel reduction: image.png
Is this possibly related to reversible coercions? I don't see benchmark results in https://github.com/coq/coq/pull/15693 but the discussion is huge so maybe they are hidden somewhere
Oh, there is at least one benchmark in there: https://gitlab.com/coq/coq/-/jobs/2100592270. To me it seems like a pretty clear regression in ssreflect-heavy projects.
Be that as it may, I am not actually convinced that MR is to blame for the profile I am seeing. I cannot even find the word alias in the diff.
Would be great if you could provide a small test case so we can bisect
I'll try to extract something. I am looking at a closed-source development and it will take a while to remove everything I am not allowed to show.
Yes, alternatively you can bisect yourself; if you use the dune build of Coq, that is usually pretty fast
though I'm unsure how deep your problem sits inside your dev, of course
Since I am already looking at benchmarks of PRs that could be related: https://github.com/coq/coq/pull/15789 is another PR with a clear (albeit smaller) performance regression in ssreflect-heavy projects: https://github.com/coq/coq/pull/15789#issuecomment-1065881904
Does #16450 change anything? (it's a master PR, but maybe you can try it there)
I don't think I can bisect this problem without minimizing it first. The 8.16 bump was not particularly easy for our development (for a host of reasons), so I expect exactly 0 points between 8.15 and 8.16 to build smoothly. Also, even just the dependencies in the project itself take roughly 10 minutes to build.
The performance results of that PR certainly look promising :D
But it seems https://github.com/coq/coq/pull/16448 is the reason for the performance improvements?
For the observable ones, yes. This PR was originally part of the other one.
You can try to backport it, it's very self-contained.
I'll try that.
(I'd still be curious to get small examples stressing the aliasing mechanism.)
Honestly I'm not even sure why we have such a contraption in evarsolve.
I backported the two MRs and repeated the experiment. The regression has reduced in magnitude to 1.7x (instead of 3x).
Evarsolve functions essentially disappeared from the profile so that's great
There is still something going on with strings that wasn't as strongly represented in the profile in 8.15
Here's the profile on 8.15: https://share.firefox.dev/3VEAfWW
and here's the profile on 8.16 with the backported MRs: https://share.firefox.dev/3yVQ6qt
Those are inverted call stacks restricted only to apply: itself (by searching for ssrbwd). The percentages are not comparable between the two because the overall time varies. But the number of samples is comparable (99 samples per second in both cases).
Actually, now that I click around a bit, it seems aliasing computation is still the culprit. caml_string_compare (second in the profile on patched 8.16) contains 1.7% Evarsolve.compare_alias.
So, it seems the two MRs have removed a lot of internal overhead of alias computation, but there is still much more computation done compared to 8.15; only now all the time is spent in functions called by these Evarsolve functions.
yes, it's not solving the macroscopic problem, it just fixes the fact that the old code was especially slow
if you could bisect to find the source of the evarsolve blowup it'd be great
I am not sure how to bisect this. I think I first need to read the code paths to get an idea of what might be a good reproduction case.
@Janno you are not sure in terms of build / git commands?
No, I am not sure about finding a good example. I have over 3000 calls to apply: and every single one only takes a few ms at best. I don't know how to find out which calls are promising to demonstrate the regression.
where does "invoke one of those 1000 times" fail?
(not saying that's a good idea, but I hope it leads to one?)
Indeed. I guess I can start adding idtacs to all our uses of apply: and start running each one a thousand times until I find one that clearly shows the regression
Okay, that's several hours of my life I will never get back. But I managed to finally reproduce the problem outside of my original test case:
From Coq Require Import ssreflect.
Axiom W : forall (P Q: Prop), Prop.
Axiom W_refl : forall P, W P P.
Axiom W_coerce : forall {P Q} (w : W P Q), forall x : unit, False.
Coercion W_coerce : W >-> Funclass.
Goal True.
do 10 (let n := fresh in pose proof (n := I)).
Time do 1000 (try (apply : W_refl; fail)).
The problem only becomes apparent in non-empty contexts. 10 assumptions almost double the time the failing apply: calls take on 8.16 compared to 8.15, 20 assumptions triple it, and 100 assumptions increase it by a factor of 50.
The coercion is also necessary to trigger the problem.
Thanks!
A quick bisect seems to indicate that the culprit is 828bdda605ba7737364f06f8cc00a318d1ee585b.
Author: Pierre Roux <pierre@roux01.fr>
Date: Mon Mar 14 15:57:00 2022 +0100
Remove uniform inheritance condition for coercions
This removes the uniform inheritance condition by using
inference (mostly unification, but also canonical structures or
typeclasses) to obtain the parameters of coercions.
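For context, a hedged, minimal sketch (axioms made up for illustration, not taken from the PR) of the kind of coercion declaration this change newly allows: the source class no longer has to be applied to plain variables, and the missing parameter is inferred by unification when the coercion is inserted.
Axiom C : nat -> Type.
Axiom D : Type.
Axiom f : forall n : nat, C (S n) -> D.
(* The source instance is C (S n) rather than C n, a shape the old uniform
   inheritance condition did not allow; after this commit the declaration is
   accepted and n is inferred at insertion time. *)
Coercion f : C >-> D.
Axiom c : C 3.
Check (c : D). (* inserts f 2 c, with n := 2 obtained by unifying C (S ?n) with C 3 *)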
I just arrived at the same conclusion
cc @Pierre Roux
I remember when we benched this we didn't observe any performance overhead.
I am not surprised given that the regression was already indicated by the benchmarks.
Clearly that means we are benching the wrong things.
Oh, I think we read the results differently
https://github.com/coq/coq/pull/15789#issuecomment-1065881904 clearly shows significant slowdowns in projects that use ssreflect
0.70% is well within noise range
does it?
One day I will convince all of you to stop looking at the noise generator that is the timing column
Until that day I'll spend my evenings after Coq bumps bisecting performance regressions ;)
Where should we be looking?
cpu instructions
that also seems well within noise range
No way
The variance for cpu instructions is much much smaller than that for timing results
How is it measured?
yes but changing code can have bigger effects at that level
if the OS does other stuff inbetween does that get included?
https://github.com/coq/coq/pull/16609 has a bunch of benches with no changes if you want to calibrate your understanding of bench noise
I'd rather look at the user time per command
@Janno your example seems to indicate some bad algorithmics, it's superlinear in the size of the context
@Gaëtan Gilbert great examples! https://github.com/coq/coq/pull/16609#issuecomment-1272981665 this is what cpu instruction noise looks like
still, < 1% additional CPU instruction is not very concerning even if this is real
even assuming 0% noise this is a very small effect
there are many reasons why we could have such a difference
Well, that totally depends. Is it a new feature that comes with an inherent cost? Great, 1% could be fine. But what if it's a change that ought to be free (maybe because nobody is even using the new feature)? Then 1% is very concerning.
well, everybody uses coercions, so...
Sure, in this case it's not clear to me that this change can be made without a performance regression. But I don't think there was any discussion about the impact. (Which makes sense because people thought it was just noise.)
@Pierre-Marie Pédrot can you perf the example and see what is busy?
it has the same kind of profile as the fast one actually
i.e. it's Evarsolve taking time in both cases
most likely this is bad algorithmics, i.e. we're quadratically calling the unification or something
I fear exceptions and the control flow they bring, amongst others
I have to go sleep. Thanks for looking into this!
same here, will have a look tomorrow, thanks again all
I think I have an intuition about what's going wrong.
Before we would synthesize the arguments of the coercion from static data, but now we're generating a bunch of evars that are solved by unification.
I think that we could simply reinstate the old behaviour for uniform coercions, since we already know the arguments.
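A hedged illustration of that "static data", reusing the W / W_coerce / Coercion declarations from the reproduction earlier in the thread (my example, not code from any PR):
Set Printing Coercions.
Set Printing Implicit.
Check (fun w : W True True => w tt).
(* Prints roughly: fun w : W True True => @W_coerce True True w tt.
   The two True arguments are the coercion parameters: they can be read off
   from w's type W True True, but with inferred parameters they are first
   created as evars over the whole local context and then solved by
   unification, which is what the discussion above suggests the profile is
   paying for. *)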
So the evars are what makes it scale so badly in the size of the context?
Is evar allocation still ... linear? in the size of the context?
Pierre-Marie Pédrot said:
Before we would synthesize the arguments of the coercion from static data, but now we're generating a bunch of evars that are solved by unification.
Or, said otherwise: before we had an ad hoc piece of code working only in specific cases (likely for historical reasons), now we use the generic mechanism used everywhere else in the elaborator.
Yes, but this doesn't prevent us to be more clever in simpler cases.
Sure, if we really need a fast path, let's do it.
Who is working on the issue? (just to avoid duplicate work)
@Pierre-Marie Pédrot did you implement a compact representation for evar instances? Is it the time to allocate the evar info which makes the difference?
no, the problem is really that we're solving unification problems of the form ?e[Γ] ~ M and this is just expensive
the fastest work is the one you don't perform
I'm working on the issue but this is trickier than partially reverting the code, because there were other changes on top of that, it seems
my naive attempt to pass the inferred arguments didn't solve the issue
Pierre-Marie Pédrot said:
no, the problem is really that we're solving unification problems of the form ?e[Γ] ~ M and this is just expensive
I think the optimization you want to pull should be a last resort.
Imo Γ |- ?e[Γ] ~ M should not be expensive, especially if ?e is fresh, hence cannot occur in M.
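A self-contained toy instance of that problem shape (my example): the evar is fresh, its instance is the identity on the local context, and solving it should amount to little more than one assignment.
Goal forall x y : nat, exists n : nat, n = x + y.
Proof.
  intros x y.
  eexists.     (* creates a fresh evar ?n over the context [x; y], with an identity instance *)
  reflexivity. (* solves  ?n[x, y] ~ x + y  by assigning ?n := x + y *)
Qed.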
I don't think we have a way to tell that some evar is fresher than an arbitrary term
Not in general, but in this case we know it.
I'm releasing the lock to try to find a fix, and I can share my attempt if you want to have a look
but we would need some special api to unification to use that knowledge
I have other things on my plate, I was just suggesting a more general fix
we're doing clever stuff with evar-term problems though
Yes, I imagine it is not trivial, since substitution can duplicate
the function solving evar-term problems is literally the one that dominates every profile
Anyway, since I don't have the time to try my approach, I guess I should cave ;-)
Here I see 2 sources of optimization:
at least the former should be easy to exploit, since all the info to check for the fast code path is passed to solve_simple....
maybe it is already the case, I don't recall. But in that case I don't get the slowdown.
what is invert_definition supposed to produce when given an identity evar?
a no-op?
so this needs to be hardcoded because it's not the case today
it computes context data in O(n) and more
wait, maybe it still needs to do named -> debruijn in Coq
(disclaimer: unification is a black box to me)
I never recall if the solutions in the evar map are rel or var based
wdym?
evar instances are var based if this is what you mean
ok, then the optimization should be doable
I think we can make it way faster easily as of today though
if the local subst is the identity, then it is a FO problem: you only need to do the occurs check and, if the evar does not occur, you assign it in the evar map
there is a fast-path check but it occurs after some costly precomputation was performed
putting the fast path in the right place doesn't seem to help on this example though
maybe we're not unifying identity evars
It seems this avenue has been very fruitful! https://github.com/coq/coq/pull/16703#issuecomment-1286288827
How does this change relate to the problem of creating unnecessary evars?
It optimizes a common case where HO unif boils down to FO unif (and there is still room for improvement).
So it can be combined with what was discussed here, i.e. restoring the old code path in some cases. As I said before, I'd prefer making trivial evars costless (in all projects) rather than making the coercion code more complex. But we will see.
@Pierre-Marie Pédrot how does the PR affect @Janno's micro bench?
putting the fast path in the right place doesn't seem to help on this example though
This seems to indicate that it doesn't
oops, missed that.
BTW there is also a performance regression in https://github.com/coq/coq/pull/15693 that affected unification-heavy projects. I have no idea what it is and how to address it but I thought I'd mention it again. Maybe it will lead to another discovery of a late fast path :)
sorry about that, our bench system did not discover it. are the projects public?
Oh, sorry, I didn't mean I had found this to impact our code. I just mean that the bench results show a regression.
It could be that it impacts our code, but if it did, the impact might be masked by other improvements. Removing a bunch of (fortunately) unused coercions to Funclass has basically entirely undone the regressions that I could identify. At least the ones in proofs. I still have a 10% performance regression in parsing somewhere but I haven't gotten around to minimizing that.
Janno said:
I just mean that the bench results show a regression.
What do you mean precisely? benchs were launched before merging the PR and no regression appeared.
It is a new feature, so it does run more code. The problem is to quantify it. I know you expressed concerns about the "wallclock" measure; I can add that we have no unit benchmarks, so we are very macroscopic in our observations.
Yeah, I am definitely going to sound like a broken record, but https://gitlab.com/coq/coq/-/jobs/2100592270, even with the most generous reading, shows at least a very clear potential for a slowdown in all of mathcomp. If we look at the usual noise range for mathcomp wall clock time, I think it is even justified to say that there is a regression, albeit only a small one in wall clock time. It's not entirely unusual to see changes that increase instruction count but don't increase wall clock time as much, but I almost never see increases like that which do not also lead to at least some actual slowdown. I don't have as much experience with changes to Coq internals so the picture might be a bit different here. In my work (even in Coq plugins) I basically see a 1-1 correspondence between the two measurements. The only discontinuity I have seen in the last years was when we switched from OCaml 4.07 to OCaml 4.14, where the new GC strategy really changed the picture drastically. But after that the two measurements were once again aligned pretty much perfectly.
A different perspective, although one that certainly needs more data to support it, is that it took me only half an hour to identify two candidates for the regression I was seeing, just by looking at the perf profile, finding MRs that sounded vaguely related to the functions that showed up in the profile, and then filtering for MRs that had regressions in cpu instruction counts in their benchmark results.
So let's be precise, what precisely do you think should have worried us in those results?
These are the results that would worry me: image.png
Sure, I mean which figure(s) and why?
I tried to highlight the relevant entries in the cpu instruction column.
Sorry, did not see them at first.
I probably should have included coq-corn as well
It's easy to miss entries with that many numbers that all look very similar
Note that some of the figures you highlighted on instructions are negative in cycles. Seems pretty hard to distinguish from noise.
Cycles are just wall clock time in disguise and they are equally noisy
Have a look at the (accidental) noise measurements in https://github.com/coq/coq/pull/16609. All the benchmarks that are X against itself produced very good data for estimating noise.
coq-iris-examples, for example, easily varies by about +/- 1% in wall clock time. So -0.7% in wall clock time but +0.3% in instruction count is a perfectly valid measurement for a 0.3% overall performance regression. It doesn't have to be, but it is possible.
Sure, but again, it is a new feature; we can't expect it to have zero cost. Of course we need to measure the benefits vs. the extra cost. This is hard since the extra cost has a very rough measure. To me a perf regression is "this used to work, now it does not". The one you reported is real, no doubt. But we can't possibly extrapolate that 0.3% means "someone will be stuck".
I'm still unconvinced that the global CPU instruction count, even with close to 0% noise, is a good predictor of potential regressions. It is often the case for algorithmic PRs that they add a small constant overhead while drastically enhancing the complexity class of some subcomponents.
(I concur with @Enrico Tassi above, basically.)
I think we are mostly in agreement about all of that. New features cost instructions and that cost, outside of catastrophic complexity problems, is probably not going to block anyone. My worry is solely that regressions are missed because they are misdiagnosed as noise.
Thanks for the explanation, I think I get your point (noise on CPU instr. indeed seems well below 0.5 %).
To supplement Enrico's comment, note that indeed there is currently a small regression on mathcomp, but other experiments seem to indicate that once we are able to use the feature (i.e. when mathcomp requires Coq >= 8.16), we could expect a small speedup.
I am 100% not advocating for blocking MRs because they introduce a regression in cpu instructions. I just want those to be noticed. Maybe the changelog could even say "Feature X has been added. Note that we expect a very small performance regression in projects that use ssreflect because of this new feature".
(I know it's not ssreflect, it's evarconv, so that message wouldn't be correct.)
I realize that my perspective on this is a bit unique. I don't think other people track the performance of their proofs as closely. But I have found this to be the only way to stay (somewhat) sane when dealing with a code base in which the proofs change, the automation changes, the Coq versions change, and even OCaml versions change every once in a while. It is very easy to lose track of performance regressions if I don't benchmark defensively. Finding out about regressions after the fact can lead to almost impossible debugging problems. Coq itself suffers from this on a much smaller scale with the ongoing changes to build system infrastructure. I ended up bisecting the regression somewhat manually (i.e. not with my usual git bisect run approach) because of the dune issues that were also discussed here.
So knowing what might introduce regressions is at least half the battle
regardless of the discussion about noise measurement, there are clear missing optimizations in Evarsolve with the new evar representation
I believe (wrongly?) that solving evar-evar problems with trivial instances should always be immediate
I've had the same thought around trivial instantiations a long time ago when I noticed that Mtac2 was suffering immensely from trivial unification problems because it uses evars for pattern variables and unification for matching patterns against terms. Knowing nothing about unification, I opted to work around the problem by coming up with non-unification-based, yet still sound, pattern matching alternatives. But the complexity of those is only barely worth the performance benefit, so I would very much welcome faster evar instantiation.
I'm looking at the code right now, trying to grasp the behaviour of the algorithm on trivial instances
due to aliasing (ahem) we may be doing non-trivial things on trivial instances it seems
Pierre-Marie Pédrot said:
I believe (wrongly?) that solving evar-evar problems with trivial instances should always be immediate
yes (rightly)
well then it's not
we're leisurely doing a little tour of random aliasing computation and stuff in probably worse than O(n)
also stare at candidates
Wanted: a specification
I don't think it is necessary, since once you unify ?1 with ?2 (trivially) then you do ?2 = t and there you do all the crazy aliasing stuff
?1[x,z] = ?2[x,y] (for y := z) is not a trivial evar-evar problem
I mean, I would not consider it so, hence the aliasing business is not needed in the fast code path
that's not obvious just reading the code
I'd be happy to have this code written in Coq just to evaluate it partially
coq_of_ocaml ;-)
that would be too easy, this stuff is using mutability internally
I confirm that hacking the evar-evar case to return immediately on trivial instances solves the issue at hand
my code is probably slightly unsound though, because we could have the same default instance coding for two expanded instances when filters are involved
https://github.com/coq/coq/pull/16713
I think this variant is sound