Finally I also came across a case of slow QED I can't easily handle. I am verifying with VST a few C functions which essentially compute an approximation of a function over different input ranges, with larger ranges requiring more complex implementations. The functions are all plain forward processing with calls to abstractions of machine intrinsics: no loops, no ifs, no complicated separation logic, no complicated dependent types. Everything which is not just pure C forward-processing logic I have already moved into separate lemmas.
My timings are:

simple:  assignments = 20, calls = 8,  QED = 9s
medium:  assignments = 34, calls = 13, QED = 170s
complex: assignments = 48, calls = 19, QED > 11h
The time it takes to construct the proof term is fairly proportional to the C complexity, but for QED I get more than 20x the time for less than 2x the complexity - maybe this is 5th order or more (I need to wait for the third QED to finish ...). I have done much more complex stuff which was pretty instant in QED, so I am a bit at a loss to understand where this highly non-proportional increase in QED time comes from. As I said, everything which is not just plain forward C processing is already in separate lemmas.
The resulting terms get a bit lengthy, and in the end I rewrite with a separate lemma which proves the equality between the computed term and the term from the spec. The size of these terms is above average but not that bad (say half an A4 page printed). Could this equality be the issue? As I said, the equality itself is proven in a separate lemma and I just apply it.
What are the strategies to debug this? Does it make sense to dump the proof term before QED and gather statistics on it?
Which processes in type checking the proof term are strongly non-proportional, and which proof constructs lead to them?
The most common functions in a few 1000 stack dumps are:
Total number in stack (recursive counted multiple, when >=5):
44959 camlReduction__ccnv_5036 (in coqc) + 266 [0x1102ea2ca]
24817 camlCArray__fold_1302 (in coqc) + 181 [0x110386805]
20351 camlReduction__eqappr_5037 (in coqc) + 2324 [0x1102eabf4]
19899 camlReduction__cmp_rec_5570 (in coqc) + 58 [0x1102edcaa]
19640 camlReduction__cmp_rec_5570 (in coqc) + 762 [0x1102edf6a]
19634 camlCArray__fold_1331 (in coqc) + 249 [0x110386d09]
16527 camlTypeops__execute_5551 (in coqc) + 1540 [0x11032b534]
2072 camlTypeops__execute_5551 (in coqc) + 490 [0x11032b11a]
1396 camlTypeops__fun_7619 (in coqc) + 29 [0x11032c69d]
1395 camlCArray__map_i_2680 (in coqc) + 192 [0x1103890f0]
1394 camlTypeops__execute_array_5554 (in coqc) + 163 [0x11032c633]
1393 camlTypeops__execute_5551 (in coqc) + 1769 [0x11032b619]
812 camlCClosure__whd_stack_7172 (in coqc) + 316 [0x1102e59dc]
695 camlReduction__eqappr_5037 (in coqc) + 181 [0x1102ea395]
615 camlReduction__eqappr_5037 (in coqc) + 236 [0x1102ea3cc]
365 camlCClosure__whd_stack_7172 (in coqc) + 295 [0x1102e59c7]
328 camlTypeops__execute_5551 (in coqc) + 1140 [0x11032b3a4]
262 camlReduction__cmp_rec_5570 (in coqc) + 686 [0x1102edf1e]
183 camlReduction__eqappr_5037 (in coqc) + 8125 [0x1102ec29d]
171 camlCClosure__knit_6935 (in coqc) + 27 [0x1102e432b]
145 camlCClosure__knr_6933 (in coqc) + 1600 [0x1102e3a20]
143 camlCClosure__drop_parameters_6548 (in coqc) + 78 [0x1102e00de]
136 camlCClosure__knh_6861 (in coqc) + 0 [0x1102e2770]
131 camlCClosure__knht_6862 (in coqc) + 748 [0x1102e2dfc]
131 camlCClosure__knr_6933 (in coqc) + 1403 [0x1102e395b]
129 camlCClosure__knr_6933 (in coqc) + 0 [0x1102e33e0]
124 camlCClosure__knr_6933 (in coqc) + 1961 [0x1102e3b89]
118 camlReduction__eqappr_5037 (in coqc) + 0 [0x1102ea2e0]
111 camlCClosure__contract_fix_vect_6606 (in coqc) + 617 [0x1102e0619]
105 caml_make_vect (in coqc) + 153 [0x1104360b9] array.c:310
103 caml_page_table_lookup (in coqc) + 0 [0x11042f670] memory.c:95
This is a clear case of conversion going wrong.
It's quite hard to debug, but you can try to wrap your subproofs into opaque definitions if you're courageous enough.
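(For illustration, a minimal sketch of that advice, with a made-up lemma name and a trivial statement: the heavy fact is proved once in a Qed-ended lemma and only applied afterwards, so the kernel never re-examines its proof.)

Require Import ZArith.

Lemma heavy_fact : (2 ^ 64 = 18446744073709551616)%Z.
Proof. reflexivity. Qed.  (* opaque: the kernel will not unfold this proof later *)

Goal (2 ^ 64 = 18446744073709551616)%Z.
Proof. exact heavy_fact. Qed.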
I am still lacking a bit of CS jargon (physicist ...). Is conversion checking what change, cbv and/or simplification do in the proof script?
When there is some lambda-term t of type A which is plugged in a place where a term of type B is expected, a convertibility check is performed to verify that types A and B are compatible. Tactics change, cbv, etc., replace the current goal with a compatible one, so the kernel will indeed need to check for convertibility at Qed time. But that is not specific to these tactics. Basically, everything boils down to a convertibility check at Qed time: reflexivity, rewrite, apply, etc. (That is why Coq's kernel is believed to be a small piece of code, as it is supposed to implement just a convertibility checker.)
Is conversion checking what change, ...
Yes, but the kernel faces slightly different conversion problems and has slightly different heuristics than the upper layers.
We had to fight this many times in mathcomp, but it is still a black magic thing. If the files are public, I can give it a look. Otherwise, calling abstract here and there to see how the timing changes may be a way to narrow down the problem.
One observation regarding this: my terms are all about Z and a mix of concrete large numbers (up to 2^64) and variables. As is known, the Prop comparisons on Z compute, e.g.:
Z.le = fun x y : Z => (x ?= y) <> Gt
E.g. z<=100 computes (with cbv) to:
match z with
| 100 => Eq
| 0 | 95 | 63 | 79 | 47 | 31 | 87 | 55 | 71 | 39 | 23 | 15 |
91 | 59 | 75 | 43 | 27 | 83 | 51 | 99 | 67 | 35 | 19 | 11 |
7 | 93 | 61 | 77 | 45 | 29 | 85 | 53 | 69 | 37 | 21 | 13 |
89 | 57 | 73 | 41 | 25 | 81 | 49 | 97 | 65 | 33 | 17 | 9 |
5 | 3 | 94 | 62 | 78 | 46 | 30 | 86 | 54 | 70 | 38 | 22 |
14 | 90 | 58 | 74 | 42 | 26 | 82 | 50 | 98 | 66 | 34 | 18 |
10 | 6 | 92 | 60 | 76 | 44 | 28 | 84 | 52 | 68 | 36 | 20 |
12 | 88 | 56 | 72 | 40 | 24 | 80 | 48 | 96 | 64 | 32 | 16 |
8 | 4 | 2 | 1 | Z.neg _ => Lt
| _ => Gt
end = Gt -> False
If one has 2^64 on the right-hand side, this is game over. So essentially I cannot ever call cbv except with very careful delta specs. For anything which involves simplifying Z arithmetic, I mirror the computation e.g. with Ltac2 and change the result (the change step is a bit hard in Ltac2 ...). cbn/simpl sometimes work, but tend to take a long time.
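(As an aside, a hedged sketch of such a careful delta spec, not taken from the actual development: with cbv restricted to an explicit list of constants, the power is computed but Z.le itself is never unfolded into the huge match shown above.)

Require Import ZArith.

Goal forall z : Z, (z <= 2 ^ 8)%Z.
Proof.
  intros z.
  (* only the listed constants may be unfolded; Z.le stays folded *)
  cbv [Z.pow Z.pow_pos Pos.iter Z.mul Pos.mul].
  (* the goal is now (z <= 256)%Z *)
Abort.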
Up to now I was assuming that what QED does is not affected by this. In the end "+" also expands if the left side is a literal, although I am not sure the internal term size grows with the size of the literal. The printed representation of say 2^n+x grows exponentially with n, but I think it has a lot of reuse of nested terms, unlike what happens with "z<=2^n", which expands to a match over all numbers which are less than 2^n.
So might the presence of z <= large literal in the terms be a problem?
Btw.: the growth with C code statement count is about s^10.
The root of all evil, as shown in your trace, is handling application: when the system faces f x y z = f x' y' z', it tries to compare arguments pairwise, e.g. x = x', ..., and if one fails it unfolds f and retries. It also tries to record the reduced forms, to avoid reducing the same terms again, but if the term is large, this may be counterproductive.
a mix of concrete large numbers (up to 2^64) and variables
well, this is exactly where the thing may derail
a "quick" test would be to "lock" the operator that computes. One way is to define a module and module type which expose a (body less) symbol and an equation revealing the real body. You may need to use the equation from time to time, but the kernel will not use it autonomously.
The first example which comes to mind: https://github.com/math-comp/math-comp/blob/8e859e628404c50c8191f30489abf4e799f46f8d/mathcomp/ssreflect/bigop.v#L534-L542
Indeed, in mathcomp it does not compute (even though it is just a fold over a list) because of this locking. One has to unlock it explicitly (we have a generic mechanism/lemmas to do that; see the canonical structure declaration just after the locking module).
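(For readers unfamiliar with the pattern, a minimal sketch of such a locking module; the names LockedAdd/locked_add/locked_addE are made up for this illustration, and mathcomp's actual machinery in bigop.v is more elaborate.)

Module Type LockedAddSig.
  Parameter locked_add : nat -> nat -> nat.
  Axiom locked_addE : locked_add = Nat.add.
End LockedAddSig.

Module LockedAdd : LockedAddSig.
  Definition locked_add := Nat.add.
  Lemma locked_addE : locked_add = Nat.add. Proof. reflexivity. Qed.
End LockedAdd.

(* LockedAdd.locked_add 2 2 does not reduce to 4 by cbv: the module
   ascription hides the body, so one must rewrite with LockedAdd.locked_addE. *)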
The most astonishing side of it is that a really super messy equality proof between two largish Z terms (converting 32-bit modulo arithmetic to Z arithmetic), involving about 30 Coq Interval and 25 Gappa proofs, type checks all in one go in 2 seconds. So Master Guillaume: teach us!
The C proof doesn't really do any manipulations of the Z terms, it just sticks terms together. So the really super hard work takes 2s to type check and the easy part doesn't finish in 11 hours.
If we are right in the diagnosis, then it is really a heuristic which derails; it can become exponential for no good reason.
I think the tricky part is that Z is also used in the internal logic of VST, e.g. for manipulating symbol -> value mappings.
I never expected z <= N to reduce to a term whose size grows as N. That is mindblowing.
So, if Coq needs to check whether z <= N and z <= M are convertible and it unfolds <= too early for some reason, then this might blow up.
Let me think what I can do. The issue is that the range proofs I do to remove all the mod 2^32 one gets from C arithmetic only work with the concrete numbers. I wonder if I could abstract all literal numbers in such a way that I can still apply the equality proof for the concrete numbers.
You can try playing with the Strategy command. For example, if Coq is unfolding <= or < too early, you can reduce their priority with Strategy 1000 [Z.le Z.lt].
One thought: I try to make my proofs modular. I first have a proof that a C function computes some specific (possibly largish) term in Z. This term is part of the spec of the C function and thus in the type of the proof goal. I then later prove that this term has certain properties, say that it approximates some real function to a certain precision.
I wonder if it would help to prove a more abstract goal right away, which would not involve any largish Z term, but instead some real-number stuff.
For example, the Gappa plugin contains Strategy 1000 [Generic_fmt.round], because Coq should only unfold round if it gets desperate.
You can try playing with the Strategy command. For example, if Coq is unfolding <= or < too early, you can reduce their priority with Strategy 1000 [Z.le Z.lt].
Is this something the kernel takes into account during QED?
All the complicated Z stuff I put into a separate lemma. So in essence the C proof is just a forward recording of what the C code computes, then I apply an equality lemma to the specified Z term and QED.
Yes, it is taken into account by the kernel. When it needs to check that f x and g y are convertible, it will choose which of f or g to unfold depending on the level passed to Strategy. (I do not remember how the default levels are computed, but the higher the level, the less likely it is to be unfolded, if memory serves me right.)
That said, your idea of abstracting away all the 2^32 constants is a good one.
By default it unfolds on the right; there is no precomputation of priorities based on, say, logical dependencies.
Wasn't there a precomputation based on arity or something?
Yes, it is taken into account by the kernel.
Interesting - I didn't expect this.
That said, your idea of abstracting away all the 2^32 constants is a good one.
So in essence I could abstract all literals and hide them behind some opaque identity function with a very high strategy value?
No need for Strategy in that case. Just using opaque terms might be sufficient to steer the kernel away from looking at them.
Yes but beware that Strategy or Opaque are hints, the kernel may unfold anyway. The module opacity I talked about is not a hint, it is final.
No need for Strategy in that case. Just using opaque terms might be sufficient to steer the kernel away from looking at them.
I see. So I abstract all numbers as soon as they appear (say, when they move from C statements into the variable context) - or possibly even already in the C code before I start to look at it. Then I apply the equality lemma, which unfolds the definitions since there is no other way, but it wouldn't hurt then since it is a 1:1 correspondence of terms. And then I QED?
Yes but beware that Strategy or Opaque are hints, the kernel may unfold anyway. The module opacity I talked about is not a hint, it is final.
Yes, this might also work - completely abstract away all operations instead of the numbers. Then I can convert the terms with equality lemmas.
Another question: does it make a difference for the kernel if I rewrite with an equality compared to converting the goal such that it becomes the equality I have and applying it? Imagine I have f(x) = f(y) and a proof of x=y. I could rewrite with the equality, or I could apply f_equal and then the equality. Does this make a difference for the workload during QED?
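(To make the two alternatives concrete, a minimal sketch with hypothetical f, x, y.)

Section TwoWays.
  Variable f : nat -> nat.
  Variables x y : nat.
  Hypothesis Hxy : x = y.

  (* variant 1: rewrite with the equality *)
  Goal f x = f y.
  Proof. rewrite Hxy. reflexivity. Qed.

  (* variant 2: apply f_equal, then the equality *)
  Goal f x = f y.
  Proof. apply f_equal. exact Hxy. Qed.
End TwoWays.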
One more piece of symptomatic information: the stack depth is typically around 500 and it does go up from time to time to level 10 or so.
Michael Soegtrop said:
Another question: does it make a difference for the kernel if I rewrite with an equality compared to converting the goal such that it becomes the equality I have and applying it? Imagine I have f(x) = f(y) and a proof of x=y. I could rewrite with the equality, or I could apply f_equal and then the equality. Does this make a difference for the workload during QED?
Not really.
But beware that when we say "convert" we mean by change/simpl, not by apply/rewrite.
But beware that when we say "convert" we mean by change/simpl, not by apply/rewrite.
Yes, but applying lemmas or rewriting in the end also requires a conversion to check it. I can imagine that this should not be hard for exact, but I am not so sure about rewrite in largish terms.
Anyway, I will try the things we discussed above (first abstracting the numbers (should be easy), then abstracting the operations) and see if this helps.
I will also think about how I can get towards a reflective proof. If I understand this right, the idea behind Guillaume's tools and also the core idea of ssreflect is to apply a lemma which converts the goal into something which completely computes, e.g. to true=true, so that the kernel's job is straightforward computation, which doesn't need any heuristics that could possibly go wrong.
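(For illustration, a minimal proof-by-reflection sketch along these lines; leb_sound is a made-up name and this is of course not the actual development: a boolean checker plus a soundness lemma, so that the kernel's work at Qed time is plain computation of the checker.)

Require Import Arith.

Lemma leb_sound : forall n m, Nat.leb n m = true -> n <= m.
Proof. intros n m H. exact (proj1 (Nat.leb_le n m) H). Qed.

Goal 3 <= 1000.
Proof. apply leb_sound. reflexivity. (* the remaining goal is closed by computation *) Qed.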
To be clear, there is still a heuristic and it can still go awfully wrong. But since the heuristic is "reduce arguments before unfolding functions", it can be implemented in a very efficient way (that is the main motivation) and it is predictable (that is a bonus).
@Michael Soegtrop You describe precisely what reflection is, and cbv/vm_compute/native_compute do run fast!
To be precise, small-scale reflection is (also) used on open terms, e.g. proving "(S x <= S y) = true" by applying H : "(x <= y) = true". Note that in this case the LHS does not compute to true, since it contains variables. These are still proof steps taken care of by computation, but typically the length of the computation is very different.
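(A tiny sketch of that open-term flavour, using Nat.leb as the boolean comparison, again purely for illustration: one computation step exposes the hypothesis, but the term never reduces to a closed true.)

Goal forall x y, Nat.leb x y = true -> Nat.leb (S x) (S y) = true.
Proof.
  intros x y H.
  simpl.   (* Nat.leb (S x) (S y) reduces to Nat.leb x y *)
  exact H.
Qed.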
Regarding Z.le, it boils down to the following testcase:
Time Check (fun z => fun H : Pos.compare_cont Eq z 1000000 = Gt =>
H : Pos.compare_cont Eq (xO z) 2000000 = Gt).
(* 6 seconds *)
I guess Coq does not know if it should start by reducing the first or second occurrence of compare_cont, chooses the first instead of the second, and then explodes.
That said, this is a bit strange, because after a step of reducing the first occurrence, it should switch to the second one, since the first one now starts with a blocked match. There is no reason for Coq to keep reducing the first occurrence.
So, there might be an actual bug in Coq.
remark: the kernel compares applications right-to-left
Here the order does not matter because the arguments are always in normal form, so their comparison is immediate, unless I am missing something.
Require Import ZArith.
Axiom z : positive.
Goal True.
Time unify
(Pos.compare_cont Eq z 1000000)
((fix compare_cont (r : comparison) (x y : positive) {struct y} : comparison :=
match x with
| (p~1)%positive =>
match y with
| (q~1)%positive => compare_cont r p q
| (q~0)%positive => compare_cont Gt p q
| 1%positive => Gt
end
| (p~0)%positive =>
match y with
| (q~1)%positive => compare_cont Lt p q
| (q~0)%positive => compare_cont r p q
| 1%positive => Gt
end
| 1%positive => match y with
| 1%positive => r
| _ => Lt
end
end)
Eq z 1000000%positive).
This is what you get after the unfold you expect, so we should look at what the conversion does in the case Const | args vs Fix | args.
I guess it does not do the pairwise comparison which would be immediate...
Enrico Tassi said:
Note that in this case the LHS does not compute to true, since it contains variables.
But isn't it so that the term one has at the point of Qed fully computes to e.g. true=true? It is clear that intermediate terms one has in the construction of this final term are usually open.
Michael Soegtrop said:
But isn't it so that the term one has at the point of Qed fully computes to e.g. true=true? It is clear that intermediate terms one has in the construction of this final term are usually open.
No, not for proofs by reflection. When we get to the goal foo = true, foo is usually already closed. Indeed, if there are still some variables at this point, it means that they do not participate in the computation.
You may get this incorrect intuition from the fact that many theorems about type theory say that in an empty context ... For example a term of type nat is either O or S, but only in an empty context.
I meant the process of constructing the term foo and relating it to whatever goal one has. During this process one sees all kinds of terms.
Enrico Tassi said:
I guess it does not do the pairwise comparison which would be immediate...
That is interesting. Also, I am a bit surprised that the code does not unfold the definitions on both sides if they are the same symbol. Keeping one folded is just delaying the inevitable.
Now I don't have time to dig, but I guess it pushes the fix on the stack, puts the literal in head, then fires an iota, obtaining a stuck match on z, then pushes the match on the stack and leaves z in head. Then it must go left and obtain the same thing. So my memory tricks me, because something more costly must happen in between.
And we should not forget that it has to compare the unfolded fix with y substituted in, maybe it is then that something goes wrong
given that y is concrete, it can always unfold the fix, even if x (z) is stuck
what is hard to see is why the two terms don't "align" sooner than later
If Coq chooses to reduce the fix instead of unfolding the symbol, they can never align until the fix is fully expanded. At that point, Coq has created a huge match on x and it has to check for convertibility on all the branches, hence the exponential behavior. That is my current theory.
My recollection is that there are checks to (quickly) compare the stacks (the context under which the comparison is happening), but they just say "can't be equal", while here one would like to see "maybe better to reduce the other side, since here we already have so many matches!"
Hum, no, that can't work here... sorry for the brainstorming.
I tried modifying the conversion checker so that it unfolds on both sides when the symbol is the same. Unfortunately, Int63.v no longer goes through. The calls to auto with zarith no longer terminate then. That is strange.
In this case the good heuristic seems to be to consider both x and y recargs (which would weaken the conversion relation, but not that much)
(That does not fix the issue anyway.)
Int63 is extremely sensitive to unfolding heuristics.
I can't count the number of hours I have lost trying to tweak the conversion algorithm, just to face an explosion in Int63.
For good or bad reasons?
we can always rewrite that file, if it is just that file...
There is essentially a fixpoint that unfolds to a 2^32 case analysis.
I believe this can be worked around, but I'd be concerned about end-users facing the same kind of issues.
(and thus breaking backwards compat)
I think it would be much better if "a fixpoint that unfolds to a 2^32 case analysis" would never work, and force users to find a better way. What is a pain is that sometimes it works.
But I don't understand why unfolding the same head symbol on both sides could cause such a divergence.
Because 2^32 is huge?
@Enrico Tassi if you want to implement a workaround, I think you can try changing the L2R comparison order and compiling Int63.
Loop guaranteed, 100% satisfaction.
What I mean is that I modified the checker so that, when it encounters f x = f y, it unfolds f on both sides rather than on one side only. How could unfolding only one f succeed?
if you're in a partial fixpoint unfolding, I believe.
Imagine you have f := fix F := ..., and the LHS is the one-step unfolding of the RHS.
but unfolding the LHS leads to a huge fixpoint.
I can probably cookup an artificial example.
No, I am talking about the constant case. Both sides have to be unfolded, otherwise they can never match (right?).
@Guillaume Melquiond it eventually also unfolds the second f
I think your code is correct, but it changes the order in which things happen
so it may just be slower
Pierre-Marie Pédrot said:
but unfolding the LHS leads to a huge fixpoint.
Okay, I see. So, the fact that it currently works is completely improbable, because, if Coq ever decides to unfold the LHS before the RHS has reached the exact term of the unfolded LHS, then an exponential blowup occurs.
"Le salaire de la peur."
That said, I still do not understand why Enrico's example does not get through instantly. As far as I understand, the algorithm first puts the terms into a very weak hnf, which should give something like compare_cont ... vs z | Case .... At this point, Coq unfolds the head of the LHS and puts it into hnf, i.e., z | Case .... The heads match each other, so Coq moves to checking the convertibility of the branches. But they should be syntactically equal at this point, so I do not understand where Coq is wasting time.
there's no syntactically equal fast path once you're reducing
Oh, right! We are no longer encountering a single symbol application, so there is nothing to prematurely stop the full hnf.
Maybe "maybe better to reduce the other side, since here we already have so many matches!" was not so silly after all
once you have an irreducible match in your context, you may start reducing the other side until you find a match
Using the type pos_bool = 1 + 2 * pos_bool instead of positive = 1 + positive + positive fixes the exponential blowup, since it reduces the number of recursive branches in compare_cont from 2 to 1. Unfortunately, that is not a sustainable solution, since the conversion from positive to pos_bool is costly, allocation-wise.
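(For concreteness, such a pos_bool type could be declared as below; the constructor names are made up. A single recursive constructor carries the low bit, instead of positive's two recursive constructors.)

Inductive pos_bool : Set :=
  | pb_one : pos_bool                          (* the number 1 *)
  | pb_digit : bool -> pos_bool -> pos_bool.   (* append a low bit *)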
Enrico Tassi said:
once you have an irreducible match in your context, you may start reducing the other side until you find a match
That is what happens, somehow, except that you now have two branches to compare, which in turn means that you have four branches to compare, and so on.
oh, I see, you are off by one match anyway so you have to continue
I have one more question for debugging this: does abstract do an immediate Qed - that is, is it technically the same as proving a lemma outside of the current lemma, Qed-ing it, and applying it in the current lemma?
Not quite, but as a first approximation it's reasonable to believe that.
There is complex machinery in the kernel to make a best effort to behave that way.
Thanks!
Pierre-Marie Pédrot said:
There is complex machinery in the kernel to make a best effort to behave that way.
Last time I looked at "abstracted" terms, I thought I was going crazy. It took me a while to understand that abstracting was no longer creating an intermediate lemma unless the proof is terminated by Defined.
Indeed, but the kernel skips the type-checking of the corresponding subproof, so it behaves the same in terms of conversion problems and the like.
Indeed, but the kernel skips the type-checking of the corresponding subproof, so it behaves the same in terms of conversion problems and the like.
Now I am confused. Does it type check the sub proof right away or not?
Yes, it does.
Pierre-Marie's sentence should be understood as: The kernel will encounter the exact same conversion checks as if the subproof had been put in a separate lemma.
(Even if it is actually put in a separate lemma only if Defined is used.)
So, 1. the abstracted subproof is immediately checked, 2. the wrapping proof is checked at Qed time, 3. the abstracted subproof is inserted in the wrapping proof but the resulting proof is not checked at Qed time.
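(A tiny generic usage sketch of abstract, not from the development at hand: the bracketed subproof is checked as soon as the tactic runs, and at Qed time only the surrounding glue is re-checked.)

Goal 2 + 2 = 4 /\ True.
Proof.
  split.
  - abstract reflexivity.  (* this subproof is type checked immediately *)
  - exact I.
Qed.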
Thanks Guillaume!
For debugging this, I try to define parts of a proof as axiom. For this I need to cut and paste terms, which doesn't work in the presence of printing only notations.
There was this PR merged in 2017 (https://github.com/coq/coq/pull/703) to do this. Did a similar feature remain? Close Scope does not seem to have the effect of disabling notations for printing.
Switching off all notations is not much of an option in presence of large numbers.
Coming to this discussion late, but in my experience (as confirmed here), slow Qed is almost always either (1) conversion not recorded in the proof trace (using simpl/cbn, vm_compute/native_compute in the hypotheses rather than the goal, change, or more rarely using cbv or lazy with a blacklist/whitelist), (2) issues with too many universes, or (3) the guard checker being slow. The "quick" fix, once you've tracked down the offending simpl (or whatever) (which you can do by adding Axiom admit : forall {T}, T and then trying all: exact admit. Unshelve. all: exact admit. Qed. at various points in the proof, and seeing when it first gets slow), is to replace the call to a conversion tactic like simpl with match goal with |- ?G => let G' := (eval simpl in G) in replace G with G' by reflexivity end (replacing reflexivity with abstract reflexivity if you want).
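(A hedged, self-contained instance of that trick on a toy goal, so the shape is visible; the goal itself is made up: the simpl step is recorded via replace ... by reflexivity instead of happening as a silent conversion.)

Goal 2 ^ 10 = 1024.
Proof.
  match goal with
  | |- ?G => let G' := (eval simpl in G) in replace G with G' by reflexivity
  end.
  (* the goal is now 1024 = 1024 and the conversion step is recorded *)
  reflexivity.
Qed.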
(Is it worth sticking this explanation on some wiki somewhere?)
@Jason Gross I believe so
@Pierre-Marie Pédrot Do you have a suggestion of where to put it? Is there some wiki of "Coq Debugging Tips" or "My Coq is Too Slow, What Should I Do?"?
somewhere around https://github.com/coq/coq/wiki/Performance probably?
there is https://github.com/coq/coq/wiki/Troubleshooting also which mentions slow Qed
@Jason Gross : thanks a lot!
So if I got you right, any sort of conversion with change, cbv, cbn, ... in goals is OK, but in hypotheses it might be harmful? Does this also apply to the value part of local definitions or only to the type part?
E.g. the VST automation abstracts certain parts of the goal into local definitions, mostly to keep the goal readable, and then does quite a bit of conversion in these local definitions, e.g. symbol -> value mapping management, but as far as I can tell mostly in the values of the definitions and not in the types. Could this possibly be harmful?
Another thought: afaik a change is usually more efficient in the goal than a rewrite. Does this also apply to hypotheses, or could the type checker make use of rewrite equalities to steer the process? In the end an equation would be more rigid than a conversion, even if the left and right hand sides are convertible. So might a "rewrite x with y in H by cbv" be better than a "change x with y in H"?
So if I got you right, any sort of conversion with change, cbv, cbn, ... in goals is OK, but in hypotheses it might be harmful?
Not quite. vm_compute and native_compute in goals are okay. In hypotheses it is almost always wrong to use vm_compute and native_compute. cbv and lazy are fine (in all contexts) as long as you issue the right Strategy commands to cause kernel conversion to unfold your whitelists before anything else, and to unfold your blacklists after anything else. Hope will be lost if you use contradictory whitelists/blacklists in the same proof, even if you use abstract (at least if you want coqchk to handle your vo file; the checker sees Strategy as being per-top-level proof). Finding the right strategy commands for simpl/cbn is a black art I have not even begun to master, so I suggest avoiding them entirely when performance matters. All conversion tactics are fine if the term you're converting is small enough.
E.g. the VST automation abstracts certain parts of the goal into local definitions, mostly to keep the goal readable, and then does quite a bit of conversion in these local definitions, e.g. symbol -> value mapping management, but as far as I can tell mostly in the values of the definitions and not in the types. Could this possibly be harmful?
Yes, it's risky, though in general the overhead comes from unfolding recursive definitions in the wrong order, so as long as you don't have deep recursion in any of the definitions here (i.e., all of your maps are small), it should be fine.
Jason Gross said:
In hypotheses it is almost always wrong to use vm_compute and native_compute. cbv and lazy are fine (in all contexts)
I would say it differently. It is always (!) wrong to use vm_compute and native_compute in the context, and this is noticeable. Similarly, using cbv and lazy in the context is just as wrong (for the same reason: no transtyping ascriptions are put in the proof term for operations on the context), irrespective of Strategy. But more often than not, the issue is not noticeable.
Another thought: afaik a change is usually more efficient in the goal than a rewrite. Does this also apply to hypotheses, or could the type checker make use of rewrite equalities to steer the process? In the end an equation would be more rigid than a conversion, even if the left and right hand sides are convertible. So might a "rewrite x with y in H by cbv" be better than a "change x with y in H"?
It applies to hypotheses as well.
change is more efficient, until it slows down your Qed. At that point, you can either find out the right sequence of transformations that the kernel will compute efficiently, doing refine (_ : new_goal_type) repeatedly to leave evidence in the proof term, or else you can give up and use the sledgehammer of rewrite. Note, btw, that the kernel can be sensitive to which term is on the left and which is on the right, so if you're dabbling in this arcane art, the first trick to learn, IIRC, is that if your hypothesis has type G and you're converting it to be a thing of type G', then refine ((fun v : G' => (v : G)) _) and refine ((fun (f : G' -> G') => (f : G -> G')) (fun x => x) _) can have drastically different performances because, IIRC, the first one will result in the conversion problem G' ≡ G while the second one will result in the conversion problem G ≡ G'. Example:
Fixpoint fact (n : nat) := match n with 0 => 1 | S n' => n * fact n' end.
Axiom P : nat -> Prop.
Check (fun (H : P (id fact 10))
=> (fun (f : P (fact 10) -> P (fact 10)) => (f : P (id fact 10) -> P (fact 10)))
(fun x => x)
H)
: P (id fact 10) -> P (fact 10). (* fast *)
Check (fun (H : P (id fact 10))
=> (H : P (fact 10)))
: P (id fact 10) -> P (fact 10). (* very slow *)
@Guillaume Melquiond
Similarly, using cbv and lazy in the context is just as wrong
You can drop "in the context" here. cbv and lazy do not insert transtyping ascriptions even when used in the goal.
Really? Then that is a bug, because that is the only reason why REVERTcast exists in Coq, as far as I know.
I just tested with Coq 8.9. The type ascriptions are there, fortunately.
Hm, how can you tell? cbv. Show Proof does not show any cast nodes in the proof term.
Did you try Set Printing All?
Ah, no, I was not aware that Set Printing All is needed, given that DEFAULTcast does not need Set Printing All and is printed identically to REVERTcast when Set Printing All is enabled.
Yes, having to disable notations just to see reverse casts is a bit painful.
It seems strange that change does not leave a REVERTcast.
Agreed.
I've written a fleshed out guide in https://github.com/coq/coq/wiki/Troubleshooting#what-can-i-do-when-qed-is-slow (comments, suggestions, and improvements welcome) and reported https://github.com/coq/coq/issues/13984 and https://github.com/coq/coq/issues/13985
@Guillaume Melquiond
But more often than not, the issue is not noticeable.
Is this so because vm_compute / native_compute usually do much more complicated conversions than cbv, so that it is harder for the kernel to type check later, or is this due to a technical difference?
The former, that is, if vm_compute is not performed at Qed time, you will notice it. But if cbv is done differently at Qed time, that is much harder to notice.
@Jason Gross : thanks for writing this! Very helpful, both the background and the concrete hints.
One note: the section (https://github.com/coq/coq/wiki/Troubleshooting#conversion-not-recorded-in-the-proof-trace) seems contradictory - I guess due to editing after the feedback from Guillaume. First you say that simpl & Co don't leave a cast and a few lines later you say they do.
@Michael Soegtrop Glad it's helpful! I've adjusted the text a bit, hopefully it's more clear now. In fact simpl & co do leave cast nodes, but, unlike vm and native, these nodes are not enough to reconstruct the exact reduction strategy used in the proof script, and hence the kernel will almost always take a different path. For lazy and cbv this is often not too bad, especially when augmented with Strategy commands, but for simpl and cbn this is really insufficient. Is this all clear from the text now?
@Jason Gross : yes, thanks! It is very clear now. Possibly the background knowledge you present should make its way into the reference manual.
fwiw I recently also had a problem of slow Qed time when defining a couple of large fixpoints (also when defining them non-interactively, where I'm not sure if we can call it Qed time anymore), which turned out to be neither conversion nor the guardedness checker, as a Time Guarded. just before Qed was very fast.
But it was because I had not indicated the structurally decreasing argument anywhere, which apparently takes a long time to infer: adding the annotation reduced the compilation time of a pathological Coq file from more than an hour to 4 seconds.
Although I would have expected that inferring the decreasing argument is also the guardedness checker's job, since from my understanding this argument is the object of interest that is checked for being guarded, right?
The wiki page linked by @Jason Gross seems to be publicly editable. I will then probably add that tidbit. And does anyone know how one could trigger and time the inference manually so that it's better to debug when that is the cause of a slowdown?
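(For reference, the fix being described is just the explicit {struct} annotation; a generic sketch with a made-up function, not the pathological file itself.)

Fixpoint count_down (fuel : nat) (acc : nat) {struct fuel} : nat :=
  match fuel with
  | 0 => acc
  | S fuel' => count_down fuel' (S acc)
  end.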
That's very interesting, and yes, please do add it to the section about fixpoints! Could you extract the example into a standalone file and maybe report it as a performance issue on the Coq bug tracker? I'm also curious whether or not extruding the body, as I suggest in the section about fixpoints, changes the timing. Also, does permuting the arguments change the time of the Guarded call? It seems plausible to me that Guarded picks a different order for guessing the structural argument than Qed. (Also, I'm guessing that there's some argument that if you declare that one to be structural, Coq takes a long time to figure out that it's not?)
it seems Guarded does no guessing, instead it just uses the last argument
Fixpoint F (x y:nat) : nat.
Proof.
destruct x as [|x].
- exact 0.
- exact (F x y).
Fail Guarded.
Show Proof.
(* (fix F (x y : nat) {struct y} : nat := match x with
| 0 => 0
| S x0 => F x0 y
end) *)
Qed.
IIRC we wanted to forbid tactic-mode Fixpoint without struct, which would ensure Guarded is only used when it works
but I don't know if any specific person has decided to make it happen
@Jason Gross @Gaëtan Gilbert : can you please summarise the status of change after the changes done since this has been discussed above? Is it safe (in terms of QED time) to use change now on the goal, or should I still use e.g. replace a with b by reflexivity instead? I guess any direct use of reduction tactics on the type of a hypothesis is still a reason for QED slowness.
Is it safe (in terms of QED time) to use change now on the goal
Was it unsafe at some point?
@Gaëtan Gilbert : well this is what the Coq Wiki (section written by @Jason Gross) says:
The tactics simpl, cbn, lazy, cbv, compute, hnf, unfold, and red (but not change, see Issue #13984) when used on the goal only will leave a special kind of cast in the proof term
@Gaëtan Gilbert : I saw your edit to the wiki. May I interpret this such that using change still does not leave a cast in the proof term, so that changing a term with a computed form of it in the goal can lead to QED slowness?
https://github.com/coq/coq/issues/13984 is still open but now there's an MR from Gaëtan https://github.com/coq/coq/pull/17271
yup
After some experiments I found that the QED slowness I am experiencing is due to cbv with an explicit delta list. If I set Strategy expand for exactly the same symbols as I have in the cbv delta list, the QED is fast. Note that using just Strategy without an explicit delta list does not work, since I would have to set strategy opaque for all other symbols.
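(A hedged sketch of that pairing, reusing the shape of the earlier cbv example; the concrete goal is made up: the Strategy expand declaration covers exactly the constants in the cbv delta list, so kernel conversion at QED time is steered towards the same unfoldings.)

Require Import ZArith.

(* Tell the kernel to expand exactly the constants that cbv will unfold. *)
Strategy expand [Z.pow Z.pow_pos Pos.iter Z.mul Pos.mul].

Goal forall z : Z, (z <= 2 ^ 8)%Z.
Proof.
  intros z.
  cbv [Z.pow Z.pow_pos Pos.iter Z.mul Pos.mul].
  (* goal: (z <= 256)%Z *)
Abort.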
I wonder what might be a reasonable way to handle this. I could use with_strategy around an abstract as recommended here - the only issue is that I frequently use computed delta lists, and as far as I can see neither Ltac2 nor Elpi has a with_strategy tactic which would take a symbol list.
I wonder if we should have an abstract tactic which takes a reduction strategy and converts this to strategy expand flags for the kernel.
isn't it redundant with with_strategy?
Not knowing how this is implemented, I am not sure what is efficient. Apparently I anyway need a redundancy between the delta lists I give to cbv and the strategy commands I give to the kernel. Since I have explicit delta lists, I don't need the strategy for cbv - I only need it for the kernel. So I thought it might be more efficient to give it to abstract.
I also wonder if we should have a strategy option for QED.
In the end I want to have an effective way to get cbv with explicit delta list through the kernel without having to specify my symbol lists several times.