We have a "Performance" project: https://github.com/coq/coq/projects/13 How many of these are still relevant? Should we create issues for them? cc @Pierre-Marie Pédrot
still relevant AFAICT
I've converted them to issues, could you elaborate somewhat further?
In case anyone is interested (@Pierre-Marie Pédrot?), the 10 Years of Superlinear Slowness talk I gave at the Coq Workshop is up on YouTube (slides.pptx, slides.pdf, abstract.pdf)
slides.pdf link is wrong.
Oops, thanks, I've edited the original message to fix
@Jason Gross : one question on this: a while back I switched to reflexive automation - that is I create an abstraction of the accepted class of goals
A, a function
I : A -> Prop which interprets the abstraction and gives the original goal, a function
C : A -> bool which computes if the goal is true and a lemma
CI : forall (a : A), C a = true -> I a. Then I compute
a with a tactic, apply the lemma
CI a and call vm_compute which results in
true = true.
The only thing which in my experience can possibly be slow here is the unification of
I a with the original goal - one has to design A and I carefully such that a well defined computation (cbv with explicit delta list) on I A matches the original goal exactly (or at least trivial to unify) - then this step is instant. With the reduction of
C a to
true I am usually more astonished how fast it is than the other way around - even with compute instead of vm_compute.
Besides being fast I made the experience that such automation is easier to maintain, since to a certain extent it is proven correct by lemma
CI - I usually don't prove the backwards direction
I a -> C a = true, though. It is work to adjust the lemma when the tactic is extended, but I spend much less time debugging the use of tactics and definitely didn't spend significant time optimizing the performance of tactics (~ 1%) since then.
I wonder what your opinion on this is.
(Note that computing
a with a tactic can also be slow if you're not careful.)
This is the approach that I call "throwing away the proof engine". This is fine when you are working in decidable domains and writing hammer tactics, but not very good if you want modular automation, in my experience. For example,
rewrite lem by tac is very hard to code in this style, not just because it's a lot of work to write a completely generic rewriter, but also because it's hard to interface between arbitrary classes of reflective tactics.
Rtac solves this problem to some extent, describing how to write a fully reflective tactic engine that allows interfacing multiple reflective tactics without having to re-reify in between, but AFAIK it doesn't actually solve the underlying issue of engineering a proof engine that's performant when you have an enormous number of small steps, i.e., things that are not hammer tactics fully solving everything in their scope
Can you elaborate on the issue with
rewrite lem by tac?-I do this all the time e.g. with
tac=lia but also with much more complex own tactics. I see the issues with
rewrite in general, but I don't see the issue with chaining rewrite with elaborate tactics.
rewrite produces goals and tactics dedicated to these kind of goal prove these. I can also put a wrapper tactic in between which analyses the goal and chooses a suitable tactic in case rewrite produces many goals of various kinds.
Please also note that if one does not prove the backward direction, one can use reflexive techniques also for heuristic tactics - If the checker function returns true fine, if it returns false it simply means
I can't prove this.
I sometimes combine reflexive tactics on the Gallina level with a record abstracting sub proof engines - that is in the abstraction
a:A I have a record with a
b:B and the
I functions for
B - the corresponding functions for
A just call these. I find this reasonably straight forward.
I guess your stuff is just more complicated than mine (I don't have terms with 1000 binders) but then I wonder how much your statements generalize to average Coq. What would help me (and possibly others) to understand this is a simple example where reflexive techniques fail.
setoid_rewrite is too slow and you want to write it reflectively (e.g., as it was for us in this talk (paper, slides.pptx, slides.pdf). Furthermore suppose that there are some lemmas whose side-conditions you want to solve with
lia, some that you want to solve with
nia, and others that you want to solve with
rtauto. Importantly, if one of these solvers fails to solve the side-condition, the
rewrite should not happen. This is mostly doable, except for the fact that
zify is not reflective and so you have to rewrite it reflectively (and also possibly augment your reflective rewrite tactic with support for managing the assumptions available in the context...). But if you want
tac to be some non-reflective tactic (smtcoq's hammer or something), you're sunk.
I'm not claiming that reflective tactics will ever fail to be a solution. I'm claiming that the limit of using reflective tactics for everything is that your reflective language becomes TemplateCoq, your denotation function becomes MetaCoq, and your reflective automation involves porting the entire tactic engine (evars, goal state management, backtracking, evar map instantiation by side effects, etc) to MetaCoq. Because you shouldn't have to redo your reflective
rewrite tactic (or
firstorder) every time you want to be able to handle another domain of side conditions. And by the time you get to an adequately general
rewrite tactic, you've recreated all the bottlenecks of the existing tactic engine (or you've managed to design and implement an adequately performant tactic engine --- in which case there was no need to do it reflectively in the first case!).
@Jason Gross : thanks a good and clear example und explanation!
I mostly use a style where I have 3 layers:
I can see that if the proof automation gets deeper nested, I would have the troubles you describe.
Another big issue, I think, is when the middle layer is the bottleneck
Well I am also using VST floyd (rather deeply nested Ltac tactics) where this is the case to a certain extent. I am pondering since a while what would be the best way to redo it - your input is quite valuable to make the right decisions here. I considered reflexive because VST as is leaves the "3rd level" goals as new goals to the user - which makes sense for VST since these goals are generally application dependent. But then one might want to detect goals one can automate, and then one would be in trouble with a reflexive tactic - unless one wraps in Ltac2 and does the handling of 3rd level goals there.
But I also think that most of the slowness of VST could be solved by using cbv with explicit delta lists - it spends a lot of time in simpl and cbn - mostly to get goals into a form one can match on with
match goal but this also fails not that rarely because simpl does somethng different cause of something in the "user terms" emebedded in the VST terms. Frequently simpl also simply takes forever cause of embedded user terms. I frequently use "with_strategy" to get around such cases. But I think it would make sense to have better support for handling symbol lists - say "cbv with all symbols from module X" or so, so that one can use cbv with explicit delta with less maintenance effort.
@Pierre-Marie Pédrot : you started a solution for symbol table handling after the working group meeting - can you point me to the current state?
I haven't made much progress on this, but it's not very long to finish if I put a bit of energy in it. I'll have a look at it asap.
@Michael Soegtrop I've thought quite a bit about reflection over the years, and I do think there is some viability in it, even at scale, but Rtac's approach was a bit too limited. Reduction is a (maybe not the only) major issue that I see in it. @Janno had this thread https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/Controlling.20reduction.20.2F.20annotated.20casts which has some thoughts that he and I brainstormed together.
It seems to me that if
Qed is not slow, then the benefit of reflection is mostly about working around non-fundamental slowness issues in the proof engine, no?
@Jason Gross we've definitely seen issues where type checking is the slow thing, but I'm not sure if that is fundamental. My understanding is that systems like LEAN get away with constructing big proof terms by caching type checking (and some reduction?). Something like that might fundamentally change some of the tradeoffs that we've seen.
Another factor is that we have @Janno who really wishes he could be writing in Mtac2 all the time.
@Jason Gross : for me the main rationale behind switching to the reflective approach for complicated stuff was maintainability. The reification tactics are fairly trivial and the rest is proven correct. I spend substantially less time debugging tactics or adding new features to them than before.
But I think there is also a fundamental reason why reflection is faster - with tactics one constructs a proof term and the involved terms are necessarily quite large and even an efficient handling of proof terms is somewhat slow. With reflection I just do a computations and apply an opaque lemma that this computation is correct. In addition I do a computation of the type of the proof term (reduction of the interpretation of the reified term to the original goal) but as far as I can see I never do any computation on the proof term as such - it is just the lemma that the checker on the reified term is correct.
@Gregory Malecha : I also had issues with unification speed but then I added a computation step into the interpretation of the reified goal which produces exactly the original goal. The unification is then trivial and very fast. Originally I just unified
I a with the original goal and this could indeed take from very long to forever. With the additional computation step this slowness is gone (unification time in the few ms area).
@Gregory Malecha for non-reflective automation, how often is typechecking (Qed) the bottleneck, in your expressive, vs other things in the proof script being the bottleneck?
@Michael Soegtrop Yeah, being able to prove things about your automation is nice. (Have you compared reflection with something like Mtac2? IIUC, you can prove some things about your Mtac2 tactics.)
I'm not sure I buy your argument that reflection is fundamentally faster. If the bottleneck were truly the size of the proof term, then we'd see this as a bottleneck in loading the proof terms from disk. (And I think @Pierre-Marie Pédrot wrote a patch to make opaque proof loading lazy, so perhaps this is the bottleneck you're hitting?) But in my experience the time spent in the proof engine is much, much, much longer than the time spent in loading vo files, or even in Qed. I also don't think I've seen many issues where the bottleneck is the size in RAM of the proof term, though that's a bit harder to notice I think.
In most cases, I think it's not that hard to find an efficient proof term (by efficient I mean linear in the number of steps that a reflective tactic would take). (The only exception I'm aware of is rewrite_strat.). Is your experience different?
I agree with the linear part, but you seem to underestimate the cost of full type checking with respect to plain reduction, and thus of the linear factor. Here is a testcase I had written a few years ago. It shows that reflection can be several orders of magnitude faster, even without using
Require Import List. Fixpoint max (l : list bool) := match l with | nil => false | false :: l => max l | true :: _ => true end. Lemma In_max : forall l, max l = true -> In true l. Proof. induction l as [|[|] l IH]. - easy. - now left. - intros H. right. now apply IH. Qed. Lemma stepr_In : forall l, In true l -> In true (false :: l). Proof. now right. Qed. Definition l := Eval vm_compute in Nat.iter 1000 (fun l => false :: l) (true :: nil). Theorem In_l_1 : In true l. Proof. apply In_max. exact_no_check (@eq_refl bool true). Time Qed. Theorem In_l_2 : In true l. Proof. do 1000 right. now left. Time Qed. Theorem In_l_3 : In true l. Proof. do 1000 apply stepr_In. now left. Time Qed.
Finished transaction in 0.001 secs (0.001u,0.s) (successful) Finished transaction in 3.325 secs (3.312u,0.012s) (successful) Finished transaction in 2.082 secs (2.081u,0.s) (successful)
those are quadratic terms vs linear reduction though. with
Notation s := stepr_In the In_l_3 term looks like
s l (... (s [false;false;true] (s [false;true] (s [true] (inl eq_refl)))) ...)
quadratic due to the repetition of list suffixes
So, it begs the question: Is it possible to make it linear?
Also, is it really quadratic? I would expect all the suffixes to be shared.
they are shared in the memory representation, but typechecking does not exploit this sharing
sharing in coq is useful for memory efficiency and the fast path in equality checking, but literally nothing else
my understanding is that Lean is much better at this, so this example may be linear time in Lean
Unless they use hashconsing, I don't see how they could. Type checking cannot inherently make use of sharing, because there is no sharing a priori. For example, the
[false;true] that appears in the type of
s [true] (inl eq_refl) is not the one that appears in the value
s [false;true] (s [true] (inl eq_refl)), since it is synthesized by the type checker.
@Jason Gross : my (intuitive) argument is that when constructing a proof from say applying 100 lemmas, one has to type check all these 100 applications - some of these might be more complicated than the original goal. With a reflective approach Coq just has to type check the application of one lemma + do some fairly straight forward computations. As I said my tactic reduces the application by cbv with explicit delta list upfront to make the type checking fast. As far as I understand in a proof construction tactic a substantial part of the proof term is custom made for the specific case and needs to be type checked for the specific case. For a reflexive tactic the proof term is generic and not really looked at since it is opaque - the case specific parts of the proof term are already reduced to true or the original goal type when it comes to type checking.
Possibly I am a bit naive here but this also matches my experience. I wrote some tactics in Ltac and Ltac2 constructing a proof and reflective. From a speed point of view Ltac and Ltac2 don't give each other much in my experience - sometimes one sometimes the other os faster - but Ltac2 is much easier to write and to maintain. But performance wise my reflective tactics are in the low 1 digit ms area, while the same in Ltac / Ltac2 - putting in total a comparable amount of effort in it - more in the few 100ms area.
I also experimented quite a bit with Mtac2, but I find it rather hard to write - I think I would need an Agda style editor for it. May be it is just a matter of experience - possibly also of documentation - but I am quite a bit faster writing a reifier and wrapper in Ltac2 and the Gallina functions and proofs than the equivalent in Mtac2.
Unless they use hashconsing, I don't see how they could
I think they do use hashconsing
they even write papers like https://arxiv.org/abs/2003.01685
@Guillaume Melquiond I believe the proof term can be made linear using
let binders, but due to https://github.com/coq/coq/issues/11838 Coq will not be able to do typechecking in linear time
@Michael Soegtrop My experience is that reification is much faster than Ltac1 and Ltac2, but not because the proof term is too big. Much of the time is spent constructing evar instances, doing fresh name generation (often introducing cubic factors), needlessly retypechecking terms, redoing occurrence checks, etc. If you really want to compare reflective vs non-reflective approaches at the fundamentals level, the thing to do is compare
Goal G. Time abstract (apply correctness_lemma; vm_compute; reflexivity). Defined. with
Definition take1 : G. non_reflective_tactic. Defined. Goal G. Time let v := eval cbv delta [take1] in take1 in abstract exact v. Defined. (i.e., measure just the time it takes to check the proof). Insofar as
non_reflective_tactic is slower than
abstract exact v, it is (I claim) because the tactic is being inefficient at constructing the proof term.
@Jason Gross there are still large efficiencies that can be gained in the proof engine. I completely agree with your analysis there. I am very interested in your thoughts for how this could be done, my current belief is that it requires sometime a bit more than fixing some performance issues, but I could be wrong here.
I think that part of my claim is that when performance matters, we should not be writing automation that looks like
apply X; apply Y; ... with
repeat and such. "Real" automated theorem proving relies on special purpose algorithms and data structures that do things asymptomatically more effectively, and if we want to get to the same level, we need to leverage similar techniques. When you do this processing out-of-line, in many instances it is much easier to write reflective tactics because you are taking big reasoning steps.
rewrite_strat. It isn't fast, but it is faster than a simple implementation in Ltac/Ltac2 (this would probably be ~200 lines). The reason (I claim) is that the implementation tracks things like "is this a definitional equality (vs a probable one)" and constructs a proof term using this information. I believe (though I'm not certain) that the algorithm only creates evars to satisfy the interface to
typeclasses eauto and does not use them internally. This avoids certain inefficiencies in the proof engine but it also avoids occurs checks and such that are (almost, LEAN has some special optimizations for closed terms) inherent in evars and instantiation.
@Michael Soegtrop yes, the casting is necessary to get good performance. Unfortunately, it doesn't work well for us because the casting is too coarse-grained. What I'd really like is the ability to have fine-grained control over reduction by using quotes and splices a.la. multi-stage programming. I think that this would be generally useful to anyone using reflection.
concretely, you probably have lemmas like:
Lemma ok : forall a b, tactic a = b -> denote b -> denote a
With finer-grained reduction control, you can change this into
Lemma ok : forall a, denote (tactic a) -> denote b
and still use
vm_compute or other fancy reductions.
the nicer thing is that you can avoid syntactic environments like:
denote env g with a
Var case that looks up in
env and instead just quotes (to stop reduction) the user terms that you don't want to reduce.
Var is also useful to recognize multiple instances of the same term though
eg if you have
foo - foo ring will recognize that it's the same
foo and simplify to 0 even though internally we can't decide equality of possibly open terms
that is true. you can still have an environment, the main thing is that you don't actually need to separate out the computation of the tactic from the computation of the resulting goal.
@Jason Gross : in my experience in
Goal G. Time abstract (apply correctness_lemma; vm_compute; reflexivity). Defined. the
apply correctness_lemma can take a long time because it can be non trivial to unify it with the goal. What I do instead is to pose the correctness lemma, do a cbv with only the definitions used in the interpretation function
I on the type of the pose and then apply it.
If I understand you right, you are saying that the construction takes most time in "take1" rather than in the final checking. I would say this depends on the situation. What I am saying is that any handling of a custom constructed proof term (either in construction or checking - in the end I am interested in the total) takes longer than using a generic opaque proof term where the dependence on the goal is just in a parameter.
@Gregory Malecha : I don't quite understand what you mean. As described above I have a
Lemma ok: forall a, check a = true -> denote a.. I pose
ok a, do a cbv in its type to make it equal to the original goal, apply it and then do a compute on the boolean premise. What I am missing most here is an easy way to restrict the cbv on the type to the symbols used in denote, but help is on the way.
@Michael Soegtrop I see. A lot of the reflective procedures that we write simplify the goal rather than solving it. In this case, you want a lemma like the one that I wrote. (chapter 4, specifically 4.3.1, discusses this https://gmalecha.github.io/publications/files/malecha-dissertation.pdf, apologies for self-citing)
@Jason Gross how much backtracking/proof search do you use? Naively, backtracking increases proof search time without affecting Qed time, since most time is spent on terms that aren’t checked at Qed time
AFAICT reflection might still help, but only when Gregory’s claim applies, that is, when specialized data structures are faster than using the proof engine
There are performance improvements that are possible for the proof engine that would reduce the cost of proof search significantly, and @Jason Gross has listed a few (and reported bugs on many more). These improvements could significantly increase performance on the "normal" use of Ltac and such.
There are (in my belief) a different set of features that would further improve the performance of reflective procedures. And, I believe, that when your algorithms depart more from the "standard" way of writing automation, these will become more and more useful.
However, I think it is worth stating explicitly that you can do both. It is not the case that one makes the other more difficult or impossible.
I ran the experiment that @Jason Gross mentioned above on the simple
is_even n -> even n vs
repeat constructor and got the expected result, i.e. that reflection is much more efficient. That is very much a toy example though so extrapolation from that is probably pre-mature.
@Gregory Malecha : I see. Thanks for the link to your thesis! One question: you write:
Unlike VM reductions, the customizations to call-by-value reduction are not included in the final proof term, so the kernel falls back on lazy reduction (which I discuss shortly) to perform the actual check. This can have a substantial impact on the proof checking time compared to the time it takes to build the proof.
Is this still true in current Coq? I wonder why the "cbv with delta white list" on the type of
denote a I do does not lead to long Qed times. Without it I had cases where the apply took minutes. With it the apply is instant and Qed also. Shouldn't Qed become slow if the cbv options are not recorded in the proof term?
Also I still don't understand in which way the reduction control cbv offers is to coarse grain. I understand the problem you mention in you thesis that one might use the same function in denote and in embedded terms (say Z arithmetic or list functions) - in this case it would be useful to create some sort of alias, say via a module. Not sure if this works - up to now I needed only very few library functions and lemmas on these in denote and simply duplicated these. But otherwise I don't quite understand where the issues are.
cbv is not in the kernel
the kernel only knows lazy, vm and native
And even saying that it knows lazy is a bit misleading, since the lazy reduction of the
lazy tactic and the lazy conversion of the kernel are slightly different things.
cbn, etc, all behave the same way. They store the current goal as a cast in the proof term and then reduce the current goal into a new goal by computation. (
native_compute are the same, but they additionally annotate the cast in the proof term to indicate that
Qed should not use the default lazy algorithm.) By having this extra cast in the term,
Qed might type-check it faster.
Thanks, I see - I thought the casts contain more information on how the reduction was done.
So one can expect that the cast is enough of a bread crumb for the kernel to get the before and after state unified efficiently? If yes, is this so because one always knows in this situation which side to reduce first (the "before" state)?
I remember having a lot of trouble with Qed time from reductions done in hypothesis, since these don't leave a cast in the proof term.
Gregory mentioned Mtac2 earlier in the thread so I should probably expand on the performance relevant bits of Mtac2. The main thing that pertains to the discussion here is probably the way Mtac2 does typed tactics. The first aspect is not so much about the "typed" part of it but rather the fact that evars for subgoals are not created by default. They are only created on explicit demand. That way proof terms can be constructed without incurring (some of) the overhead associated with the proof engine. Roughly speaking, in Mtac2's typed tactics,
apply f; exact x is basically as fast as
exact (f x) while constructing basically the same proof term. (And "fast" here means constant time although that relies on all types being known statically which is not always the case.)
Transferring this idea to
Proofview could be tricky. The only way I see is to create goal evars lazily in an implicit way. When a
Proofview function actually requires access to the evar, the evar will be created.
exact (in its stupidest form) would not request the evar to be created because it would simply check the given term against the type of the goal. I hope other tactics can similarly be equipped with fast paths for the case when there is not yet an evar. With a good set of primitives like that it might be possible to construct some interesting proof terms without creating intermediate evars.
Evars are not the only slow part of proof runtime, though. The "typed" aspect of Mtac2's type tactics certainly also helps by avoiding typechecking whenever possible. With some proof developments, a surprising amount of (often repeated) typechecking can be done once at tactic typechecking time. Typed tactics even make it possible to selectively typecheck subterms when a check cannot be avoided, which further reduces the overhead at tactic runtime. Unfortunately, I don't think this aspect can be ported to any other tactic language without turning it into Mtac2.
Lazy evar creation seems like it is a good idea! I don't know what impact it would have on performance though.
Not communicating via side effects was one of my favorite parts of the Mtac programming that I did.
Capturing this in Ltac2 would mean decoupling the goal state from the computation.
Decoupling the backtracking is the other thing that would be nice in some instances.
Fine-grained type checking seems like it could be something that is expressible in Ltac2 with a fancier quoting /anti-quoting mechanism but it would require tracking types on
constr values, e.g.
constr (ty : constr Type) which makes things more than a bit more complicated
You would probably need something like
constr dynamic to support what exists now
constr Type turns Ltac2 into Mtac2. Or starts evolving Ltac2's type system into GHC?
My proposal was a bit of a strawman that is easily burned. Contextual type theory for dependent types seems like an unsolved problem (from a preliminary survey). I wonder if experimenting by writing Coq tactics in MetaCoq would provide some insights. But this feels like thread hijacking at this point, so I started https://coq.zulipchat.com/#narrow/stream/237658-MetaCoq/topic/Tactics.20in.20MetaCoq
How does Mtac2 deal with binders? (What's the performance overhead of adding another binder to the context / going under binders?)
\nu (x: T), <tactic code that can use x> which adds
x to the current environment and removes it when the continuation returns. This comes with an occurs check to ensure that
x is unused in the result of the continuation. The usual way to return from the continuation is
abs_fun x t which turns
fun y => t[y/x]. This also incurs a check to ensure that the resulting function is well typed, although it is generally much cheaper than full (re)typing.
What's the cost of adding
x to the current context? (In Ltac(2) it's linear in the size of the existing context)
push_named in OCaml. I think that's constant but I never checked.
There can be extra cost because of the name
In fact, there might always be that cost. I don't remember the exact implementation but I think we need to ensure that the name is fresh even when a specific name is given.
Ah, you have to check that the name is unused, so that's a linear cost, but a small factor
(I think the issue with Ltac(2) is that you allocate a new evar (instance))
Are you talking about
in_context in Ltac2?
That one is absolutely horrible and to be avoided at all cost. I guess you already noticed..
And constr:(ltac:(...)) in Ltac1
There's not really a way to do typechecking without in_context in Ltac2, right?
(I did have to jump through a bunch of hoops to avoid typing open terms)
I haven't really found a close alternatives to
in_context. I construct terms entirely unsafely and manage the universe constraints that typing would insert manually.
Is there a way to do multiple \nu / abs_fun s that isn't quadratic in the number of binders?
\nus should be easy enough to implement on the OCaml side. (I'd love to have unsafe primitive so that I don't have to touch the OCaml code but it's not going to happen anytime soon..)
abs_fun is a different story. The correctness condition for that is not simple at all and it took Beta and I quite a few hours of whiteboard time to convince ourselves that the current implementation is sound. I am not sure that we can fuse checks of multiple
abs_fun without losing confidence in its correctness.
Do you think it's possible to implement a proof engine with only safe primitives with adequate performance?
*adequate asymptomatic performance
FTR pushing a name in the context in OCaml is log(n)
I do not know how to answer the question. We would probably have to come up with a list of primitives that are deemed necessary before I can attempt to answer it.
Mtac2 proves that some things can be done safely and efficiently. Fusing
abs_funs might be in the cards if we had more people/time to work on Mtac2. But it is probably not the only thing that would have to be optimized. I know a few others things that hold back Mtac2's performance that I would want to fix before I am truly happy with its performance.
Last updated: Jun 09 2023 at 07:01 UTC