I expected that reimplementing a Gallina function in a tail-recursive form reduces memory consumption and execution time of `vm_compute`

, but this seems not true according to my observation. Also, I found the following sentence in the ICFP '02 paper on `vm_compute`

:

Consequently, we present a minimal subset of the ZAM, omitting several aspects of the real ZAM that are

irrelevant for our purposes, such asthe optimization of tail-calls, the use of a register to cache the top of the stack, ...

I don't get why it is "irrelevant". Can someone explain it? Also, is the situation different in `native_compute`

? cc: @Maxime Dénès

Cc @Guillaume Melquiond

I think you are reading a bit too much in that sentence. It is "irrelevant" as far as this section of the paper is concerned, not in general. In fact, Coq contains all those features. For example, if you run the following, you will see that the recursive call is optimized (`appterm`

instead of `apply`

):

```
Set Dump Bytecode.
Fixpoint foo (l: list nat) acc :=
match l with
| cons h q => foo q (cons h acc)
| nil => acc
end.
```

That said, it is possible that the compiler in Coq is not as powerful as the one in OCaml and some optimization opportunities are missed. You should fill a bug report in that case.

@Guillaume Melquiond Thanks! I will check my definitions with `Set Dump Bytecode`

. Is there any other option or command that is useful to understand and debug `vm_compute`

?

There is also `Set Dump Lambda`

, but the result is so close from the original Coq that it is kind of useless.

After spending a few days for investigation and benchmarking, I could not identify the reason for my unexpected observation (benchmark results) mentioned above. So let me share my actual problem here.

I have four different implementations of mergesort. Two of them (`CBN.sort`

and `CBNOpt.sort`

) use the non-tail-recursive merge function, which should be suited to the call-by-need evaluation strategy. The remaining two (`CBV.sort`

and `CBVOpt.sort`

) use the tail-recursive merge function (as in `List.stable_sort`

of the standard library of OCaml), which should be suited to the call-by-value evaluation strategy with tail-call optimization. Ones with `Opt`

in their name try to reuse sorted chunks in the input (as in `Data.List.sort`

of GHC) and thus should be fast if the input contains many chunks that are already sorted.

My benchmark results (NB: I'm not entirely sure if my code for measuring memory consumption is correct) show that:

`CBN`

variants are significantly faster than`CBV`

variants when computing the first few elements of the output of`sort`

using the`lazy`

tactic (Figure 1),`CBV`

variants are two times faster than`CBN`

variants when extracted to OCaml (Figure 6), but unfortunately,- there is no apparent difference when we compare them using
`vm_compute`

~~and~~(Figure 4`native_compute`

~~and 5~~).

If someone is willing to help me, I really appreciate it and suggest looking at the tail-recursive merge function.

Unfortunately, in `revmerge`

, you are creating a new closure whenever you process an element of `x`

, which is costly. It would be much better if `revmerge`

and `revmerge'`

were mutually recursive rather than nested, but I do not know if that would still pass Coq's termination checker.

Thanks! Let me try to fix it, but then, another issue is that Paramcoq often does not work for mutually recursive functions (at least in my experience) :sweat_smile:

It seems to me that we cannot do this better unless we improve the termination checker (to recognize lexicographic orders) or `vm_compute`

, but it was really good to know the source of my issue. Thanks again.

Did you try a well-founded definition using a lexicographic order to compare?

@Matthieu Sozeau No. I don't think it gives us Gallina terms that are efficient in `vm_compute`

. Does it?

I think it can give you reasonable if not top performance. You can fuel the accessibility proof so that the cost of a recursive call is not too high, but indeed it will still create closures I believe.

OK. I will try that. Thanks.

Do report here, I'd be curious to know of it fares!

I agree that fueling the accessibility predicate would be just as costly as creating the nested fixpoints, if not more. (And not fueling it would be disastrous, performance-wise.)

Progress report: I reimplemented the `revmerge`

function so that it uses accessibility proofs instead of nested fixpoints. However, it seems to require a `match`

expression returning a function to typecheck, which may cause a performance issue in `vm_compute`

. Did I miss something? @Matthieu Sozeau

```
From mathcomp Require Import all_ssreflect.
Set Implicit Arguments.
Unset Strict Implicit.
Unset Printing Implicit Defensive.
Definition mergeord {T : Type} (p1 p2 : seq T * seq T) : Prop :=
match p2 with
| (x :: xs, y :: ys) => p1 = (xs, y :: ys) \/ p1 = (x :: xs, ys)
| _ => False
end.
Fixpoint wf_merge (T : Type) (xs ys : seq T) : Acc mergeord (xs, ys) :=
if xs is x :: xs' then
(fix wf_merge' (ys : seq T) : Acc mergeord (x :: xs', ys) :=
if ys is y :: ys' then
Acc_intro
(x :: xs', y :: ys')
(fun _ H =>
match H with
| or_introl H0 => eq_rect_r _ (wf_merge _ _) H0
| or_intror H0 => eq_rect_r _ (wf_merge' _) H0
end)
else
Acc_intro (x :: xs', [::]) (fun _ => False_ind _)) ys
else
Acc_intro ([::], ys) (fun _ => False_ind _).
Fixpoint revmerge
(T : Type) (leT : rel T) (xs ys acc : seq T) (wf : Acc mergeord (xs, ys)) :=
let: Acc_intro H := wf in
match xs, ys return
(forall xs', mergeord (xs', ys) (xs, ys) -> Acc mergeord (xs', _)) ->
(forall ys', mergeord (xs, ys') (xs, ys) -> Acc mergeord (_, ys')) ->
_ with
| [::], ys => fun _ _ => catrev ys acc
| xs, [::] => fun _ _ => catrev xs acc
| x :: xs', y :: ys' => fun H1 H2 =>
if leT x y then
@revmerge T leT xs' ys (x :: acc) (H1 _ (or_introl erefl))
else
@revmerge T leT xs ys' (y :: acc) (H2 _ (or_intror erefl))
end (fun xs' => H (xs', ys)) (fun ys' => H (xs, ys')).
```

You could avoid creating closures there by just passing an equality to the `match`

and rewriting, that might help in `vm_compute`

Thanks. Should I also unfold `eq_rect`

in the definition?

A naive question: would passing directly `wf`

to the match help instead of destructing it before the match ? (the resulting code is shorter but the inner match still returns a function type `Acc mergeord _ -> _`

...)

@Kenji Maillard IIUC, the recursive calls of `revmerge`

should take `x :: xs'`

and `y :: ys'`

instead of `xs`

and `ys`

then, which is another source of a potential performance issue.

I think eq_rect doesn't need to be unfolded. Matching on eq_refl is cheap hopefully!

Actually, you might see a different performance difference it you use the Fix_F combinator that does inversion on the Acc proof "in the branch" instead of outside the branch as you do here.

Currently, one that uses accessibility proofs is slightly slower and consumes 6 times more memory compared to the nested fixpoint version. I will put my code and benchmark results later. Also, unfolding `eq_rect`

significantly improved memory usage.

Aha, that's an interesting fact! Is it weak-cbv evaluation?

In general combinators as generated by Coq behave badly in call-by-value because they evaluate their arguments eagerly for non-recursive arguments.

Add atop of that the fact that they create closures when a pattern-matching would not and that's one more reason to expand them away.

But for equality it probably does not make much difference if you evaluate the `p : P x eq_refl`

and the `eq_refl : x = x`

before unfolding, no?

The problem is maybe coming from reducing the return predicate (but this should stop early if it's weak evaluation)

New implementations (ones we were talking about are `Module CBVAcc := CBV_(RevmergeAcc)`

): https://github.com/pi8027/stablesort/blob/489f630b2b20f1ad7e7307e6f2142635e7ccb32c/theories/stablesort.v#L132-L801

New benchmark results (see Fig. 7 for `vm_compute`

): https://github.com/pi8027/stablesort/blob/8c8fb23e49c2a4c0b4fce6dc4a44dc441206c0f5/output/main.pdf

Matthieu Sozeau said:

Aha, that's an interesting fact! Is it weak-cbv evaluation?

`vm_compute`

performs strong reduction, but it's done by repeating weak reduction and readback. So I guess yes. I think that the point of fueling an accessibility proof is delaying the reduction of terms under `Acc_intro`

. But currently, it rather seems to add some extra cost (it's not surprising though). Redefining `wf_mergeord`

may improve the situation, but I have no concrete idea on this point.

BTW, `CBVAcc.sort3`

extracted to OCaml seems to have a better performance than `List.stable_sort`

(but excluding the cost of GC). This is really impressive.

You might want to try the Acc fueler (Acc_intro_generator) at the end of https://coq.inria.fr/library/Coq.Init.Wf.html on top of wf_merge_ord to see if it helps.

Can you post the extracted code too? I'm just curious :)

One thing I don't understand in the bench is that extracted ocaml code seems to always be 10x faster than native compute, isn't it strange?

Matthieu Sozeau said:

Can you post the extracted code too? I'm just curious :)

Yes, here it is. https://github.com/pi8027/stablesort/blob/bd0ac4f06a79318cf46d7c12a934971915310467/benchmark/ocaml/mergesort_coq.ml

Matthieu Sozeau said:

One thing I don't understand in the bench is that extracted ocaml code seems to always be 10x faster than native compute, isn't it strange?

I don't think so. Firstly, the measured execution time of `native_compute`

includes the cost of invoking `ocamlopt`

. Since the input is now compiled before the measurement, it should be reasonably small, but we probably still have to compute the same thing many times in `native_compute`

to reduce its impact on the benchmark result. Secondly, IIUC, `native_compute`

has to pay some extra cost compared to extracted code so that it can readback the result, but the cost of readback itself should also be reasonably small since it computes `sorted N.leb (sort N.leb xs)`

. https://github.com/coq/coq/issues/8019

It'd be fun to try certicoq on this example

`native_compute`

also needs to handle the case of open terms: https://coq.zulipchat.com/#narrow/stream/237655-Miscellaneous/topic/native.20compute/near/231767343

I would have expected that in this case the overhead of handling accumulators for the open term case was low but I guess that's wrong and we're still doing some reduction under contexts in this case, maybe for the Acc_intro proofs.

`Acc_intro_generator`

does not seem to improve anything. I'm not going to try CertiCoq by myself for now but would like to see its result too.

I found the fact that the termination checker accepts:

```
Fixpoint merge (T : Type) (leT : rel T) (xs ys : seq T) {struct xs} :=
(fix merge' (ys : seq T) {struct ys} :=
match xs, ys with
| [::], _ => ys
| _, [::] => xs
| x :: xs', y :: ys' =>
if leT x y then x :: merge leT xs' ys else y :: merge' ys'
end) ys.
```

but does not accept:

```
Fixpoint merge (T : Type) (leT : rel T) (xs ys : seq T) {struct xs} :=
(fix merge' (T : Type) (leT : rel T) (xs ys : seq T) {struct ys} :=
match xs, ys with
| [::], _ => ys
| _, [::] => xs
| x :: xs', y :: ys' =>
if leT x y then x :: merge leT xs' ys else y :: merge' T leT xs ys'
end) T leT xs ys.
```

In the latter case, I imagine that the termination checker could be improved to recognize that:

- the argument
`xs`

of the inner recursion`merge'`

does not change and is weakly decreasing, and therefore - the argument
`xs`

of the outer recursion`merge`

is strictly decreasing.

Is this another possibility to avoid creating closures? Meanwhile, `Unset Guard Checking`

also seems to be an option to verify my ideas, although it is unsafe in general.

How about a single fixpoint using the sum of the lengths of both lists as measure?

@Michael Soegtrop That requires traversing input lists twice and also another pattern matching on the measure. So I think it rather adds some extra costs for computation.

Ah yes, I keep forgetting that the measure proof is actually used during the computation - Guillaume explained it to me once, but the space for such details of logic is limited in a physicists brain. I demonstrated it in the below example, where I used an axiom for all proof obligations of the well founded induction. I hope I can remember it now.

With "traversing the input list twice", do you mean the additional traversal in the length terms of the measure proof?

I wonder what kind of extensions to the logic one would need to allow well founded recursion which does not need the measure proofs during recursion. It is not terribly elegant that it is not sufficient to have a proof that a fixpoint terminates but that one needs to evaluate this proof during recursion. I guess one cannot prove this with additional axioms, since axioms are usually not helpful for computation. For exporting proven correct algorithms this would be important.

```
From mathcomp Require Import all_ssreflect.
Require Import Program.
Axiom admit : forall {T}, T.
Obligation Tactic := exact admit.
Program Fixpoint merge' {T : Type} (leT : rel T) (xs ys : seq T) {measure (length xs + length ys)} :=
match xs, ys with
| [::], _ => ys
| _, [::] => xs
| x :: xs', y :: ys' =>
if leT x y then x :: merge' leT xs' ys else y :: merge' leT xs ys'
end.
Eval vm_compute in merge' Nat.leb [:: 1; 4] [:: 2; 3]. (* -> a long term containing "admit" *)
```

Does extraction erase that argument?

(I expect so)

With "traversing the input list twice", do you mean the additional traversal in the length terms of the measure proof?

I thought it has to compute `length xs + length ys`

but this is probably not true if we are talking about the `Program Fixpoint`

command. But instead, it has to traverse `xs`

and `ys`

to produce the accessibility proof. (Right?)

Then I have a similar variant of the `merge`

function here, where everything is written by hand instead of using `Program Fixpoint`

: https://github.com/pi8027/stablesort/blob/8c84308dbc2ecb1981a43972c7fa31f1fd29f030/theories/stablesort.v#L173-L233.

As @Enrico Tassi says, extraction erases the accessibility proofs: https://github.com/pi8027/stablesort/blob/b65c800c39911695c7b6ee3f48013e397c4133c5/benchmark/ocaml/mergesort_coq.ml#L277-L295.

Ah ok, extraction removes this. Still it would be nice if vm_compute would also remove it - I am a fan of computing stuff in Coq. If we have constructive logic, why not use it?

I you prove the measure, and `Define`

d it, then it should compute... Axioms have no computational content, even in a constructive logic ;-)

Using the "fueling" of accessibility proofs (see the end of Wf.v), you can avoid any more computation of the measures/accessibility proofs

What it should amount to is getting every `Acc_intro`

in constant time, but it involves closure creation, so that would be the culprit in this case too

`vm_compute`

cannot remove it as you might be lying that you have an accessibility proof for the total relation for example, and in that case it could just not terminate

@Enrico Tassi : what I meant is that it shouldn't need to look at the proof at all for just evaluating the fixed point. So it shouldn't matter if the proof is an axiom or whatever.

This breaks a lot of stuff, e.g. if your accessibility proof uses an axiom you lose decidability

Hum, there is something wrong in your intuition. Even if you just write `{measure ...}`

the actual term contains a `match`

on the proof term. That match is erased by extraction, since it is on a "proof" (a term like all the others ones, really, but labeled as `Prop`

and not `Type`

). This erasure has a justification (eg I think they prove it makes sense in MetaCoq) but it is not internalized in the logic. Sure, vm_compute or native_compute could do some erasure if the user asks so. But their results can also be used in proofs, as evidence. And the fact that your term `admit`

is not erased is important. I don't know if it would make the system inconsistent, but would surely kill some models.

This is the reason why SProp doesn't allow elimination of Acc into Type, btw

@Enrico Tassi allowing erasure is going to break SR

also you're going to get weird behaviours if you start adding choice axioms

What about improving the termination checker?

you're going to get a very powerful type theory

probably not supported by any model

after @Kenji Maillard 's discovery of guarded functions not valid in the exceptional model we already conjectuer that Coq's guard condition cannot be eliminated to combinators

(for a good notion of "being able to eliminate")

pointer plz

a bar discussion in Nantes followed by whiteboarding

ok, then tell me which round of drinks was it

the problem lies in the fact that Coq considers empty matches to be subterms of anything

but in the exceptional model, empty matches are not empty

so if you were able to eliminate this guarded function to eliminators you'd get a non-terminating term

obviously this only happens in an inconsistent setting but Coq semantic goes under binders so you're in trouble

I can formalize the example easily, let me write that

is the "exceptional" model important?

well, this shows that the guard condition is *already* stronger than eliminators

so it depends if you see this as a bug or a feature

looks more like a bug to me ;-)

thanks for explaining

the problematic term is `Fixpoint dummy (n : nat) (e : False) {struct n} : unit := dummy (match e with end) e.`

in the exceptional model it becomes something like

```
Inductive Falseᴱ := FalseE.
Inductive natᴱ := Oᴱ | Sᴱ : natᴱ -> natᴱ | natE : natᴱ.
Inductive unitᴱ := ttᴱ | unitE : unitᴱ.
Fixpoint dummyᴱ (n : natᴱ) (e : Falseᴱ) {struct n} : unitᴱ :=
dummyᴱ (match e with FalseE => natE end) e.
```

obviously this is not well-founded

since the exceptional model interprets these eliminators, it means that this cannot be obtained through eliminators, otherwise the exceptional model would be non-terminating, but it inherits this property from the target

(this is not a formal argument, and also the exceptional model does not interpret Prop singleton elimination, so YMMV)

I see your example is problematic but still do not get how it is related to my example.

commutative cuts are problematic in general, because they bring in some context

it's the typical stuff not validated by combinators

the required equations on that fixpoint are `dummy 0 e == dummy (match e with end) e`

and `dummy (S _) e == same thing`

this is satisfied by `Definition dummy n e := False_rect unit e.`

no?

this example is a bit degenerated because it returns unit, but you can probably meddle with it so that you get a fixpoint that doesn't exist in CIC

(well, the fixpoint exists extensionally because you're inconsistent but it computes in a way that CIC doesn't allow)

"probably"

what saves you here is that you don't observe the n argument, but if you start using it you'll get some nasty commutative cuts

the guard condition internalizes these commutative cuts, but they don't hold in all models, that's just my point

One thing that could improve the current situation is this: if you start from a closed term (no lying about accessibility proofs), then it is fine to use `CertiCoq`

and compute a weak-head-normal form of your term, after erasure. It will be guaranteed to terminate and give the same results as doing a conversion check with vm_compute (assuming your term is in an inductive type and, hereditarily, the arguments of the constructors are inductive (no functions)).

I expect that for reflection proofs this could be quite competitive with the existing vm/native conversions.

In the subset where you don't need "strong" reduction

@Enrico Tassi : I understand that this is so in Coq, but I am not enough of a logician to understand why it has to be like this. I understand that recursive functions need termination proofs, but I don't understand why one can't have not too complicated setup e.g. using lazy evaluation of proof terms where the proof terms are never evaluated when they don't occur in the final result for a specific case. Of cause I don't want to use axioms - this was just an example to see what happens and how the evaluated term looks like without embedded complicated proof terms. As I said, I am a bit naive in such things, as I never studied logic.

a.k.a. geometric logic

@Michael Soegtrop even with lazy evaluation you do need to look at the accessibility proofs to gradually peel `Acc_intro`

constructors at each recursive call. The source of this is that `(fix f a := b) c`

should reduce only if `c`

's head is a constructor (or evaluates to one). This check is no longer needed in the erased evaluation.

I don't think it has to be that way, but it is how type theory is wired up :-/

yes, I see that well founded recursion is defined like this - I just wonder if there are other ways to define it, so that the termination proof can be more easily abstracted away.

No that I know of. Naive question: what would break if we "internalize" Markov principle in this very specific case (I mean, you define something by Acc and you also get the erased term)?

this is something I wondered about

Pierre-Marie Pédrot said:

commutative cuts are problematic in general, because they bring in some context

Still do not fully understand it, but it looks related to the fact that the latter example here is obtained by lambda lifting, and also computing with the former example creates closures. Thanks anyway.

(ftr it shouldn't be called Markov's principle, it's rather indefinite description or something)

you probably lose decidability of type-checking if you don't keep the term around, but if you keep it you have to be sure that you get meaningful reduction rules

Pierre-Marie Pédrot said:

this is something I wondered about

I think it looks very natural (and easy to implement) :grinning_face_with_smiling_eyes:

keeping the accessbility term around or not?

(you have to check it's axiom-free still)

you only need to know which variables are in scope, no?

the erased term will not behave the same as the term with the annotation

so when you expand the fixpoint it's not clear that you'll get something that still type-checks

The reason I am after this is that I think something like a Coq based Computer Algebra System is not that far away. Coq is efficient enough for this and not verified math tools are sufficiently painful. In such a use case it would be a problem if defining efficient computing recursive functions is difficult.

I don't think extraction would help me - I need to much of Coq's stuff to do my work. Extracting partial functionality won't help me.

@Michael Soegtrop there are two very different problems though

either you just want to evaluate a function in general, or you want to evaluate it and plug the result in CIC

in the first case you can go through erasure or extraction, but in the second you need to be much more clever

(hence the VM overhead)

Usually I have the second case: proofs about stuff which is partly computational and partly abstract.

I could extract certain parts of my proofs and compute them outside of Coq, but why would I do this if computing inside Coq takes 30s?

Think e.g. of constructive reals: it is hard enough to make this efficient without having termination proofs in the path of computation.

I think my proposal, which does not require to modify the theory would help in that use case @Michael Soegtrop

Think of CertiCoq as a variant of vm_compute which works in just the "closed term" subset but gives you computation on entirely erased terms (not only proofs but also types get erased, which might drastically reduce the size of the terms)

I would still be integrated in conversion like native and vm_compute are.

So no need to "get out of the system"

@Matthieu Sozeau : I see - I thought CertiCoq was something for verified C export - is this part of it or did I get it wrong? Do you have some intro to CertiCoq I could read?

I would suggest looking at chapter 2 of Zoe's PhD thesis http://zoep.github.io/thesis_final.pdf for example

It is currently geared towards compilation to C and from there using CompCert but it can be applied in other settings. I wish I had the time to prototype a wrapper for the conversion test , but it's not looking like it will be ready tomorrow :)

@Matthieu Sozeau : thanks for the pointer! So as far as I understand this, this would work e.g. for a simplex algorithm in Q (see the discussion I had with @Xavier Allamigeon here : Zulip, especially gist on why one might want to do this). For a simplex algorithm in constructive reals I would (or rather might) end up with something which computes to arbitrary precision, but the Cauchy proofs would be stripped away, so that I can't use the results as real numbers any more. This might or might not be an issue.

Doing such things in Q is usually sufficient - doing it in constructive reals would just be an additional convenience and it depends on the achievable performance if one would do that. If one does this in Q, one needs a certain expertise in the decision how to round the involved reals to Q. Doing this in constructive reals would be fool proof - provided one has an implementation of a simplex algorithm which can handle the case that the ordering of two numbers is not decidable with reasonable effort. For that efficient well founded recursion would be helpful.

Last updated: Oct 05 2023 at 02:01 UTC