@Enrico Tassi that's exactly my point: why on earth do you do the wrong thing by default? Backtracking should be opt-in.
I agree.
Oh my.
But this mistake has very good consequences. Without it you could not organize your programs into clauses that easily
And this is key to =>
(adding a clause to your program within a scope).
Which in turn is key to handle binders
and many other idioms
I don't have a particular opinion on λ-prolog, apart from the fact that it looks like Frankenstein.
Popping up a level: how should terms be represented in an ideal(ly performant) system? Terms + hashconsing, DAGs with pointers everywhere, functional structures with "zippers", something else?
Should Evars be mutable?
was Frankenstein elegant?
Hashconsing is another beast. Filliâtre claims that everything should be hashconsed implicitly, but the assumptions behind this reasoning are flawed in Coq.
Cody Roux said:
Should Evars be mutable?
it's an implementation detail. In my experience: yes.
The problem with mutable evars is that I don't see how not to have a super buggy engine.
Barring a type system à la Rust
you trail the updates?
you could do that, but it has a cost as well.
especially in a heavily backtracking setting you'll lose all the benefits.
yes and no. a pretty standard (necessary) optimization is that if you don't have choice points, then you don't trail. if you do so, you are as efficient as ocaml (with mutable state).
looks reasonable.
but you also gain ocaml's GC on this heap we call evarmap
today we have no GC at all
Yeah, it's a clear bonus.
elpi "POC" (in the git repo) was written like Coq. When we move to imperative evars + trail we got 10x speedup
No GC is bad for memory consumption but I don't think it affects speed that much.
I am not surprised.
maybe in Coq it would not be that much, but I would be surprised if it was not visible
Is there some "canonical paper" for how to implement mutable Evars in this way? Are trails just suspended substitutions?
the trail is just to undo assignments
you still need to apply an evar to all its arguments
that happens when you get a unification failure?
which is a sort of suspended substitution
if I get what you mean
I feel like a 3rd grader in this thread
From my experiments, evar access has a clear cost but it's not the major source of speed issues.
@Cody Roux a trail is the standard trick to implement "immutability"-like behaviour with mutable state.
like in persistent arrays
it's a log of things you have to undo
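Roughly, in OCaml (a minimal sketch with made-up names, not the actual elpi or Coq code):

```ocaml
(* An evar is a mutable cell; every assignment is logged on a trail so
   that backtracking can undo it. *)
type term = Evar of evar | App of term * term | Const of string
and evar = { mutable body : term option }

(* The trail is just the list of cells assigned so far. *)
let trail : evar list ref = ref []

let assign ev t =
  (* log the update; a real implementation would skip the logging when
     there is no choice point left to backtrack to (see above) *)
  trail := ev :: !trail;
  ev.body <- Some t

type checkpoint = evar list
let checkpoint () : checkpoint = !trail

(* Undo every assignment performed since the checkpoint. *)
let backtrack (cp : checkpoint) =
  while !trail != cp do
    match !trail with
    | ev :: rest -> ev.body <- None; trail := rest
    | [] -> assert false
  done
```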
Ok. But in what situations do you have to undo Evar instantiations? I feel like I should know this...
If you backtrack.
tac1 || tac2
Ok. But this doesn't happen at elaboration time.
if tac1 assigns some evars before failing, then tac2 should start from scratch
Oh, it probably does.
Who knows what case elaboration does!
eg, during type class instance search
Right, but that's more of a bug than a feature.
Type class search is collapsed to its first result in elaboration currently.
aka "the pretyper is not in the monad"
So in elaboration proper we don't really have backtracking. (We do in unification, type class search, tactics in terms, etc.)
Ok this is good to know.
We would like to have a bit more controlled backtracking in some cases.
Like the pretyper assigning universes too eagerly.
TC search can backtrack internally
The dreaded "foo has type Type but was expected of type Set" error
where you never wrote Type in the first place
anyway, backtracking to the beginning of the conversation
proof engine needs fast evars
I wish I understood universe polymorphism better tbh. When I was young and stupid I thought it was easy to have explicit universe quantifiers in the kernel, but now I'm not so sure.
It is easy.
It is just a catastrophe performance-wise.
Ok, maybe I was confused by the number of critical bugs involving them.
Ah, this is just a historical accident due to 8.5
"The StillBorn 9.0 from Hell"
I love how we present Coq to other people vs how we speak about it internally
where three (four?) major branches were merged at the same time, resulting in a code mess we're currently finishing cleaning up
Note that I got user reports that CoqIDE 8.5 was better than CoqIDE 8.6 :-P
Enrico was the leader of the ParalITP faction
STM + univpoly + primproj + logic-monad
"Forever thou shalt be punished for the introduction of the STM"
what else?
univpoly / primproj are in the same package
Software Transactional Memory?
I'm about to replace it with a new thing that does the same but is called differently
State Transaction Machine
the name has been long forgotten since
it's a mix (hence the mess) of a document manager and a document-checking manager
first hit on google! Nice.
it is but a cursed syntagm that often bars your path on the way to enlightenment
I remember confusing conversations with Makarius actually.
so yes, merging three major branches in one release is not good for the soundness of your kernel
Pierre-Marie Pédrot said:
univpoly / primproj are in the same package
that was part of the mess, IMO ;-)
ouch
true also
the compat layer
oh noes
I'm going to bed now
you'll have nightmares
too late
Thanks to both, this was very informative!
the eldritch horrors of side-effects are going to haunt you beyond non-euclidean monadic code
If you say so ;-)
let ( * ) x y = x + y
(today, PMP)
good night
wut. this is perfectly legal.
g'night
No really. The only thing it takes is some amateur experience of trying to implement your own DTT/ITP
proof assistants are very complex objects.
No GC is bad for memory consumption but I don't think it affects speed that much.
Memory consumption should affect speed... Except if you've already given up hope for cache locality.
which seems plausible anyway; ASTs are a nightmare for locality. IIRC this was a major pain point for Scala compilers, even with Java’s great GC — probably the lack of value types on the JVM makes locality there even worse than needed.
I wonder if Lean's ref counting + reallocate if refcount = 1 and you're creating a cell of the same type helps with this
ah, I was assuming imperative ASTs; purely functional ones are even worse, but destructive updates are slightly less bad
I haven’t done measurements recently, but basically the cache footprint is O(size of written memory) (rounded up, since each write pulls in a whole cacheline — typically 64 bytes), so n purely functional updates are likely to have n times the cache footprint of the imperative version, until you trigger a minor GC and start reusing memory.
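(Concretely, the contrast is between something like these two updates — hypothetical types, just to illustrate the footprint argument:)

```ocaml
type node = { tag : int; payload : int }
type mnode = { mtag : int; mutable mpayload : int }

(* Functional update: every call writes a fresh record, so n updates
   touch O(n) distinct memory until a minor GC recycles it. *)
let update_pure (n : node) v = { n with payload = v }

(* Imperative update: n updates keep rewriting the same cacheline. *)
let update_mut (n : mnode) v = n.mpayload <- v
```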
OTOH, naive refcounting can be a pretty expensive form of GC, especially when done pervasively in a language implementation — it’s even worse as soon as you need atomic refcount operations because you’re running more than one thread. There’s a reason no good virtual machines use refcounting.
but that doesn’t necessarily apply when you’re doing mostly manual memory management with just a bit of refcounting as needed.
I'm 100% with Enrico on this discussion. I think Coq's proof engine has several catastrophic performance decisions; the more I have profiled it the scarier it looks: linear memory consumption for each goal! It is hard to do worse! In that sense researchers have an easy job, as almost any other approach they use will be faster.
Pierre-Marie, logic programming is a fast implementation of a slow idea. And no doubt the techniques they use are _at least_ an order of magnitude faster than what we have.
[I see that simpl appeared in the scary thread, I can sleep now]
Evar implementation is also something that has been studied in isolation...
... and some mature LPs add tabling :-P, even if it's much scarier
I'd love to see a bulleted list of the worst offenders in terms of day to day Coq usage. Obviously that might depend on the development, but what makes each goal linear for example?
As I mentioned above, the naive evar implementation has a definite cost but it's not a leading one in day-to-day use. I believe that using an efficient representation for evars wouldn't bring more than a ~15% speedup (educated guess). So it's not negligible, but it's not a tenfold improvement.
The linearity in goal size is due to the representation of evar instances: an instance is just the list of the evar's arguments.
This is a stupid representation, since most of those arguments will just be the corresponding variable of the context.
Unfortunately one cannot use a compact representation without seriously changing the kernel.
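(To make that concrete, a hypothetical sketch of a compact instance representation — not the kernel's actual datatype, which today is morally the Dense case only:)

```ocaml
(* An evar instance maps each variable of the evar's context to a term.
   Since almost all entries are the identity, one could store only the
   exceptions. *)
type 'term instance =
  | Identity                       (* every argument is the matching context variable *)
  | Sparse of (int * 'term) list   (* only the non-identity entries, by position *)
  | Dense of 'term array           (* fallback: the full argument list *)

(* [var i] builds the i-th context variable; positions absent from a
   Sparse instance default to it. *)
let arg ~var inst i =
  match inst with
  | Identity -> var i
  | Sparse l -> (try List.assoc i l with Not_found -> var i)
  | Dense a -> a.(i)
```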
Typical suspects for day-to-day slowness are: 1. universes, 2. hashconsing, 3. a hell of a lot of naive term traversals in unification.
I am working on 1. because it's really a pain and it can probably be alleviated with a bit of engineering (e.g. persistent arrays)
Pierre-Marie Pédrot said:
I am working on 1. because it's really a pain and it can probably be alleviated with a bit of engineering (e.g. persistent arrays)
how do persistent arrays help universes?
The fact that we have a log n access cost is observable.
amortized O(1) would be better
unfortunately the AcyclicGraph API is a bit ugly, so we have to first simplify it
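(For reference, the persistent arrays in question are presumably the classic Baker / Conchon-Filliâtre structure: O(1) access on the newest version, with older versions reachable through a chain of diffs. A sketch:)

```ocaml
(* A persistent array is a mutable indirection: either the real array,
   or a diff against a younger version. *)
type 'a t = 'a data ref
and 'a data =
  | Arr of 'a array
  | Diff of int * 'a * 'a t

let make n v : 'a t = ref (Arr (Array.make n v))

(* Reroot: make [t] hold the real array, pushing diffs onto the other
   versions. Accessing the most recent version is then O(1). *)
let rec reroot (t : 'a t) : 'a array =
  match !t with
  | Arr a -> a
  | Diff (i, v, t') ->
      let a = reroot t' in
      let old = a.(i) in
      a.(i) <- v;
      t' := Diff (i, old, t);
      t := Arr a;
      a

let get t i = (reroot t).(i)

let set t i v =
  let a = reroot t in
  let old = a.(i) in
  a.(i) <- v;
  let t' = ref (Arr a) in
  t := Diff (i, old, t');
  t'
```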
- universes
This thread suddenly became very interesting to me! Are there any plans to address the slowness of CClosure with polymorphic constants? I have flamegraphs if people want to see scary stuff. (yay, back on topic!)
Oh, that one is a completely different issue.
Polymorphic universes have so many performance issues that I don't think the secondary ones should be prioritized.
I mean, I won't oppose your PRs, but when universe computations represent ~20% of the runtime of some heavy monomorphic developments, I am afraid I will not spend a lot of time trying to fill the Danaïdes' barrel
My personal take on that is that polymorphic universes have fundamental design issues w.r.t. performance.
Ooh, sorry, I assumed you were talking about polymorphic universes.
In any case, polymorphic universes easily represent 50% of my runtime :)
Well, polymorphic universes are also affected by slow constraint folding.
Oh, I know. My PhD student was brave enough to start writing shallow models of CIC in CwF style.
It's a small dev, a few hundred lines.
Very quickly all definitions have hundreds of free universe variables.
And then suddenly, a wild Qed appears, which takes 40 minutes, most of which is due to universe trickery.
It's impressive how easy it is to completely blow up the complexity with univpoly.
The trick to get a fast Qed is to use the buggy Time Qed, which ignores the time taken by universe minimization!
I think that in this case minimization is not the culprit; the kernel is really suffering and panting heavily.
Ah, then I have no solution to offer :D
But I do want to mention that I observe obscene CClosure slowdown even when polymorphic constants have very few universes.
Yeah, it's proportional to the term size, not the number of universes
(as long as nuniv > 0)
Right.
I think that one of Beta's PRs could have been tweaked to be more efficient and partially solve your problem.
IIRC Beta closed it because it had a nontrivial cost over monomorphic devs.
I think that was https://github.com/coq/coq/pull/8747/, right?
Ah yeah. My bad, it was your PR.
I recently asked Josh to help me revive it but I didn't have time to get that started.
Shouldn't be very hard.
I have to admit I have no idea what exactly is going on there. The initial implementation was easy enough to understand but then Josh took over and it quickly went over my head.
I happen to be working on substitutions today.
maybe I should try to do the same patch in my variant of substs.
(https://github.com/coq/coq/pull/13537, the revival of a PR I've closed, but I am decided to fix the small slowdowns it incurs)
I am rereading the changes in 8747 now and it doesn't seem that scary, actually.
Yeah, I agree.
Regarding 1. (universes): I think the problem also comes from the fact that we use a too-simplistic universe algebra. We cannot keep l+1 in universe instances, so we always introduce a fresh l' (usually with l+1 <= l') when needed. Each subsequent use of a definition where we did that will generate a fresh universe and constraint, and this just piles up to give humongous universe instances. If we allowed algebraics everywhere we would need far fewer fresh universes, actually 0.
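(A hypothetical sketch of what "algebraics everywhere" means — not Coq's actual Univ module:)

```ocaml
(* Levels are atomic; today, only levels may occur in universe instances. *)
type level = Set | Var of int

(* Algebraic universe expressions over levels. If instances could carry
   any [univ], l+1 would simply stay symbolic as [Succ (Level (Var l))]. *)
type univ =
  | Level of level
  | Succ of univ
  | Max of univ * univ

(* Today, to put l+1 in an instance, elaboration must invent a fresh
   level l' together with the constraint l+1 <= l'; every later use of
   the definition repeats the game, piling up fresh universes and
   constraints. *)
```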
I partly subscribe to this thesis.
That is, it is fairly obvious we do generate too many universes.
But I don't believe that introducing algebraics everywhere will be a silver bullet.
Regarding your proposal, Janno: I guess it would be relatively easy to make CClosure not substitute universes eagerly by just maintaining a current universe substitution. You never tried that, Pierre-Marie?
Isn't that essentially what that PR did?
I don't think it will solve all issues, and inference will be more complicated probably, but it will help
BTW, a common offender for universe generation is setoid_rewrite.
It refreshes the pattern a gazillion times.
I suspect there is actually a bug, since we should be able to share the fresh pattern when it doesn't unify.
Pierre-Marie Pédrot said:
Isn't that essentially what that PR did?
Yep, it seems very similar
Re setoid_rewrite: I thought we had an optimization that avoided refreshing unless necessary (e.g. when going under binders, and even that could be optimized by "weakening" the evars in the pattern)
Pierre-Marie Pédrot said:
As I mentioned above, the naive evar implementation has a definite cost but it's not a leading one in day-to-day use. I believe that using an efficient representation for evars wouldn't bring more than a ~15% speedup (educated guess). So it's not negligible, but it's not a tenfold improvement.
To me a "decent" representation of evars includes the fact that the "heap" would be garbage collected.
As we know, today the evarmap is not garbage collected, and at the end of the proof it contains the entire history of the proof-building process.
Of course, not having a GC is not a problem if you have enough memory and if you never traverse it. Both assumptions are false, since we have users hitting the 64G limit (of normal machines) and the ocaml GC is going to traverse (for nothing) this huge heap from time to time.
As I said, elpi was coded like that a long time ago. Doing list reversal was, IIRC, 22000x slower than ocaml. By "manually" adding instructions to remove useless cells from the heap, it went down to 1000x slower IIRC. Of course in elpi everything is an evar; in Coq you do other stuff as well. But using less memory also makes your (ocaml) code much faster.
I would try to use weak pointers in the evar map. I don't know if ocaml can handle so many of them, but at least we would get a figure. With "imperative" evars one would not need weak pointers, but this change is much harder to do and to get figures about.
@Enrico Tassi evar maps do not only contain evars; universes are, I believe, more often than not a huge memory leak.
I was profiling memory consumption of Require recently and realized that a good chunk of the allocated memory in perennial came from universe naming.
Like, we want to give fancy names to universes, and actually 99% of our users don't give a shit, but for them it will still be a few hundred MB
For universes, I have a different dream: a -type-upto 3 flag which keeps at most 3 non-floating universes in the graph and collapses all floating ones onto these 3 after each type inference.
640k should be enough for anybody?
I boldly claim 90% of computer scientists and 80% of mathematicians would be happy with that.
The problem with that is that you will probably make eager choices wrongly, even for 99% of users.
Also 3 is a nice number ;-)
I'm not so sure. 99% of the time you mean Type0
I'd be interested in such an experiment.
But we would probably realize some horrible problem midway.
Practice should tell, but if we have algebraics everywhere, maybe a lot of people would be content to use Set, Set+1 and Set+2 directly
At some point @Assia Mahboubi did it with math comp (a long time ago) by replacing all Type with MyType, from Definition MyType := Type. IIRC she needed 3 MyTypes.
Matthieu Sozeau said:
Practice should tell, but if we have algebraics everywhere, maybe a lot of people would be content to use Set, Set+1 and Set+2 directly
It would certainly be useful in the doc.
Also, if Set is really Type_0, it would be nice to have a way to check for this explicitly in Coq.
@Matthieu Sozeau what are you waiting for? That's more useful than induction-recursion, which will be used by Agda fanboys to prove the termination of simply-typed λ-calculus for the 4591460th time.
Well, I need help with the Bender et al. algorithm :)
For the -type-upto 3 thing, I think any algorithm would be efficient enough ;-)
Théo Zimmermann said:
Also, if Set is really Type_0, it would be nice to have a way to check for this explicitly in Coq.
What do you mean?
Note that in the case of Set, Set+1 and Set+2, we don't even need to check the graph most of the time
For the algebraic case... well, not so easy I'm afraid. (I mean with succ and max freely mixed)
@Enrico Tassi but probably better in the long run
@Enrico Tassi I'm wondering how you could decide, if you have x < y, what you should assign to x and y.
X := Type0, Y := Type1
yeah!
So kind of "minimal" solutions?
Actually, I believe that we should have a more principled approach for UGraph where this would be easy to perform.
Currently we have this hacky type-in-type stuff.
If we could use various implementation of the graph, it could be easy to switch from them.
There is no conceptual difference between a graph algorithm that collapses everybody onto one point, and one that collapses only the <=-related nodes.
For me, with type-in-type we should not consult the graph at all. But yes, we can also interpret < and <= as = to model the same thing
I have the impression that in such a mode one could live with Set=type0, Type=type1, U=type2 (I mean, even be explicit in the concrete syntax).
For _ one can still delay the choice, but it seems hard to believe that one gets many i < j for i and j not related to one of the 3 named universes.
if we have algebraics in instances we can also implement a non-cumulative mode
@Gaëtan Gilbert yes, indeed that's one nice benefit.
Down with Cumulativity! Let's turn Coq into Agda!
Haskell!
@Enrico Tassi indeed in many cases they will be related
That is my intuition, but I understand that "it works for 3" is not very satisfactory, intellectually speaking...
Enrico Tassi said:
At some point Assia Mahboubi did it with math comp (a long time ago) by replacing all Type with MyType, from Definition MyType := Type. IIRC she needed 3 MyTypes.
Indeed. Disclaimer: it was a long time ago and mathcomp was way skinnier at the time.
That's still some good evidence!
Enrico Tassi said:
That is my intuition, but I understand that "it works for 3" is not very satisfactory, intellectually speaking...
It is a common situation in computer algebra systems to implement a portfolio of algorithms for the same task (e.g. for multiplying matrices or polynomials), each of which is better for a different class of inputs, due to the size of the constants hidden in the big O...
we have Print Sorted Universes to tell us how many are needed btw, no need for hand-replacing Type with MyType
The idea was to measure the gain in performance. It was much faster with explicit definitions.
see also https://github.com/coq/coq/commit/1c25f8bc8d926e85667d321555775200ddfd17d2
Are you saying that we test that the stdlib works with 3 universes?
@Enrico Tassi yes.
I think it's 5 currently, but that's the idea.
Actually 4.
see test-suite/misc/universes.sh
@Gaëtan Gilbert speaking of which, I am currently extruding universe sorting / printing from AcyclicGraph / UGraph, are you fine with that?
depends how the code looks
but why?
This stuff has no business being in the kernel (and it's the sole user of safe_repr :/)
I want to replace the implementation of Point, and it's a mess.
sort is a big hack, to start with.
Instead, I will just dump the graph and print that in vernac.
if Hurkens needs 4 to prove false, then limiting to 3 universes is also about soundness ;-)
@Enrico Tassi You, sir, are a genius.
I'm afraid @Pierre-Marie Pédrot I cannot share your analysis, much less the 15% number you came up with; I think Enrico gave numbers closer to the actual evidence people have. The current engine is just not in the same performance class as other proof search systems.
By linear, I didn't mean that memory consumption per goal is linear, which it is, but that it is also linear in the number of evars ever created. That's just scary. We have discussed the proof engine many times, and I'm still to be convinced by the defenses of many things, including the high-level design where refine is not a part of the primitives.
@Pierre-Marie Pédrot can we try to replace the IntMap in the evd with https://caml.inria.fr/pub/docs/manual-ocaml/libref/Ephemeron.GenHashTable.html (plus a call to nf_evar on the initial evar, say, every 1K evar allocations, or at the end of every run of the tactic monad)?
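(For what it's worth, a sketch of what that could look like, with hypothetical names. One caveat: with evars named by plain ints, as today, ephemerons buy nothing, since immediate values are never collected; the key would have to be a boxed token kept alive by the terms mentioning the evar.)

```ocaml
(* A boxed evar key: reachable from the terms that mention the evar,
   collectable once no term does. *)
module EvarKey = struct
  type t = { id : int }
  let equal a b = a.id = b.id
  let hash a = Hashtbl.hash a.id
end

(* A hash table whose bindings can vanish once their key is collected. *)
module EvarTbl = Ephemeron.K1.Make (EvarKey)

(* e.g. [let defs : evar_info EvarTbl.t = EvarTbl.create 5003] *)
```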
You'll probably get evars reclaimed too eagerly, but, yes, please try.
Also, weak cells / ephemerons are far from being free.
I'm too busy with the release, but maybe someone at the CUWD would like to play with this
I tried something like that but I never managed to make it work, even after adding a copy [keep in mind that evar_maps are purely functional whereas hash tables may not be]
But indeed we have some other problems, like the imperative evar_counter; that's a mess once we try ocaml-multicore
Not even equality of evars in two evar maps is properly defined if we want to assume some separation rules.
Imagine how fast Coq + multicore will be when each evar creation introduces a global lock :S
But let me repeat something: universes are a worse hotspot than that.
Not so sure in view of @Jason Gross's results
how can you measure that
This has nothing to do with the imperative representation of evars.
the kind of slowdown we are talking about is hard to measure, because it is intrinsic
It's linked to the representation of instances, which is not related.
we are not just discussing the "imperative representation of evars"
that alone does little
I have no idea what you want to implement then, because I don't see how you're gonna gain anything with it.
we are talking about the whole design of the proof search engine; you can have an imperative representation and perform worse than the current evar_map. Some advanced tricks are needed to make it fast.
Evar creation is already linear in the size of the instance.
Anyways, I prefer rebasing my branch to vacuously trolling, so have fun.
for example, I'm convinced that evar == goal is not the right notion; I'm just repeating the discussion
But don't expect even a 2x speedup, however fast your evars will be.
we had some time ago
instantiating a unification variable is a totally different concept than solving a goal
and indeed the engine does not provide operations for either
by engine I mean the proofview
I'm not trolling, by the way; I'm dead serious
I am waiting for your PR.
I wish I had the time to back my claims with code
I don't say this is a small project
I am already busy trying to have coqc not recheck the same stuff over and over again when a user changes a line :)
which is like 2 orders of magnitude less ambitious but still very tricky
I would kindly recommend getting rid of the STM if you want to stop having a gaping memory leak in coqc, btw.
Re. the representation of evars: even if we made evars "closed", i.e. of closed type (forall ctx, ty), I don't see how that would help in any way with the issue of evar instance representation; rather the other way around: it would make it harder to have compact representations for instances unless we go all the way to representing explicit substitutions in terms. However, that experiment (using explicit substitutions pervasively) was tried and failed in both Coq and Matita IIRC.
I was under the impression that the bulk of the performance issue in the rewrite variants was allocating evars every time there's a head constant match or something like that, and so there's this pervasive issue where rewriting gets slower the more lemmas are being used to rewrite. Is this not the case? (Btw, if it's helpful, I can open an issue with the "runs out of RAM and is way too slow" example from my defense)
In past projects I used something like this:

```ocaml
type 'a ast =
  | X of 'a node
  | …
and 'a node = { annot : 'a; data : 'a ast }
```

Then most of the code operates on 'a asts and there's a function reset_annots : 'a ast -> 'b -> 'b ast that you use before doing a transformation. This works well for turning quadratic operations on the AST into linear ones, but you don't get incremental updates of the metadata when you make changes to the AST (you have to reset all metadata before each pass). @Pierre-Marie Pédrot have you considered something along these lines?
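(A runnable mini-instance of the pattern, with made-up constructors:)

```ocaml
type 'a ast =
  | Var of string
  | App of 'a node * 'a node
and 'a node = { annot : 'a; data : 'a ast }

(* One linear pass that discards old metadata wholesale: each
   transformation then starts from uniformly reset annotations instead
   of patching them incrementally. *)
let rec reset_annots (t : 'a ast) (b : 'b) : 'b ast =
  match t with
  | Var x -> Var x
  | App (f, a) -> App (reset_node f b, reset_node a b)

and reset_node (n : 'a node) (b : 'b) : 'b node =
  { annot = b; data = reset_annots n.data b }
```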
Note that this is already implemented in Coq, see lib/cAst.mli
Ah, super. Then I didn't understand PMP's point about quadratic blow-up
@Clément Pit-Claudel that's not what the kernel uses. And even if we wanted to use this structure, it'd be an engineering nightmare. It'd require a serious rewriting of Coq, we'd pay non-trivial overheads all over the place, and it's quite dubious from a trust point of view that you can maintain meaningful invariants there.
CAst is only used for syntactic data the user inputs, FTR.
Oh indeed I was talking about the AST part, kernel terms are a different story.
But I think in most cases you are dealing with ASTs