We are pleased to announce Tactician's API, a new AI interface for theorem proving that builds on Tactician. This includes a new graph-based dataset of over 520k definitions (of which 260k are theorems) across 120 Coq packages, one of the largest datasets for AI-driven interactive theorem proving. We also present Graph2Tac, a new state-of-the-art neural theorem-proving solver designed to prove theorems in Coq projects containing new definitions not seen during training.
See the announcement on Coq Discourse.
We would love to hear your feedback!
I guess the most obvious questions might be: why focus on Coq 8.11, and was there a particular strategy for curating the dataset?
I'll let @Lasse Blaauwbroek (who built the dataset and interface) answer the first question and a half. Although, these paragraphs from the Graph2Tac paper might prove helpful:
The dataset we utilize is extracted from 120 different Coq packages from the Opam package manager. These packages were selected by a SAT solver as the largest mutually consistent set of packages available for Coq v8.11. Their topics vary wildly, including analysis, compiler and programming language formalization, separation logic, homotopy type theory, and much more. The graph extracted from these formalizations consists of over 250 million nodes, which encode 520k definitions, of which 266k are theorems, and 4.6M proof state transformations.
We divide packages into training and test such that no training package depends on a test package. To do so, we induced a random topological order on the Coq packages with respect to the dependency graph. The resulting list was then split such that the average percentage of theorems and of proof states in the training split is close to 90% (in our case, it is 91.3% of all theorems and 88.8% of all proof states).
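To make the quoted split procedure concrete, here is a minimal Python sketch of the idea. The package names, dependencies, and theorem counts are invented for illustration; the real pipeline works on the actual Opam dependency graph and the full dataset.

```python
# Illustrative sketch only: induce a random topological order on the
# packages with respect to the dependency graph, then cut the list so
# that roughly 90% of theorems fall in the training split.  Training is
# a prefix of the order, so no training package depends on a test one.
import random

DEPS = {                      # invented: package -> its dependencies
    "coq-base": set(),
    "coq-lib": {"coq-base"},
    "coq-app": {"coq-lib"},
    "coq-other": {"coq-base"},
}
THEOREMS = {"coq-base": 5000, "coq-lib": 3000, "coq-app": 1000, "coq-other": 1000}

def random_topological_order(deps, rng):
    order, placed, remaining = [], set(), set(deps)
    while remaining:
        ready = sorted(p for p in remaining if deps[p] <= placed)
        pick = rng.choice(ready)          # random tie-breaking
        order.append(pick)
        placed.add(pick)
        remaining.remove(pick)
    return order

def split_packages(deps, theorems, train_fraction=0.9, seed=0):
    order = random_topological_order(deps, random.Random(seed))
    total = sum(theorems.values())
    train, covered = [], 0
    for pkg in order:
        if covered / total >= train_fraction:
            break
        train.append(pkg)
        covered += theorems[pkg]
    return train, order[len(train):]      # (training, test)

print(split_packages(DEPS, THEOREMS))
```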
Coq 8.11 is now around 4 years old; why focus on that particular version?
Simply because we wanted a stable and immutable base for experimentation. The foundations of this project were laid almost exactly three years ago, so at that time Coq 8.11 was not yet super out-of-date. From a machine learning perspective, consistently running experiments on the same version of Coq and on the same formalization projects allows for tracking progress over time more easily. In addition, consistently upgrading code to newer versions of Coq comes with nontrivial developer cost.
However, we are certainly planning to upgrade the code to newer versions of Coq! We acknowledge that supporting newer versions is important for end users. (Note that Tactician and its default models already track Coq's master branch; it is only the new API for external provers that lags behind.)
was there any particular strategy/methodology when curating the dataset?
We used a SAT-solver (aspcud) to select the largest mutually co-installable set of Opam packages (they need to be co-installable because our dataset is a single, interconnected graph). Besides being the largest dataset we can easily obtain, this has the advantage of impartiality from our side (we did not cherry-pick packages).
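Just to illustrate the selection problem (this is not aspcud and not how the real tool works), here is a toy Python brute force over a handful of invented packages with invented dependency and conflict constraints:

```python
# Toy version of "largest mutually co-installable set".  At Opam scale
# this exhaustive search is hopeless, which is why a dedicated solver
# (aspcud) is used instead.
from itertools import combinations

PACKAGES = ["coq-a", "coq-b", "coq-c", "coq-d"]        # invented
DEPENDS = {"coq-b": {"coq-a"}, "coq-c": {"coq-a"}, "coq-d": {"coq-b"}}
CONFLICTS = {frozenset({"coq-c", "coq-d"})}            # cannot coexist

def co_installable(subset):
    deps_ok = all(DEPENDS.get(p, set()) <= subset for p in subset)
    no_conflict = all(not c <= subset for c in CONFLICTS)
    return deps_ok and no_conflict

def largest_co_installable(packages):
    for size in range(len(packages), 0, -1):
        for subset in combinations(packages, size):
            if co_installable(set(subset)):
                return set(subset)
    return set()

print(largest_co_installable(PACKAGES))   # e.g. {'coq-a', 'coq-b', 'coq-c'}
```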
For the training/test split, we had the following goal: no training package should depend on a test package. (So if package x is in the test set and package y depends on x, then y must also be in the test set.)
As for the choice of data and how it affects performance, we didn't do a lot of different ablations on that, but I suspect it will matter in several ways.
As you probably know, co-installability is checked for Platform packages. If you plan to support Coq versions > 8.11, I'd strongly recommend using the Platform as the basis and adding packages on top of it that are installable. This makes it likely that the packages will persist across Coq versions.
I just looked through the dataset a bit, and I think quite a few packages are so specialized and small that it's very probable they don't make a difference. At the same time, I'd say there are at least a few million lines of possibly-useful Coq code in projects that don't have versioned packages in opam. Might be worth an ablative experiment.
Yes, we've had discussions about co-installability before: https://coq.zulipchat.com/#narrow/stream/250632-Coq-Platform-devs-.26-users/topic/Platform.20stats/near/328621722
I'm not immediately sure if I see the advantage of taking a superset of the platform. Note that packages persisting across Coq versions is not good enough. For stability reasons, we would need the exact version of the packages to persist, without any evolution of them.
well, having no differences at all is very unlikely to ever happen
Exactly. So from a machine learning research perspective, there is good value in trying to keep a somewhat stable base of non-moving packages. (Although I acknowledge that this is undesirable from a proof engineering perspective and compromises must be made. The time to upgrade has certainly come now.)
why superset of the Platform? Well, I think we are basically in a place where many releases are actually made for the Platform and wouldn't happen otherwise. Then those releases trigger other releases.
I just looked through the dataset a bit, and I think quite a few packages are so specialized and small that it's very probable they don't make a difference.
Yes, the package sizes have an exponential distribution. Still, there is probably no harm in including small packages.
At the same time, I'd say there are at least a few million lines of possibly-useful Coq code in projects that don't have versioned packages in opam. Might be worth an ablative experiment.
We would love to have more data. Getting data from Opam is just the easiest and most well-curated source (installation instructions come 'built-in'). For our data extraction pipeline, we require Opam packages (not necessarily from the official repo). We simply have not had the time to make these for projects found only on Github.
why superset of the Platform? Well, I think we are basically in a place where many releases are actually made for the Platform and wouldn't happen otherwise. Then those releases trigger other releases.
I agree that Coq Platform has a stabilizing effect on the Opam ecosystem, increasing the number of mutually compatible packages. This is great, also for this kind of research. I'm not sure it is necessary to explicitly select these packages, though. In my experience, a SAT solver will automatically include whatever is in the Platform already, simply by virtue of it being a large co-installable set. So we pretty much agree, I think :-)
I think we agree on some things, but I don't agree with the approach of "trusting" solver-based package selection; see, e.g., https://twitter.com/Blaisorblade/status/1712530105052893600
by the way, I think you should put the README.md content as part of the description (record metadata) instead of only having it as a separate file: https://zenodo.org/records/10028721
I think we agree on some things, but I don't agree with the approach of "trusting" solver-based package selection
Well, we of course check that the package selection makes sense. Sometimes, we indeed have to exclude some 'bad' packages to make the solver behave well. However, my general experience is that Aspcud works quite well (much better than Opam's built-in solver). Once we get to the point of making a new dataset, I'll be happy to consult with the community on what it should look like :-)
by the way, I think you should put the README.md content as part of description
Zenodo's record metadata is quite limited in the kind of rich text it supports. Certainly not Markdown. Do you know of a way to write things in the metadata without it being a nightmare?
(The Zenodo support staff tells me they are working on fixing the internal error (500) for displaying the Readme.md, but no ETA.)
not sure how to transfer Markdown into the description directly, but copying the list of packages at least would be helpful
That's an impressive dataset! http://grid01.ciirc.cvut.cz:8080/
It certainly contains more of the packages I care for than CoqGym. https://github.com/princeton-vl/CoqGym
Is it a strict superset?
Have you tried it on SF? https://github.com/DeepSpec/sf
I.e. should we be worried that our students will use it ...?
Thanks! It is not a strict superset of CoqGym. In fact, there are surprisingly few testing packages in common. The big difference is that we collect packages from Opam while CoqGym gets them from GitHub.
We haven't tried on SF (it's not in Opam). But your students might still use it of course. The whole point of Graph2Tac is that it should generalize to new projects without retraining. And they can also use the k-nearest neighbor model (which is one of the strongest models). The k-NN doesn't require a dataset to learn and will work out-of-the-box on SF.
But neither Graph2Tac nor the k-NN solver are 'overfitted' to SF if that is what you are asking. The solvers do not 'cheat' on SF (even though your students might be cheating by using the models; that depends on your policies of course :-P ).
Yes, I understand. I was mostly suggesting an evaluation framework. I'll try it myself when I find the time...
What packages are pinning the Tactician suite to 8.11?
Tactician consists of the coq-tactician and coq-tactician-api packages. The former is compatible with all modern versions of Coq. The latter is only compatible with 8.11, simply because we haven't ported it to newer versions yet. (There are plans to do this though.) As such, Tactician's built-in models, like k-NN, are available for all versions of Coq. External solvers such as Graph2Tac depend on coq-tactician-api and are currently only available for 8.11.
Following up from the StackExchange discussion (https://proofassistants.stackexchange.com/questions/2660/coq-lean-endpoint-for-gpt-actions):
Can I use (part of) this to integrate GPT-4 with Coq, in such a way that GPT-4 can execute Coq code? Has anyone built a custom GPT action on top of this?
@Robert Rand In principle, you could use our interaction protocols as a 'tool' that can be manipulated by an LLM. I'm not very familiar with GPT-4 actions; however, I suspect that their workflow is too limited to work with this. It looks like OpenAI requires you to run a tool on your own server somewhere. This is different from having OpenAI interact directly with the users' local Coq instance.
As such, if you want to go this route, you'd probably be better off using one of the LSPs, SerAPI, or plain-old Coqtop, and hosting that on your server.
That being said, if I were to do this, I'd probably bypass their action API entirely, and just engineer my own prompts to get GPT to issue commands for running tactics. Then, you can execute those tactic commands locally and give GPT feedback that way. (Although I'm not sure if this is against OpenAI's TOS.)
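For anyone curious what such a loop could look like, here is a rough, hypothetical Python skeleton. query_llm and run_tactic are placeholders for whatever LLM client and Coq interaction layer (coqtop, SerAPI, an LSP, ...) you pick; none of this is part of Tactician's API.

```python
# Hypothetical prompt-and-feedback loop: ask the LLM for one tactic,
# execute it locally, and feed the resulting proof state (or error)
# back into the next prompt.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def run_tactic(tactic: str) -> tuple[bool, str]:
    """Return (proof_finished, new_proof_state_or_error_message)."""
    raise NotImplementedError("plug in your Coq interaction layer here")

def attempt_proof(initial_state: str, max_steps: int = 20) -> bool:
    state, history = initial_state, []
    for _ in range(max_steps):
        prompt = (
            "You are helping with a Coq proof.\n"
            f"Steps tried so far (tactic, feedback): {history}\n"
            f"Current proof state:\n{state}\n"
            "Reply with a single Coq tactic to try next."
        )
        tactic = query_llm(prompt).strip()
        finished, feedback = run_tactic(tactic)
        history.append((tactic, feedback))
        if finished:
            return True
        state = feedback       # either a new proof state or an error
    return False
```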
This is getting off-topic, but note that a GPT action that uses plain-old coqc was recently made: https://coq.discourse.group/t/experiment-coq-code-interpreter-in-chatgpt/2161
@Robert Rand, as @Lasse Blaauwbroek already said, Tactician API is probably not what you want. Nonetheless, let me explain how we connect Coq and an LLM in our tool, in case it interests other people. When doing a proof search, Coq Tactician handles the tree search for a correct proof. At every proof step, it asks our Python server for tactic suggestions. One of our servers, Text2Tac, takes in a text description of the proof state and returns a list of possible tactics (in text form). This is similar to other models like GPT-f, PACT, Lisa, and ReProver. Now, instead of training a model specifically for this task, you could use an existing LLM like GPT-4 or Llemma. You just have to use prompting to explain to the LLM what you want. This is exactly what Sagredo, Copra, and one of the experiments in the Llemma paper do.
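For readers who want a concrete picture of that division of labour, here is a toy Python server that accepts a textual proof state and replies with ranked tactic suggestions. This is not the actual Tactician API protocol, and suggest_tactics is a stand-in for a real Text2Tac-style model:

```python
# Toy "tactic suggestion" server: the proof search itself stays on the
# Coq/Tactician side; this side only maps a proof state to suggestions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def suggest_tactics(proof_state: str) -> list[str]:
    # Placeholder: a real server would query a trained (or prompted)
    # language model here.
    if "=" in proof_state:
        return ["reflexivity", "lia", "auto"]
    return ["intros", "auto"]

class SuggestionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps({"tactics": suggest_tactics(request["proof_state"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), SuggestionHandler).serve_forever()
```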
The Tactician API could be used to replicate the above-mentioned works, and that is indeed the point of the Transformer baseline in the Graph2Tac paper. (Replicating ReProver and Copra would be more work since more information is passed over the interface. Replicating the Llemma experiment in Coq would be interesting, @Sean Welleck.)
I made another thread to have a general discussion: #Machine learning and automation > Connecting Coq and an LLM
@Lasse Blaauwbroek @Jason Rute The proof styles of, say, HoTT and math-comp are very different. Do you expect that including both in one model gives better performance?
@Bas Spitters We don't actually train on HoTT since it is in our test set. (Note that we had to exclude the HoTT results from a lot of our evaluations since CoqHammer doesn't work on HoTT, and since one of our models found a way to use axioms to cheat on HoTT.) But nonetheless, your question still stands. Would we, say, do better on plain Ltac proofs if we didn't train on ssreflect packages, for example? I can't say for certain. My guess is that having more data helps, especially since (at least for the graph models) we restrict the model to predicting only tactics available in the global context. But having said that, I'm not 100% sure it wouldn't hurt performance, and we didn't do any experiments like that. Going the other way, even if it doesn't hurt performance, in some cases this sort of thing could lead to non-idiomatic proofs, e.g., an Ltac proof where ssreflect would be more idiomatic in that library.
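As a tiny illustration of that filtering step (tactic names and scores below are invented; the real models work on graph or text representations, not a score dictionary):

```python
# Keep only tactics that are available in the current global context
# and renormalize the remaining scores.
def filter_to_available(scores: dict[str, float], available: set[str]) -> list[tuple[str, float]]:
    kept = {t: s for t, s in scores.items() if t in available}
    total = sum(kept.values()) or 1.0
    return sorted(((t, s / total) for t, s in kept.items()),
                  key=lambda ts: ts[1], reverse=True)

model_scores = {"intros": 0.30, "move_ssr": 0.40, "lia": 0.20, "auto": 0.10}
context_without_ssreflect = {"intros", "lia", "auto"}
print(filter_to_available(model_scores, context_without_ssreflect))
# [('intros', 0.5), ('lia', 0.333...), ('auto', 0.166...)]
```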
Now, if, as you fear, including very different data in the same training set causes problems, I suspect the issue would be that the model won't understand what context it is in. For example, it might not understand that it is in a setting where ssreflect is not available. (Indeed, GPT-4 has this problem with Lean: it confuses Lean 3 and Lean 4 syntax and libraries.) To fix a problem like this, if it occurs, we could do at least two things. One is to give the model more metadata about the context it is in. For example, if we made a joint Lean/Isabelle/Coq model, we would want a way to tell the model which theorem prover we are using. But in our setting, I'm not sure what sort of metadata would be helpful. The other possible solution is to make a model which can see recent proofs. Our Graph2Tac model is already online, in that it takes into account new definitions, but it would be even better if (like our k-NN comparison model) it also took into account recent proofs in the same file or package. Then the model would be able to draw on both its extensive training data and the style and tactics of recent proofs.
I see HoTT and fx iris here:
http://grid01.ciirc.cvut.cz:8080/
Did I misread?
Aside, there was a nice presentation about automation for iris here at POPL. People were wondering whether tactician would work for iris (embedded separation logic too).
@Bas Spitters HoTT is in our dataset, but we only trained on about 90% of our dataset and tested on the rest (splitting by whole packages). Of the three Iris packages, two were in training and one was in the test set. You can see the package-wide results on page 21 of the current version of the paper.
You can see our performance there on coq-iris-heap-lang. The online k-NN (which is similar to the default model in Tactician, and as stated above, works for new versions of Coq) solved about 20% of theorems in 10 mins per theorem. Our Graph2Tac model solved about 17% or so (just eyeballing the plot) in 10 mins per theorem. Note that the online k-NN works by looking at recent proofs. If there are a lot of duplicate proofs, that would help that model, but your mileage may vary as to how useful that would be in practice. The best thing, however, would be to just download Tactician and try it out!
And to be clear (since you mentioned Tactician), Tactician and the online k-NN we use in the paper are not trained in the traditional sense, so they don't use this dataset at all. Instead, they are online models that can instantly see proofs in the global context. (The default model of Tactician can see all proofs in the global context, and the online k-NN in the paper is restricted to the N most recent tactics where I forgot the value of N.) So if, say, you use Tactician in HoTT or iris-heap-lang you will find that it instantly adapts to your package and that it knows nothing about other packages. (The Graph2Tac model, however, is more traditionally trained. It is trained on the training packages from this dataset, but it also has the ability to adapt to new definitions in the global context. However, unlike the k-NN it can't learn from new proofs.)
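For intuition, here is a rough sketch of what an online k-NN of this kind could look like in Python. The bag-of-tokens features, Jaccard similarity, and window size are illustrative choices, not the representation Tactician actually uses.

```python
# Online k-NN tactic predictor: never trained offline, it just remembers
# (proof state, tactic) pairs from recent proofs and suggests the tactics
# of the most similar stored states.
from collections import deque

class OnlineKNN:
    def __init__(self, k: int = 5, window: int = 1000):
        self.k = k
        self.memory = deque(maxlen=window)      # only recent pairs

    @staticmethod
    def _features(proof_state: str) -> frozenset:
        return frozenset(proof_state.split())   # crude bag of tokens

    @staticmethod
    def _similarity(a: frozenset, b: frozenset) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def record(self, proof_state: str, tactic: str) -> None:
        """Call after every successful tactic step in recent proofs."""
        self.memory.append((self._features(proof_state), tactic))

    def suggest(self, proof_state: str) -> list[str]:
        feats = self._features(proof_state)
        ranked = sorted(self.memory, reverse=True,
                        key=lambda item: self._similarity(feats, item[0]))
        return [tactic for _, tactic in ranked[: self.k]]

knn = OnlineKNN(k=2)
knn.record("forall n : nat, n + 0 = n", "induction n")
knn.record("0 + 0 = 0", "reflexivity")
print(knn.suggest("forall m : nat, m + 0 = m"))  # ['induction n', 'reflexivity']
```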
(@Bas Spitters What was the name of Iris automation presentation at POPL? Do you have any links?)
People were wondering whether tactician would work for iris (embedded separation logic too).
I've had some private correspondence with @Robbert Krebbers about this in 2022. I ran the k-NN model on some projects related to Iris at the time. These are mostly the Iris projects themselves, not projects that are based on the Iris heap language. In principle, there is no reason why the k-NN would not work on the special tactics related to Iris.
For G2T, I'd expect it to be much less useful. Iris' custom tactics use strings to refer to local context items. G2T cannot predict those. One would probably have to build in custom support to make that happen. A language model, like Text2Tac, might also do reasonably well here.
coq-stdpp : 1360 out of 3189 lemmas = 42.6%
coq-iris : 1822 out of 5051 lemmas = 36.1%
coq-iris-heap-lang: 75 out of 352 lemmas = 21.3%
It was this presentation:
https://ikemulder.nl/media/papers/cpp24-subformula-linking-preprint.pdf
@Lasse Blaauwbroek The Tactician website mentions that Tactician is orthogonal to hammers.
Have you compared or combined it with preprocessing and/or SMTCoq?
https://inria.hal.science/hal-03901019v2
I always thought "general purpose" should be read in opposition to "domain specific". Arguably, neither SMTCoq nor Itauto is domain-specific.
since CoqHammer has never claimed to help with induction proofs, it's not general in that sense
Perhaps I'm misunderstanding some part of SMTCoq. But if you look at SMTCoq's USAGE.md, they talk about solving goals of a pretty specific form. That doesn't look 'general purpose' to me.
Regarding CoqHammer: it indeed does not do induction, and arguably that disqualifies it from being general purpose. However, the CoqHammer authors themselves have evaluated CoqHammer on a wide set of theorems, so I feel it has become 'fair game' to do this. (Also, CoqHammer was thought for a long time to be the best solver around, regardless of its inability to do induction.)
To clarify, it's not that I'm unwilling to do a comparison with SMTCoq. But I'd like to do that after consulting with its authors, to see what the fairest setting for that would be.
if you use a goal preprocessor like Trakt, it [SMTCoq] seems pretty general to me
@Chantal Keller @Enzo Crance or @Assia Mahboubi may want to comment on Trakt/SMTCoq
I'm not an expert on Trakt. Indeed, this is general purpose. However, as far as I know, Trakt requires users to configure it to work within their domain, for example with Trakt Add Embedding. Wouldn't it be quite challenging to do that automatically on a bunch of arbitrary Coq projects?
Perhaps, that could be something done by machine learning ??
Yes. It appears that some kind of alignment learning might be useful here. I'm for example thinking of Thibault Gauthier's work on aligning concepts. (But now we are talking about an entire line of research, and not about a simple quantitative comparison of automation...)
if we're doing "logic" preprocessing for SMTCoq (just to get goals using logical connectors and the like into a form supported by SMTCoq), I think that is already implemented in Trakt
at least that was my impression after I talked to the main Trakt author, but I may have misunderstood
Itauto already does its own similar preprocessing and has some typeclasses one can register stuff like decidable equalities with