Hi all, I am co-organizing a free virtual workshop on AI for Mathematical Reasoning the next few days (June 12th-14th) with the (US) National Academies. 1900 people are coming so far, and this includes a fair number of PL/FM speakers (e.g., Thierry Coquand, Greg Morrisett, Brigitte Pientka, Aleks Nanevski, Moshe Vardi) and some general all-stars (moderators besides me include Terence Tao and Yann LeCun; other speakers include Sean Welleck and Stella Biderman).

There is a strong proof assistant angle to all of this. The goal seems to be for the NSF to pour funding into AI for Math, with formal proof as one application area of interest. Please sign up if interested: https://www.nationalacademies.org/event/06-12-2023/ai-to-assist-mathematical-reasoning-a-workshop

Talia

(Admittedly, I should have mentioned this earlier, but have been a bit busy. The timing also may be hard from France. But we'd love anyone who can make it to join and watch the talks and ask questions via Slido.)

It's a bit late, but maybe it's still worth an announcement on Coq-Club + Discourse?

@Talia Ringer Exciting! I've signed up, but will not be able to follow all the talks due to the time difference.

Any chance the lectures will be recorded, slides collected?

Jeremy mentioned this paper which gives an interesting approach to using LLMs.

https://arxiv.org/abs/2210.12283

Coq does not seem in focus for these research groups. I'm not entirely sure why?

Missing declarative language? Less development of hammers? ...?

I think we have been able to show every time this question comes up that it isn't really for technical reasons (e.g., ecosystem, maturity, features, datasets, stability, etc., typically favor Coq on the margin)

but as seen in the discussion on Nuprl, decisions to use some software technologies can have "local" context

with that said, we could definitely market the Platform better for LLM/learning. 2M+ lines of code, guaranteed/tested to work, releases about every 6 months.

Maybe @Talia Ringer can plug it at the meeting :-)

They claim the largest corpus is in Isabelle, but measure it at 0.6GB which isn't the metric I was expecting

yeah, the Platform and AFP aren't really straightforwardly comparable since the Platform by design is a "reusable fragment" of all Coq theories/sources. Then it'd be more fair to compare the Coq opam archive for a specific Coq version (not the latest, please) with Isabelle AFP for a specific version

Coq does not seem in focus for these research groups. I'm not entirely sure why?

@Bas Spitters there is a pretty nice draft https://dependenttyp.es/pdf/repairdata.pdf that outlines some of the problems Coq has. I think that the main problem is that the required infrastructure is just not a priority for the team / lack of resources to work on it (both in a sense are the same)

I've also heard it suggested, by folks that have worked on both Coq and Isabelle ML proof projects, that Isabelle proofs tend to spell out more of the proof state in the syntax of the proof script, whereas in Coq that stuff is a bit more implicit. The former lends itself better to vanilla language models, whereas the latter usually requires some sort of two-stream model (the states and the script).

there is no computation / Poincaré principle in HOL, I guess that's what they mean? The proof "term" will always be as big as the "proof".

if they are referring to idiomatic Isar proofs, something very close to those can arguably be done in SSReflect/MathComp, but there isn't really a use for that in most SSR/MC projects (computation/reflection is pretty nice to have around)

anyway, people seem to be using HOL Light just fine for learning, and those proofs [bundled in HOL Light] are *not* Isar proofs (they even use `EVAL` quite a lot, which hides stuff)

For me an important takeaway was the talk by Stanislas Dehaene, who more or less compared the performance of humans, AI, and non-human primates. As far as I understand, what is missing to make AI more powerful is a world model that the AI can manipulate and thereby "understand". There are many examples where e.g. ChatGPT can do things by "manipulating" other tools. E.g. my kids recently sat on the sofa and asked ChatGPT to write SMT code to solve certain logic riddles - which it did just fine and correctly - for riddles which were otherwise completely beyond it. Similar to Stanislas's suggestion to ask AI to "create Python code to draw a green triangle left of a blue circle" instead of asking a Stable-Diffusion-like AI to draw such an image directly.

Coming to the Coq vs HOL in AI question: I expect that a system with an inherent idea of computation is likely more suitable as a world model to be manipulated by AI than a purely axiomatic system. I think it is harder to "play" with axioms than with something which computes. A computational system is IMHO more like bricks - if you kick a bottom block it will fall down. In an axiomatic system you somehow have to know that this will happen. So it is more difficult to build up intuition about it.

@Talia Ringer : thanks for sharing - I find the talk series mostly very interesting!

In fact, the lecture videos are available from the past two days.

Some live blogging:

Avraham Shinnar, IBM Research talked about formalization of Machine Learning in Coq:

https://github.com/IBM/FormalML#readme

He also had some good points/suggestions about how to integrate LLMs into VS Code (for Coq).

@Karl Palmskog Wenda Li says:

I believe it is mainly because many of the Coq proofs are scattered in different repositories -- they may even not be compatible with the same version of Coq.

As for the size of available proofs, Lean has 1M LOC in mathlib and Isabelle has around 6M LOC (4M in the AFP, 0.5M in the standard library and 1.5M in the seL4 project)

Maybe you can reach out to him regarding the statistics?

not sure there is a lot of point if those are the starting assumptions.

I'm assuming he just does not know how the Coq Platform works. I'll contact him.

I'd be open to setting up some kind of real comparison of lines of code for Coq Platform + active Coq projects, for a single version of Coq - compared to say, AFP and Mathlib.

For example, we already measured *core* MathComp projects at around 200k LOC in 2019 (excludes analysis, etc.) for our non-proof learning ITP conference paper (https://arxiv.org/pdf/2004.07761.pdf). When analysis and others from extended MathComp family were added, this becomes upward of 325k LOC (four years ago)

You could also add ssprove.

Regarding the metric: Russell O'Connor once proposed to count the gzip size of a repo instead of the LOC, as it is a better metric for the information content...
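That metric is easy to approximate; a minimal sketch in shell, run from the root of any Coq repo (the `/tmp/all.v` scratch path is just illustrative):

```shell
# Compare raw LOC against gzip-compressed bytes for all .v files under the
# current directory ("/tmp/all.v" is just a scratch path).
find . -name '*.v' -exec cat {} + > /tmp/all.v
wc -l < /tmp/all.v          # lines of code
gzip -c /tmp/all.v | wc -c  # compressed bytes: a crude information-content proxy
```

Since gzip deduplicates repeated boilerplate, a heavily copy-pasted corpus would score much lower on the second number than its LOC suggests.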

In fact, @Wenda Li has an account here.

Karl Palmskog said:

For example, we already measured *core* MathComp projects at around 200k LOC in 2019 (excludes analysis, etc.) for our non-proof learning ITP conference paper (https://arxiv.org/pdf/2004.07761.pdf). When analysis and others from extended MathComp family were added, this becomes upward of 325k LOC (four years ago)

I did some quick `wc -l` on repos in the math-comp org + some math-comp repos in coq-community + ssprove and I end up with a total of 524k LOC.
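A count like that is easy to make reproducible with a small script; a sketch, where the directory names are placeholders for local checkouts:

```shell
# Sum lines of .v files across locally cloned repos (names are placeholders).
total=0
for repo in mathcomp reglang ssprove; do
  [ -d "$repo" ] || continue          # skip repos that aren't checked out
  n=$(find "$repo" -name '*.v' -exec cat {} + | wc -l)
  echo "$repo: $n"
  total=$((total + n))
done
echo "total: $total"
```

The per-repo breakdown also makes it easier to spot double counting when repos vendor each other's sources.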

Karl Palmskog said:

not sure there is a lot of point if those are the starting assumptions.

Apologies for any wrong assumptions I have regarding the Coq ecosystem :rolling_on_the_floor_laughing: I was indeed not familiar with the Coq Platform before. My impression of ML for Coq comes from a previous paper: https://arxiv.org/abs/1905.09381, where the authors spent quite an amount of time collecting data from different Coq repositories.

Sources for Isabelle statistics are here: AFP and seL4 (they claim over 1M LoC, but the last time I checked it was like 1.3M or something). The standard library of Isabelle is here; I calculated it myself a while ago.

Thanks. :-)

Do we have any statistics on the verbosity of Isar vs e.g. SSReflect? I'd say SSReflect is much more concise, but I'm not sure how Sledgehammer influences that.

@Wenda Li OK, I see, just to give a quick overview of the Coq ecosystem:

- the Coq Platform has a bit above 2M LOC in the last release, but that is basically *utility libraries* and plugins and other immediately reusable artifacts
- the Coq opam archive typically has 100-200 packages per Coq version, which overlaps with the Coq Platform. We didn't summarize the LOC count, but my guess would be at least 1-2 million LOC more.
- there are many big Coq projects that do not have packages but are in Coq's CI, which had 2.5 million LOC when we last checked (overlaps with the Coq opam archive and Platform)
- almost all 60 projects in Coq-community work with the latest released version of Coq (again some overlap with the Platform and Coq opam archive)

Unfortunately, the authors of ASTactic/CoqGym were not familiar with Coq's ecosystem (and we did not have the Platform yet)

@Bas Spitters That is a good question, but I am afraid I don't have an answer yet. My impression of Isar is that it is indeed more verbose than tactic-style proofs, and people in the Isabelle community tend to prefer structured proofs that resemble a paper proof rather than making them short. People in the Lean community, on the contrary, prefer short proofs a lot more often (they 'golf' existing proofs).

@Karl Palmskog Thanks for these statistics. How easy would it be to deduplicate those? I guess those statistics could be updated automatically and e.g. published on Awesome Coq or similar...

from what I remember @Michael Soegtrop has a script that he uses to compute LOC for the Platform, we have discussed this before here

ah, this one for example: https://github.com/coq/platform/blob/main/maintainer_scripts/count_lines_of_code.sh

we may be able to set up this to run in Platform CI if not already done. My proposal is that this could go in the Coq Platform release notes, e.g., at the bottom of https://github.com/coq/platform/releases/tag/2022.09.1

Regarding verbosity, this might be a start https://www.cs.ru.nl/~freek/100/

Although, this would still be very rough, and we should probably exclude classical theorems, as there will be an overhead for those in type theory.

with SerAPI, one can do some interesting things like computing number of tokens from Coq's lexer per sentence. This could be used on SSR proof scripts.
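Short of SerAPI's lexer, even a crude shell approximation gives a first verbosity signal, e.g. the number of lines between `Proof.` and `Qed.` per proof (a sketch that assumes those keywords sit alone on a line and ignores `Defined.`/`Admitted.`; the toy `file.v` is made up here):

```shell
# Toy input: two proofs, of 1 and 3 lines respectively.
cat > file.v <<'EOF'
Lemma a : True.
Proof.
exact I.
Qed.
Lemma b : True.
Proof.
pose proof I.
pose proof I.
exact I.
Qed.
EOF
# Print the body length (in lines) of each Proof. ... Qed. block.
awk '/^Proof\.$/ {inproof=1; n=0; next}
     /^Qed\.$/   {if (inproof) print n; inproof=0; next}
     inproof     {n++}' file.v
# prints: 1, then 3
```

Run over a whole corpus, the resulting distribution could be compared between SSReflect and Isar sources, though token counts per sentence (as SerAPI can provide) would be far more faithful than line counts.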

@Karl Palmskog Thanks a lot for sharing those statistics -- I now see the huge potential of the Coq corpus for the ML folks :-) One way the Coq community could help the ML folks is perhaps to provide an easy and usable platform for carrying out ML experiments, so that people can easily interact with the Coq kernel via languages like Python. This is really the most challenging part for the ML researchers.

@Wenda Li could you be more precise about what would be needed? I.e. what is provided by Lean and Isabelle and not by Coq?

@Wenda Li we do have projects like these:

- https://github.com/ejgallego/pycoq (accessing Coq directly from Python building on the OCaml bridge from/to Python)
- https://github.com/ejgallego/coq-serapi (serialization/deserialization of Coq's internal datastructures)
- https://github.com/ejgallego/coq-lsp (LSP server and protocol)

Bas Spitters said:

Regarding verbosity, this might be a start https://www.cs.ru.nl/~freek/100/

Although, this would still be very rough, and we should probably exclude classical theorems, as there will be an overhead for those in type theory.

When Larry Paulson attempted to port libraries from HOL Light to Isabelle, he mentioned that many of the long HOL Light proofs can be one-lined by Sledgehammer in Isabelle. So Sledgehammer might slightly reduce the verbosity introduced by Isar proofs.

unfortunately, I think we have not succeeded in getting the word out to machine learning researchers. I have co-written papers on the Coq Platform and using tools like SerAPI for learning, but not easy to reach the right audience:

- https://arxiv.org/abs/2004.07761 (Deep Generation of Coq Lemma Names Using Elaborated Terms, ITP '20, uses SerAPI + Coq + RNNs, builds MathComp corpus)
- https://arxiv.org/abs/2203.09835 (Reliably Reproducing Machine-Checked Proofs with the Coq Platform, overview of Platform)

@Wenda Li yes, that's a good example. Can you get some more precise statistics from that? I understand the mathematical content of the libraries is really very comparable.

Bas Spitters said:

Wenda Li could you be more precise about what would be needed? I.e. what is provided by Lean and Isabelle and not by Coq?

In Isabelle, we have a Python interface to interact with the kernel. Similar pipelines are in Lean.

Also, people in the ML community value benchmarks to evaluate their progress. miniF2F was an attempt to build a cross-platform benchmark for high-school competition problems, but AFAIK the Coq track is not finished because they didn't have experts to label the problems.

Karl Palmskog said:

Wenda Li we do have projects like these:

- https://github.com/ejgallego/pycoq (accessing Coq directly from Python building on the OCaml bridge from/to Python)
- https://github.com/ejgallego/coq-serapi (serialization/deserialization of Coq's internal datastructures)
- https://github.com/ejgallego/coq-lsp (LSP server and protocol)

Thanks again. That is something I didn't know before! I see we are in a better world now :-)

I looked at miniF2F, but I think one problem (that became more apparent later) was that it wasn't really a community project but based at OpenAI, and is seemingly unmaintained after departure of creators from OpenAI. We may need to think about infrastructure that could support inter-proof-assistant collaborations in the long term (the Dedukti people probably have the most experience on this)

@Wenda Li I try to track all important Coq projects and tools, including for machine learning here: https://github.com/coq-community/awesome-coq

@Karl Palmskog seems like a good question to raise after the next panel on slido. I'll be switching trains, so won't be able to slip it in.

Maybe @Wenda Li can raise it too.

@Wenda Li who are the key AI people the coq community should reach out to to clarify what is currently available?

Bas Spitters said:

Wenda Li yes, that's a good example. Can you get some more precise statistics from that? I understand the mathematical content of the libraries are really very comparible.

The exact comparison between HOL Light and Isabelle proofs might take a while to develop, but I would say most of the HOL Light math library is now in Isabelle.

Another issue of ITP for ML is that few ML folks understand program (language) reasoning, which is something Coq excels at. For pure math in Isabelle, my rough impression is that 1-1.5M LoC are purely about math, though some of the program/data-structure verification projects involve quite a bit of math as well.

Bas Spitters said:

Wenda Li who are the key AI people the coq community should reach out to to clarify what is currently available?

@Talia Ringer is certainly a key person well-known in both communities :-)

Karl Palmskog said:

I looked at miniF2F, but I think one problem (that became more apparent later) was that it wasn't really a community project but based at OpenAI, and is seemingly unmaintained after departure of creators from OpenAI. We may need to think about infrastructure that could support inter-proof-assistant collaborations in the long term (the Dedukti people probably have the most experience on this)

That is true. I hope we academic folks will be able to build more such benchmarks that would be meaningful to ITP people :-)

Bas Spitters said:

Karl Palmskog seems like a good question to raise after the next panel on slido. I'll be switching trains, so won't be able to slip it in.

Maybe Wenda Li can raise it too.

I am probably not a good candidate to raise questions on behalf of the Coq community, but I believe @Karl Palmskog can certainly clarify many of the misconceptions (on Coq) to the ML community.

programming language property certification is probably an understudied problem in AI. The proofs are typically long and tedious but don't require as much insight as modern math. Yet, the payoff in terms of monetary value can be huge (guarantees about all programs in a language)

my understanding of (deep) AI is not really great, but one of my PhD student collaborators who did his PhD on AI + programming languages (including Coq) is becoming an assistant professor later this year. Maybe with some help from him I can try to reach out, my bandwidth tends to be a bit limited these days

Karl Palmskog said:

programming language property certification is probably an understudied problem in AI. The proofs are typically long and tedious but don't require as much insight as modern math. Yet, the payoff in terms of monetary value can be huge (guarantees about all programs in a language)

Totally agreed. I personally believe many of deep learning problems will be solved via gaining some insights from the PL community.

Karl Palmskog said:

my understanding of (deep) AI is not really great, but one of my PhD student collaborators who did his PhD on AI + programming languages (including Coq) is becoming an assistant professor later this year. Maybe with some help from him I can try to reach out, my bandwidth tends to be a bit limited these days

I will be joining Edinburgh as a lecturer this September working on AI and ITP (on Isabelle for now but I plan to expand). Always feel free to reach out :-)

thanks, will try to coordinate with my collaborator - maybe we can whip up some kind of "Coq for AI researchers" pitch

with that said, I also heavily use HOL4 and CakeML [for PL/FM research]

What's the status of HOL4 and isabelle integration at the moment? I was at a Dagstuhl seminar some years ago where people were working on this.

I'm not aware of much that is practical. CakeML has a verified checker for https://www.gilith.com/opentheory/

Bas Spitters said:

What's the status of HOL4 and isabelle integration at the moment? I was at a Dagstuhl seminar some years ago where people were working on this.

The Isabelle people are mostly interested in the CakeML project: https://www.isa-afp.org/entries/CakeML_Codegen.html. Having a verified code generator or perhaps one day running Isabelle on top of CakeML would be extremely exciting. Nevertheless, things are still in progress as far as I know...

for POPLMark there was the opposite of many modern benchmarks: tons of different solutions in Coq, not so many for other ITPs. One might speculate the target domain drives the enthusiasm

@Karl Palmskog : the issue with running the script in CI is that currently the CI builds are not complete (it would take > 6 hours on the free GitHub machines). But I can do it as part of my (local) release builds.

just running stats for release builds would be fine by me, any LOC numbers that are reproducible and can be easily referred to is good [for marketing]

another PL-centric benchmark for Coq/Isabelle comparison:

The latter is still missing pretty much the entirety of chapter 4, unfortunately; I still hope to come back to it one of these months :)

One issue is that now I also need to port it to MC2 :)

I count only 20 simple instances so I'd expect the port to be very easy (the porting tutorial should literally cover almost all cases). Also note that there is no hurry, we will keep maintaining MC1 for some time.

Karl Palmskog said:

another PL-centric benchmark for Coq/Isabelle comparison:

Interesting. What's the comparison between those in LOC? Hoping there's a possibility to isolate chapter 4

Bas Spitters said:

Interesting. What's the comparison between those in LOC? Hoping there's a possibility to isolate chapter 4

You'd need to collect all the background AFP items they refer to, since they typically pick out subsets of their previous data structure formalizations for individual (sub)chapters.

e.g. section 5 refers to these:

http://isabelle.in.tum.de/library/HOL/HOL-Data_Structures/Tree_Set.html

http://isabelle.in.tum.de/library/HOL/HOL-Data_Structures/Tree_Rotations.html

http://isabelle.in.tum.de/library/HOL/HOL-Data_Structures/Interval_Tree.html

The whole book essentially functions as a compendium of these formalizations they have been doing for many years, I think this paper was a precursor to the book:

https://www21.in.tum.de/~nipkow/pubs/atva20.pdf

Also a head-on comparison wouldn't be entirely fair, since as a sort of learning exercise I didn't use any automation beyond the reflection machinery for basic data structures like lists; I'm pretty sure there are many places that can be collapsed into `lia`, `ring` and friends.

Emilio Jesús Gallego Arias said:

Coq does not seem in focus for these research groups. I'm not entirely sure why?

Bas Spitters there is a pretty nice draft https://dependenttyp.es/pdf/repairdata.pdf that outlines some of the problems Coq has. I think that the main problem is that the required infrastructure is just not a priority for the team / lack of resources to work on it (both in a sense are the same)

Many of the suggestions seem to be about package managers and large archives. How much of this is still relevant when taking into account the work done by coq platform/coq community?

no matter how well we do with the Coq Platform and Coq-community, there will be a significant chunk of Coq projects that are not immediately reusable or conveniently available (research artifacts and the like) that do their own thing w.r.t. tooling, compatibility, etc. For example, Coq-community aims for reuse, so we spend time on builds/CI/releases/documentation to this end. If you look at places like Zenodo, all you may get is a VM and a small readme.

I don't think this will change anytime soon, and the improvements for Platform and Coq-community projects to packaging and build processes are gradual over many years...

But didn't we agree that the size of the Platform would already make a great training set? I'm surprised it's not mentioned in the article (@Talia Ringer?)

If people start training LLMs on the Platform, then that might also become a reason for people to include their software there.

I agree that the Platform would make a good basis for training most machine learning applications. However, it's not comparable to Isabelle's centralized AFP, since they include every kind of library (both reusable libraries and research applications such as specific math). We will likely not ever do anything resembling AFP.

the Platform paper compares in detail with the AFP (p. 8)

including something in the Platform requires making a commitment to maintain it, including something in the AFP means someone else will maintain it (minimally)

How big is the difference between platform and Coq CI at the moment? I would imagine that the whole of platform would be good to have in CI.

Also, are there any generally useful packages in CI which should (morally) still be included in platform?

I think there are good reasons for some Platform projects to not be in CI, since they may want to adapt to breaking changes at their own pace, and be compatible with several Coq versions. Also, MathComp has its own CI that tests Multinomials and some others not in CI.

In Coq-community, I can see RegLang, CoqEAL that are in Platform but not in CI (but are in MC CI)
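To get a rough picture of such overlaps, one could diff the package lists with standard tools; a sketch, where `platform.txt` and `ci.txt` stand in for hypothetical one-name-per-line exports of the Platform and Coq CI package sets (the package names below are made up for illustration):

```shell
# Hypothetical one-package-per-line exports of the two sets:
printf 'coqeal\nreglang\nstdpp\n' > platform.txt
printf 'itauto\nstdpp\n' > ci.txt
sort platform.txt > platform.sorted
sort ci.txt > ci.sorted
comm -23 platform.sorted ci.sorted   # only in the Platform
comm -13 platform.sorted ci.sorted   # only in Coq's CI
```

`comm -12` on the same sorted files would give the overlap itself, which is the part whose LOC would be double counted in any naive total.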

to give an idea to others reading this what is *not* in the Platform and how this can sometimes be addressed without adding new Platform projects (e.g., maintainers do not want to give commitments or contents do not fit the Platform scope), we have a general "result consolidation" issue: https://github.com/coq-community/manifesto/issues/143

I notice that a lot of the Iris ecosystem seems to be missing, e.g. wasm-iris, Melocoton, Aneris, ... I'll ping my colleagues ... They might be in other places.

I think it would not be that much more effort to have a substantially larger platform which is then always a release behind - for many applications this doesn't matter much. In the past most of the platform maintenance work was OS platform specific and not package specific.

The main thing we need is a different test infrastructure, but this we already need for Coq Platform as is - I can't do full builds on GitHub any more.

@Michael Soegtrop I think it would take significant effort to curate projects and ask maintainers for tags if we go beyond "utility" projects in the Platform. I think they set up the AFP partly as a funding-attraction vehicle, so formalizations can get citations (academic Pokemon points). The incentives for doing this in the Coq community are weaker, not least through Inria's base funding for Coq development.

if you make a Coq-related project that claims to be generally useful, there is at least nominally a reason to get it into the Platform. But many formalizations of exotic math and type theory have very small chance of actual reuse. This shouldn't prevent people from curating large "snapshots" of Coq code as extensions of a Platform release, but to curate "leaf" formalizations on an ongoing basis (without strong incentives/funding) seems wasteful

Building these leaf projects to make a Windows or macOS installer in particular would probably be a waste of time, but consolidating them with compatible versions and building them all together can make sense. It seems to me that this would be a good application of the Coq universe project: https://coq.zulipchat.com/#narrow/stream/327010-Coq-Universe / https://github.com/ejgallego/coq-universe

The upcoming Leiden meeting by @Assia Mahboubi and others will also consider the topic of mathematical libraries:

https://www.lorentzcenter.nl/machine-checked-mathematics.html

Bas Spitters said:

Emilio Jesús Gallego Arias said:

Coq does not seem in focus for these research groups. I'm not entirely sure why?

Bas Spitters there is a pretty nice draft https://dependenttyp.es/pdf/repairdata.pdf that outlines some of the problems Coq has. I think that the main problem is that the required infrastructure is just not a priority for the team / lack of resources to work on it (both in a sense are the same)

Many of the suggestions seem to be about package managers and large archives. How much of this is still relevant when taking into account the work done by coq platform/coq community?

I think all of it is still relevant.

Indeed I wouldn't bother to port such developments to Mac and Windows, beyond informing people if it works or not.

But I don't think it is that much effort, as long as one doesn't insist that everything works. Just informing authors what works and what doesn't might be sufficient for a good fraction of the projects. I am not an academic, but I guess Coq developments also get more citations if they are reproducible on a reasonably recent Coq.

Btw.: any thoughts on my previous remark that proof assistants with an inherently executable logic might be more suitable for reinforcement learning than axiomatic systems?

I imagine the difference between axiomatic systems and CIC like systems would be like between learning fraction arithmetic with ℝ and with ℚ in Coq. Would you follow this argument?

If we were talking about systems which could carry out the execution of that logic then this would be plausible. The vast majority of research right now in reinforcement learning seems to be going into neural networks and I'm not impressed by their ability to "execute code". Some work has been done on this in neural turing machines and differentiable inductive logic programming.

@Michael Soegtrop your idea resonates with the work on plugins for LLMs. E.g. LLMs are bad at computations, but with plugins they can learn to use a CAS (such as wolfram alpha) for the answer. One could imagine using Coq as a similar CAS.

Not sure we are on the same line. My line of thought is that a reinforcement learning network is put in an environment which reacts in complex ways. Think of kids building towers of bricks and kicking them down. This must be distinguished from reasoning about or proving something about such a system - which is a second step. First the network needs some sort of "intuition" about the system. In my view the "system" could be executable Coq code the network creates, like kids create towers of blocks. I wonder what the "manipulatable reactive system" would be in an axiomatic proof assistant. It looks to me like teaching the behaviour of blocks to a 2-year-old by explaining Hamiltonians to them.

So it is not about having an LLM say using an SMT solver to solve logic riddles - it is more about having a neural network in some sense "understand" the systems it is manipulating. For the stuff I am doing I cannot imagine how one could possibly come up with proofs without doing some concrete experiments. Well most proofs are fairly schematic and goal driven - the interesting ones are not.

@Patrick Massot has an impressive (I understand automatic) informalization tool for Lean:

https://www.imo.universite-paris-saclay.fr/~patrick.massot/Examples/ContinuousFrom.html

Are there any more details available? Could a similar tool work for Coq, or is it tied to more declarative languages like Lean's?

closest thing I know in this direction: https://github.com/impermeable/coq-waterproof

Mathematicians unfamiliar with the Coq syntax are able to read the resulting proof scripts.

I understand Patrick starts from a regular lean proof. Waterproof defines a new proof language.

There is/was also Czar for Coq, but my question is more about the automatic informalization.

https://link.springer.com/chapter/10.1007/978-3-540-68103-8_5

until recently, Waterproof was only using custom tactics and the like, so not a proof language. But they apparently refactored it into a plugin.

I would still call this a proof language.

OK, I guess this is more in the required direction then? https://github.com/andrew-bedford/coqatoo

Generates natural language versions of Coq proofs

Not updated since 2018 though.

https://leandojo.org/ might be relevant for this discussion, and CUDW

@Talia Ringer was mentioning this paper at this week's CUDW, thanks @Bas Spitters

It'd be great to see how it compares to Coq's plans

I recently noticed this thread, and I'd like to add my two cents. There are a number of projects doing machine learning for Coq. The three (publicly available) families I know are:

- The CoqGym family of projects: including the original ASTactic prover by @Kaiyu Yang (Gitter import) as well as many extensions led by @Emily First including TacTok, DIVA and Passport.

- ProverBot9001 by @Alex Sanchez-Stern which is a pretty good tool. It is even publicly available through the Proofster tool led by @Arpan Agrawal which is both a website and Coq plugin.

- Tactician by @Lasse Blaauwbroek. This tool can be opam installed, and uses a fast online learning algorithm so it adapts instantly to your project. I think this project is very underrated. It is certainly one of the better tools out there.

I think there are a few things currently holding Coq back as a platform for machine learning research:

- I think the reason a lot of machine learning research is done in Isabelle is not just because of the Isar style, but more so because Albert Jiang (in collaboration with @Wenda Li?) has put a lot of effort into developing tools for doing machine learning in Isabelle. These tools have the support of both Larry Paulson and machine learning collaborators in industry and in academia. I think Coq is yet to have something comparable, namely a central tool for machine learning in Coq, and maintainers who are willing to work with a large chunk of machine learning researchers to make that tool fit their needs.

- Because of the unified machine learning framework for Isabelle, it is much easier to compare to other similar works (even if the metrics are often quite flawed). In machine learning, one really wants to show that one is improving on other works in the field. For example, in Isabelle there are two big test datasets, miniF2F and the PISA dataset (which is a selection of Isabelle theorems not seen during training). For Coq, there isn't anything as canonical as that. So it is hard to compare projects. Even comparing ProverBot9001, Tactician, and the CoqGym family of projects is a challenge. It is not clear at all which one is better. (Although, this comparison by @Alex Sanchez-Stern is a start.) And that is just within Coq. Comparing to the vast number of projects in Isabelle (and the few projects in Lean) is basically impossible except by reimplementing them first in Coq.

- Coq's decentralization is a real challenge. Extracting data is much harder because of it. It has already been mentioned that the CoqGym project didn't really know how to work with Coq's ecosystem to get a full set of Coq data. Also, every Coq project is its own special snowflake with a different version of Coq and different idiosyncrasies and even logical inconsistencies. Even though Coq probably has the largest amount of data, extracting data is a real challenge compared to Lean and Isabelle which have one or two main libraries. It also presents a real challenge for testing, especially if you want to test on different projects than you trained on, as CoqGym does.

- There is also a chicken-and-egg problem (and this is not specific to Coq). These systems are pretty crappy so far, but also rapidly improving. Is it worth putting effort into making high-quality interfaces for them right now? On one hand, tools which are used are then more likely to be improved and built upon. On the other, if no one wants to use it, then it is a waste of time. Ideally, these tools should be as easy to install and as user friendly as possible. Every extra step is going to dissuade half the user base. Ideally they should just come packaged with Coq, like Sledgehammer is packaged with Isabelle and the Meta-created LLM tactic suggestion tool is packaged with Lean 3, or barring that, they should at least be as easy to install as, say, the GitHub Copilot plugin. (I know I'm purposely brushing a number of complexities under the rug, and this may be more of an ideal than what is possible with the current ecosystem.) I think @Avi Shinnar's talk is exactly right about how to make the user experience of these tools better.

I should also mention that there has always been a large need for non-machine-learning people to get involved in this field. For example, every few months some machine learning researcher comes on the Lean Zulip asking how to do machine learning for Lean. I point out the existing tools, but also mention that the best-case scenario is partnering with a Lean meta-programming expert. Moreover, with the vast number of public LLMs out there, some through paid APIs and some totally open source like ProofGPT, it is easier and easier for someone who knows how to interface with Coq to just hook one up. Indeed, that is what @Scott Morrison did for Lean 4 and GPT-4 with Sagredo.
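To make the "just hook one up" idea concrete, here is a minimal sketch of the suggest-and-check loop such a tool runs. Everything here is hypothetical: `suggest_tactics` is a stub standing in for an LLM API call, and `run_tactic` is a stub standing in for driving Coq through an interface like SerAPI or coq-lsp.

```python
# Minimal sketch of an LLM-in-the-loop tactic search (stubs throughout).

def suggest_tactics(goal, n=3):
    """Stub for an LLM call proposing tactic candidates for a goal.

    A real version would send the goal (plus context) to a model API
    and parse the completions into tactic strings."""
    canned = {
        "forall n : nat, n + 0 = n": [
            "auto.",
            "induction n; simpl; auto.",
            "lia.",
        ]
    }
    return canned.get(goal, ["auto."])[:n]

def run_tactic(goal, tactic):
    """Stub for executing a tactic in the prover and checking whether
    it closes the goal; a real version would talk to Coq."""
    return (goal, tactic) == (
        "forall n : nat, n + 0 = n",
        "induction n; simpl; auto.",
    )

def attempt_proof(goal):
    """Try each suggested tactic in model order; return the first success."""
    for tactic in suggest_tactics(goal):
        if run_tactic(goal, tactic):
            return tactic
    return None

print(attempt_proof("forall n : nat, n + 0 = n"))
```

A real tool would of course add tree search over partial proof states rather than a single-shot loop, but the interface surface is roughly this small.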

if no one wants to use it, then it is a waste of time

This one is not really obvious to me. Many tools developed in the Coq community are used by developers and their collaborators for many years in research, and then sometimes make their way to a larger user base. Compare: MetaCoq, SerAPI.

That's a nice summary @Jason Rute! Thanks a lot. So far almost all ML work on Coq relies on SerAPI (Tactician does not), but indeed SerAPI was very limited for a variety of reasons. Now Flèche + coq-lsp is getting there, and these limitations should be mostly gone; however, there is still work to do, and getting changes upstream into Coq remains a complex process. Moreover, the resources allocated to these tasks are very, very low.

Also note that work on a better unified build for Coq is underway in https://github.com/ejgallego/coq-universe/

it also sounds as though the Isabelle ML edge was then in large part (?) facilitated by a large, 6-year EU grant, which are usually rare in ITP circles. Hard to replicate those conditions.

Karl Palmskog said:

if no one wants to use it, then it is a waste of time

I guess I meant this almost in the tautological sense. Yes, if it eventually gets used, either directly or by a later project, it could be worth the effort. Same if new ideas or experience come from it.

But here is one example from my own work. When we built the Lean gptf tactic, it was an interesting proof of concept which called an LLM on OpenAI's servers to make tactic suggestions. Some people used it, but it wasn't that useful to folks. Part of that was because it didn't give perfect suggestions, but part was because it wasn't as user friendly as it needed to be. Installing took a few steps, including requesting an API key via a spreadsheet. It couldn't be used for mathlib development since the project depended on mathlib, which frankly meant most Lean power users couldn't use it in their primary Lean workflow. Also, it didn't do any kind of search, which made it far less useful than it could have been. In the end, OpenAI stopped supporting it and it was shut off.

Had one put in the effort to make it a more useful tool, I could see two extremes. (1) That effort would have still been largely wasted since people still wouldn't have found it that helpful, or (2) it would have become a go-to tool, supported by the Lean community and used to drive innovation in this area.

A similar story is true of the Meta-based Lean tactic suggestion tool, which still runs. Unlike Lean gptf, it gives suggestions automatically in the VS Code infoview underneath the goal view, without having to run a tactic, so it is much more user friendly. But the suggestions aren't even filtered to those which don't fail, so users are quickly turned off by its bad suggestions. Also, again there is no search, and the feature, while easy to turn on, is quite hidden and not at all advertised. Again, there is the question of whether it is worthwhile for someone to put in the hard work to improve that tool.

I think there are also two other aspects of developing non-user-attracting tools for ITPs:

- developing the tools can reveal flaws in the ITP itself, which if reported and fixed will no longer be obstacles for future devs (we did this in connection with work using SerAPI/Serlib)
- a "bad" tool can be a benchmark for "better" tools to motivate their development and publication

Karl Palmskog said:

it also sounds as though the Isabelle ML edge was then in large part (?) facilitated by a large, 6-year EU grant, which are usually rare in ITP circles. Hard to replicate those conditions.

I wonder if @Wenda Li has thoughts on this. Would the Isabelle papers like PISA, Thor, Autoformalization, Draft Sketch Prove, MagnusHammer, Baldur, Decomposing the Enigma have happened without this grant? (Of course if the grant agency is watching, then the answer is no. :smile:)

I was also basing the grant assessment on Paulson's blog post

@Karl Palmskog As for tools, I'd be curious about your own experience with Roosterize. Is it used by other Coq people? Do you feel you put too much effort into making it usable, not enough, or just the right amount?

we were working towards integrating it into CI, which I think would have been the best way to expose it, but there were funding/timing issues. I think the jury on "too much effort" is still out; if an opportunity arises, a lot of the work might be reused and justify the previous effort.

there is a prototype for VS Code integration with naming hints, but I think that was too hard to use due to our reliance on Coq's "batch" mode (full document compilation). If reimplemented on top of coq-lsp or similar and made on-the-fly, it could work (write a lemma statement, and only the statement is sent to the model to predict the name)

I also tried to upstream all improvements we did in tooling and packaging the projects we used for training/evaluation, which are now almost all in the Coq Platform and/or in Coq-community

Jason Rute said:

Karl Palmskog said:

it also sounds as though the Isabelle ML edge was then in large part (?) facilitated by a large, 6-year EU grant, which are usually rare in ITP circles. Hard to replicate those conditions.

I wonder if Wenda Li has thoughts on this. Would the Isabelle papers like PISA, Thor, Autoformalization, Draft Sketch Prove, MagnusHammer, Baldur, Decomposing the Enigma have happened without this grant? (Of course if the grant agency is watching, then the answer is no. :smile:)

Thanks a lot for your comprehensive summaries of various ML projects on various systems, @Jason Rute . It helps a lot for everyone to keep track of the status quo.

For us, this long-term grant matters a lot -- it funds me in full and Albert in part. Though the original goal was not heavily on ML, the length of the grant allows us to do some side exploration :-)

Admittedly, many of the Isabelle projects listed above are solely for ML research. There is still a large gap between an ML prototype and a daily usable tool (ideally locally deployed) -- we hope to develop such tools in the future. Also, for daily tool development there is always a dilemma: balancing sheer performance (mostly achieved by neural networks) against efficiency (mostly achieved by traditional methods like handcrafted features/heuristics, kNN, decision trees, etc.). For this, I really appreciate the ENIGMA series from Josef Urban and Cezary Kaliszyk's group, which always strives to strike that tricky balance.

NYT article about AI and symbolic AI (ATP, ITP) in math. Not much new, but good to see the attention.

https://www.nytimes.com/2023/07/02/science/ai-mathematics-machine-learning.html

@Wenda Li and I are at the https://www.cl.cam.ac.uk/events/pittsposium/

The question of LLMs for ITP came up after Larry Paulson's talk.

@Emilio Jesús Gallego Arias has there been a markdown(?) write-up of the Coq tools for LLM developers?

Love your talk @Bas Spitters, BTW :-)

FYI, as a follow-up to this, we have an AI for Math resources document: https://docs.google.com/document/d/1kD7H4E28656ua8jOGZ934nbH2HcBLyxcRgFDduH5iQ0/edit?usp=sharing . I welcome comments and suggestions (directly on the doc)

Last updated: Jul 24 2024 at 11:01 UTC