GPT chat is able to generate Coq definitions (some are pretty decent), moreover it responds quite well to error messages.

https://chat.openai.com/chat

I think it can improve quite a bit with a script that hooks it up directly with the Coq type checker.

Did anyone try something like this already?

For context, GPTChat is winning advent of code...

https://twitter.com/ostwilkens/status/1599026699999404033

Success on day 3! :partying_face: 10 seconds for part 1! :star:️ Let's accelerate problem solving, and by extension the automation of the developer profession! https://twitter.com/ostwilkens/status/1599026699999404033/photo/1

- Carl Öst Wilkens (@ostwilkens)I think the consensus is that something like GPTChat can give you function definitions and inductive types, but proofs are going to be garbage (anyone that has used GitHub copilot for Coq code has probably experienced this)

Yes, but I'm surprised noone seems to have hooked it up to a type checker yet. Same for github copilot...

should the community really invest time in these APIs that can be withdrawn and changed at any time? If you have some collaborator at OpenAI, sure, otherwise I'd never get involved

I'd also ask whether this is desirable in any case. Having a tool that does the most stupid proofs is not going to help us as users of Coq. Something that suggests tactics or improves lemma search would be more worthwhile I think.

these models operate at token level, so they *can* predict/suggest tactics and do lemma search. But it requires work at the API interaction level (and some interaction with Coq) to set this up in a Coq development workflow. I expect this is what Bas is alluding to.

their main drawback is that they have almost zero knowledge of Coq's proof state, since this is not visible at Coq `.v`

code level at which they are trained

Using GitHub Copilot with OCaml development or Python notebooks has been a good experience and has enhanced my productivity, but I have made zero investment into this. And I've never tried it on Coq projects yet.

I think Coq beginners will have the most benefit of Coq copilot, per the blank screen problem (I want to write+prove gcd but don't know where to begin)

Indeed, especially if they have previous experience with Copilot and know that writing comments is a good way to guide it (and know how to write a comment in Coq of course ;-) ).

as I was discussing with some colleagues, a lot of problems when you use these general models for something like Coq turn into prompt engineering problems, e.g., writing sufficiently descriptive comments

@Bas Spitters you might be best positioned to set up an undergrad to investigate "prompt engineering for Coq" (as teaching and development tool)

I gave it a shot with a manual prompt, it generally answers the right question, but makes mistakes, e.g. https://imgur.com/a/millcPJ

Some good info and advice but too noisy to use directly I think.

@Théo Zimmermann Does copilot make any effort to ensure that its output type checks?

Karl Palmskog said:

should the community really invest time in these APIs that can be withdrawn and changed at any time? If you have some collaborator at OpenAI, sure, otherwise I'd never get involved

At the moment, I'm just playing, trying to see whether I should be worried that my students will start using it.

I guess more people are playing with it, so I was curious after experiences.

No, but it integrates well with other IDE features (like one that auto-type checks your code and one that auto-formats it). Then, you have to manually inspect and fix the mistakes.

Manual review of generated code is a must anyway.

Anton Golov (they/them) said:

I gave it a shot with a manual prompt, it generally answers the right question, but makes mistakes, e.g. https://imgur.com/a/millcPJ

Some good info and advice but too noisy to use directly I think.

Wow, it is so wrong and yet it looks so right.

Oh, I missed it was saying that left and right select subgoals here.

Yeah, it's good at being very confident regardless of whether it's correct or not.

It knows a little about type checking ocaml...

https://twitter.com/AlainFrisch/status/1599473197656600577

This is real. #ChatGPT #OCaml https://twitter.com/AlainFrisch/status/1599473197656600577/photo/1

- Alain Frisch (@AlainFrisch)And it also recommends `[ all: ... ]`

. And it also says that `done`

is failing to solve the subgoals when actually Coq just reports it's not applied to the right number of goals (it should be `; done.`

or `; [ done .. ].`

.

Yeah, I just thought it had snuck one good answer amongst the rest :')

State of the art in in NN guided proof search seems to be Facebook's paper here, HyperTree Proof Search for Neural Theorem Proving. This paper falls under the category of "use the neural network to predict exactly the next step in the proof". (The other end of the spectrum is "Given that we want to prove theorem A, use machine learning to retrieve the top twenty lemmas most likely to be useful for A from the thousands of lemmas in the standard library. Then use standard automation on those twenty lemmas." )

Google also has a recent paper, HOList: An Environment for Machine Learning of Higher-Order Theorem Proving, and OpenAI published Formal Mathematics Statement Curriculum Learning.

Cc @Talia Ringer

It does really badly at Coq definitions proofs about things it hasn't seen before from what I've seen. Like ask it to write the addition function over binary natural numbers

But, I do have hope this general class of technologies will be useful for helping people formalize things in the reasonably near future. I just don't think GPTChat displays any new functionality on that end we haven't already seen from other language models

SOTA moves exceptionally quickly right now, I recommend looking at workshops in this area for the most recent work

AITP: http://aitp-conference.org/2022/

Just this past weekend, MATH-AI: https://mathai2022.github.io/

We also have a web interface for a Coq machine learning proof synthesis tool (no NL handling, though it'd be neat to chain it with autoformalization tools to try that) but I need to ask my research team how much of a load it can handle before I go around sharing the link to it lol

here's our demo if anyone is curious: https://proofster.cs.umass.edu/ builds on Proverbot9001 and Alectryon right now

@Talia Ringer Interesting!

What kind of automation is Proverbot using? Would it help to reuse tactics snipe or ideas like rippling (this was popular at some point, but seems to have dissappeared)

Proverbot does neural network tactics generation. One key limitation of its approach (from paper):

[I]nstead of trying to learn and predict (;) directly, Proverbot9001 has a system which attempts to desugar (;) into linear sequences of proof commands.

An obstacle with tactic prediction/generation for Coq is that the tactic engine is very complicated compared to any HOL system. For example, `;`

can generate a bunch of evars shared among goals, and I haven't seen any machine learning model for tactic generation that takes this into account.

should anyone not human even be writing these tactic calls anyway? Might be better to actually ask for (generate) the proof term

Do you think that the lean language would be more suitable for this? As it is closer to isabelle?

I see the issue on rippling is still open https://github.com/coq-community/manifesto/issues/57

We chatted about it here: https://coq.discourse.group/t/inductive-proof-automation/226/2

I always keep away from any machine learning + proof generation, so no idea if Isar or Lean proofs are intrinsically better to generate with neural nets.

Things like rippling are in my view much more well-understood (domain-specific automation at "logic level"). What one might be interested in is how well a machine learning model might do if it gets access to a powerful proof procedure or to domain-specific automation. For example, TacticToe for HOL4 does even better when combined with the E prover.

Just a small plug for the stream https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving on the Lean Zulip. Unfortunately, there is not a clear central place for discussing these matters, but a lot of momentum has been built up on the Lean Zulip. It is not just Lean folks. For example, Albert Jiang, who works on AI for Isabelle, regularly comments there. (Arguably Twitter is the most common place to talk about this stuff, but I find Twitter hard to use, and maybe too public.)

But if folks here want to talk about it more, I'd be glad to participate. I think Coq folks sell themselves short. There is no real barrier in my opinion to the same sort of tools which have been developed for HOL-Light, Metamath, Isabelle and Lean to also be developed for Coq (or Agda, etc). Indeed that is what has been done already with ASTactic (and its decedents TacTok, Diva, and Passport), Proverbot, and Tactician.

An obstacle with tactic prediction/generation for Coq is that the tactic engine is very complicated compared to any HOL system. For example, ; can generate a bunch of evars shared among goals, and I haven't seen any machine learning model for tactic generation that takes this into account.

Just a note, Lean GPT-f (a.k.a. the PACT paper) and I think HTPS both learn to generate Lean tactics with `;`

inside.

Do you think that the lean language would be more suitable for this? As it is closer to isabelle?

I've always thought of Lean's tactic language as closer to Coq than Isabelle, but maybe I'm mistaken.

like Coq, Isabelle has several ways of writing proofs. But Isabelle's Isar language (which I think is the overwhelmingly prevalent way to write Isabelle proofs) is quite far from both Coq Ltac1 and Lean 3.

as I probably have expressed elsewhere, if we want production level proof automation based on learning, it seems far from obvious that a model for dependent type theory should even learn from these tactic scripts like in Ltac1, Ltac2 or Lean 3/4. Tactic-based proving has been described as an extreme form of aspect-oriented programming, and at least for Ltac1 the semantics is complicated (accidental?). Might be better (for Coq, at least) to design an intermediate more suitable language between raw terms and tactic scripts.

Currently when I look into neural networks I see formal languages represented quite similarly to natural languages, you come up with an encoding that represents individual tokens as vectors in R^n and then you use a recurrent neural network to remember the sequence of tokens. To me this seems inadequate as a modelling paradigm because if you were to ask a mathematician to describe a formal language then they would not generally think of it as a sequence of tokens but rather as a tree. I have seen a few tree based approaches to neural networks, are there currently any big approaches that more realistically tackle the problem as recursively trying to fill in child nodes?

(My question is here directed at term synthesis in any formal language not just Coq)

Karl Palmskog said:

as I probably have expressed elsewhere, if we want production level proof automation based on learning, it seems far from obvious that a model for dependent type theory should even learn from these tactic scripts like in Ltac1, Ltac2 or Lean 3/4. Tactic-based proving has been described as an extreme form of aspect-oriented programming, and at least for Ltac1 the semantics is complicated (accidental?). Might be better (for Coq, at least) to design an intermediate more suitable language between raw terms and tactic scripts.

I agree that the accidental complications in Ltac1 are rather unfortunate for machine learning. With Tactician, a lot of time is spent just working around the complexity of Ltac1 and bugs that are associated with it. A simpler language would be great to have. However there are many practical problems that keep me from working on something like this:

- Machine learning models rely on having data in order to learn. If we write a custom proving language, we lose the advantage of Coq being a real-world system with lots of data. It just becomes a 'toy' experiment at that point.
- A major advantage of using a well-established tactic language is the ability to exploit domain-specific automation written in Ltac. Like you mention, Ltac has lot's of accidental complexity and quirks, but on the other hand the decision procedures and heuristics people have written are invaluable.
- One could consider just using Gallina as a language. That way at least you will have mountains of data. But expecting machine learning models to manually write terms with complex pattern matching, induction principles, higher order predicates and similar constructs is currently just beyond its capabilities.

Patrick Nicodemus said:

Currently when I look into neural networks I see formal languages represented quite similarly to natural languages, you come up with an encoding that represents individual tokens as vectors in R^n and then you use a recurrent neural network to remember the sequence of tokens. To me this seems inadequate as a modelling paradigm because if you were to ask a mathematician to describe a formal language then they would not generally think of it as a sequence of tokens but rather as a tree. I have seen a few tree based approaches to neural networks, are there currently any big approaches that more realistically tackle the problem as recursively trying to fill in child nodes?

For the record, the machine learning people have lots of models that can be trained on trees directly. It just turns out that neural sequence-input models often perform better for many tasks. (At least, this is what I've heard from machine learning expert collaborators.)

@Lasse Blaauwbroek for a view of what pure Gallina proofs look like, see here for example: https://github.com/llee454/functional-algebra

Doesn't look impossible to learn/generate to me. But the advantage of an intermediate (non-Ltac) language is that you can handle more proof automation like `lia`

, etc.

one interesting proof engineering question is whether all the custom Ltac that people write for domain-specific automation is actually good, or, as I suspect, a lot of it is mediocre and could just be replaced by a combination `lia`

and friends on one hand, and `sauto`

/`itauto`

on the other.

Not impossible indeed. But it should be noted that these proofs are incredibly neatly written. Most Gallina proofs don't even look close to this. So the amount of such neatly written proofs is quite limited (a problem for machine learning). But even with these neat proofs I think it would be quite a challenge to make this work. The problem is interacting with the type-checker. With these kind of higher order predicates, you have to submit quite a large chunk of code with holes in them before the typechecker will give you feedback (otherwise the unification algorithm tends to fail).

In an ideal world, Ltac would have been designed with an intermediate layer that does away with all of the syntax sugar and similar constructs. Unfortunately we don't live in that world, and it is quite difficult to retrofit such an intermediate layer (Tactician does have work in that direction).

Karl Palmskog said:

one interesting proof engineering question is whether all the custom Ltac that people write for domain-specific automation is actually good, or, as I suspect, a lot of it is mediocre and could just be replaced by a combination

`lia`

and friends on one hand, and`sauto`

/`itauto`

on the other.

It depends on the development. But for deeply embedded projects like Iris, the custom auomation is certainly indispensable. I don't think their tactic layer can be replaced with just `sauto`

and certainly not `lia`

.

With these kind of higher order predicates, you have to submit quite a large chunk of code with holes in them before the typechecker will give you feedback

Although I suppose it would be interesting to put something like ChatGPT in a conversation with a typechecker so that it iteratively resubmits a code fragment and looks at the error message of the checker to refine the fragment until it is accepted.

It's true that a custom proving language would lose the advantage of Coq being a real world system with lots of data. However if you could come up with a good way to generate random theorems and their proofs (synthesize arbitrary terms) then this might be a partial corrective

I think one high level thing that is missing from this discussion is what does the user wants from AI proof automation. I see (broadly) two goals:

- An ATP that generates Coq proofs, in any form (Ltac, Gallina, intermediate language, alien script) as long as it successfully generates a term proof in Coq in the end. In this way, all we are asking for is something that proves theorems. Like CoqHammer, one probably doesn't want to leave the AI in the proof as a tactic, so one must replace the AI call with a tactic witness. I don't know a lot about CoqHammer, but at least the Isabelle SledgeHammer works, as I've been told, by generating fairly opaque looking tactics which one pastes into their code. If the proof changes and needs to be repaired then one reruns Hammer instead of directly repairing the proof.
- A system which generates human-level proofs or proof suggestions which are ready to be pasted into code. Such a system doesn't just generate a proof, but one that ideally is well-written, easy to maintain, and would pass code review. It would be more useful for those learning Coq and wouldn't require using the proof-generating system for repairing proofs. One the extreme end, one could imagine a tool which doesn't just generate proofs, but also definitions, lemmas, and entire chunks of code (similar to the family of GPT products---Codex, Copilot, and ChatGPT---but more suited for Coq). For example, Talia has talked a lot about the need for a broader perspective here.

I think most AI tools currently fall in the middle. They do some cleaning up of data, decomposing tactics or training on both kernel and tactic proofs, so in the end you get something human like, but still more machine-like than a human would generate. But I'm not sure what users actually want most. (Or if they want any of this even if it is reasonably good.) But one should note that custom intermediate languages and other such ideas would likely focus more on the first objective than the second, which is ok, but should be noted.

@Jason Rute Thanks for the plug about lean zulip! I've spend a bit of time browsing. I'll dump it here, in case anyone is interested:

Auto formalization benchmarks in lean: https://paperswithcode.com/sota/automated-theorem-proving-on-minif2f-test

https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/More.20papers.20on.20autoformalization/near/312985378

LeanChat vscode plugin of GPTChat for lean:

https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/Lean.20Chat.20is.20publicly.20available

Thor (should also work for Coq): https://arxiv.org/abs/2205.10893

A Survey of Deep Learning for Mathematical Reasoning https://arxiv.org/abs/2212.10535

This Thor paper is actually quite disappointing. I am sorry but the whole content of this paper seems to be "What if we added the token 'sledegehammer' to the vocabulary of the large language model?" That's not a very sophisticated integration of the two techniques. Their "Algorithm 1" is the central idea of the paper and it is 5 lines.

The authors of the paper repeat at multiple points that they are using coqhammer for premise selection but Coqhammer literally has a built-in keyword `predict n`

where it will immediately return the theorems most likely to be useful, this does not involve running the theorem prover for 60 seconds. In my opinion the authors are essentially contradicting themselves here, because if their real goal was to leverage the software for premise selection, they would have done this... but there is no mention of this off-the-shelf feature. The phrase "premise selection" occurs 42 times throughout the paper, but what do they mean by this?

(Yes I realized they used Isabelle's sledgehammer, but there must be some similar functionality there.)

OTOH, their approach to premise selection fits how hammers actually work (EDIT: AFAIK), and that simple idea is already pretty effective, according to the abstract; the nontrivial contribution is the evaluation...

I'm not a hammer expert, but the key idea (AFAIK) a premise lemma seems useful if an ATP can complete the proof with it, and if it seems useful we can feed it to proof reconstruction, even if proof reconstruction is even slower. From the Coqhammer docs, `predict n`

seems an even earlier low quality phase, used to feed ATPs

Having taken a closer look, I see good points on both sides: more sophisticated integration is conceivable, but algorithm 1 has a clever idea even if it seems straightforward

- while the experiment in Sec. 4.4 is useful, the description requires generous reading: there, "premise selection" really means "computing the lists of lemmas passed to proof reconstruction tactics". And is training on the original inputs really learning
*how*to do such "premise selection"? One of the "how"s, for humans, is calling sledgehammer! - that's what's clever about algorithm 1 is that it *replaces" some proofs by sledgehammer, but that will bring some proofs closer to what humans actually typed!

for everyone's record, since people are interested in chatgpt there also better solution (in terms of ethical issues with respect to training data) and being free (as in freedom as long as you have the hardware to run it)

There is bloom which is arguably on par with chatgpt:

https://huggingface.co/docs/transformers/model_doc/bloom

Key difference is that unlike chatgpt bloom has no good marketing department (as with any other foss project)

I would say Bloom is closer to the original GPT-3 than to ChatGPT. In particular Bloom is not instruction fine tuned. Instead, like the original GPT-3, it is just trained on next word prediction using a large segment of the internet. So if you give it “Write me a love poem.” as a prompt, you are more likely to get it to continue the instruction than to write the live poem. Of course, like the original GPT-3, you can make clever use of prompting to encourage the model to do what you like, but even then it has been shown that instruction fine tuned models do better on a lot of tasks even when prompting is still used. The only public instruction fined tuned model I know is Flan-T5. (But that uses a more supervised form of instruction fine-tuning where a small percent of the training data includes instruction prompts. It does not use reinforcement learning from human feedback like ChatGPT.)

Last updated: Dec 05 2023 at 05:01 UTC