Hi folks, I asked ChatGPT to produce a group and monoid in Coq; here's what it gave back to me XD:

```
Require Import mathcomp.ssreflect.ssreflect.
Set Implicit Arguments.
Unset Strict Implicit.
Unset Printing Implicit Defensive.
Structure Monoid : Type :=
{
carrier : Type;
op : carrier -> carrier -> carrier;
id : carrier;
associativity : associative op;
identity : left_id id op;
}.
Structure Group : Type :=
{
base : Monoid;
inv : carrier base -> carrier base;
inverse : forall x : carrier base, op (op base) (inv x) x = id base /\ op (op base) x (inv x) = id base;
}.
Coercion MonoidMixin (g : Group) : Monoid := g.(base).
Notation "1" := (id _).
Notation "x * y" := (op _ x y).
```

Not bad at all!
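For comparison, the snippet above doesn't typecheck as-is: in `Group`, the projection `op` ends up applied to four arguments, and the `MonoidMixin` definition can simply be the `base` projection declared as a coercion. A corrected, untested sketch (note I keep only `Set Implicit Arguments`; with `Unset Strict Implicit`, the `Monoid` argument of `id` would become implicit too and `id base` below would no longer typecheck):

```
From mathcomp Require Import ssreflect ssrfun.

Set Implicit Arguments.

Structure Monoid : Type := {
  carrier : Type;
  op : carrier -> carrier -> carrier;
  id : carrier;  (* note: shadows the standard identity function *)
  associativity : associative op;
  identity : left_id id op
}.

Structure Group : Type := {
  base : Monoid;
  inv : carrier base -> carrier base;
  (* the Monoid argument of op is implicit here, so op takes two arguments *)
  inverse : forall x : carrier base,
    op (inv x) x = id base /\ op x (inv x) = id base
}.

(* the projection itself can be declared as the coercion *)
Coercion base : Group >-> Monoid.
```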

did you use "SSReflect" "MathComp" prompt engineering? Maybe even Hierarchy Builder...

Yes, I asked it to do it using math-comp style

First time I actually got what is a good suggestion for syntax XD

```
Structure Monoid :=
{
carrier : Set;
op : carrier -> carrier -> carrier;
id : carrier;
associativity : forall x y z : carrier, op (op x y) z = op x (op y z);
identity : forall x : carrier, op id x = x /\ op x id = x
}.
Structure Group := Monoid & {
inv : carrier -> carrier;
inverse : forall x : carrier, op (inv x) x = id /\ op x (inv x) = id
}.
```

Typescript-style XD

but it looks like something I would love to write in Coq

seems weird that carrier is `Set`

?
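To illustrate the `Set` concern: with `carrier : Set`, the structure cannot be instantiated at anything living above `Set` in the universe hierarchy, such as `Prop`. A minimal, untested sketch (proof fields dropped, and the names `SmallMonoid`/`scarrier`/`sop`/`sid` are made up here to avoid clashing with the definitions above):

```
Structure SmallMonoid := {
  scarrier : Set;
  sop : scarrier -> scarrier -> scarrier;
  sid : scarrier
}.

(* Rejected: Prop has type Type, not Set, so it can't be the carrier. *)
Fail Definition propMonoid : SmallMonoid :=
  {| scarrier := Prop; sop := and; sid := True |}.
```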

Now writing some instances:

```
Require Import ZArith.
Open Scope Z_scope.

Definition IntAddMonoid : Monoid :=
{|
carrier := Z;
op := fun x y => x + y;
id := 0;
associativity := Z.add_assoc;
identity := Z.add_0_l;
|}.
Definition IntAddGroup : Group :=
{|
base := IntAddMonoid;
inv := fun x => -x;
(* the conjunction field needs conj, not a pair *)
inverse := fun x => conj (Z.add_opp_diag_l x) (Z.add_opp_diag_r x);
|}.
```

Karl Palmskog said:

seems weird that carrier is `Set` ?

Yes, after all ChatGPT has no clue about what it is doing XD

But I'm adding ChatGPT support to the next coq-lsp release, we will see

Happy to add support for other people's systems by the way

isn't it better to hook into Copilot? nearly same model?

No idea @Karl Palmskog , happy to hook into whatever people think is fine

Most interesting for me is the button I just added "explain this proof"

It is not something meant to be serious, but to let people have fun and get ideas

I liked the "suggestion by tabbing" they did in the Copilot extension for VS Code. I really don't like services with signup, but this one was via GitHub, so one barrier less.

Yeah, def I don't like services with signup either

It seems to me ChatGPT is very good at following orders, but totally clueless about proofs

right, its training data has zero proof states

as for "following orders", I'm not sure the orders are exactly *your orders* as seen for example here

Interesting, thanks. So far whenever I have suggested it to take a correction on a proof, it has done more or less Ok, but indeed, it is just my very limited testing for a few weeks, that's why I want to add it to a wider interface

Asking it to explain the infinitude of primes proof in math-comp:

The statement `Lemma prime_above m` states that for every integer `m`, there exists a prime number `p` such that `m < p`.

The proof starts with the line `have /pdivP[p pr_p p_dv_m1]: 1 < m! + 1`. This uses the `have` tactical, which is used to introduce a new subgoal. The subgoal is to prove `1 < m! + 1`. The `/pdivP[p pr_p p_dv_m1]` part says that if the subgoal can be proven, then there exists a prime number p that divides m! + 1, and the p is stored in the variable `p`, and the proposition that p is prime is stored in the variable `pr_p`, and the proposition that p divides m! + 1 is stored in the variable `p_dv_m1`.

The next line `by rewrite addn1 ltnS fact_gt0` uses the `by` tactical to prove the subgoal `1 < m! + 1`. The proof works by first adding 1 to m! and then applying the `ltnS` lemma, which says that n.+1 > n for any integer n. The lemma `fact_gt0` states that the factorial of any positive integer is positive.

The next line `exists p => //;` says that there exists a prime number p such that m < p. The `=> //` part says that the proof of this proposition is trivial, so it can be omitted.

The next line `rewrite ltnNge;` rewrites `m < p` to `~ (p <= m)`.

The next line `apply: contraL p_dv_m1 => p_le_m` applies the `contraL` lemma to the current goal, with the assumption `p_dv_m1`, to get a new goal that is the negation of the original goal. The `=> p_le_m` part introduces a new hypothesis `p_le_m`, which states that `p <= m`.

The last line `by rewrite dvdn_addr ?dvdn_fact ?prime_gt0 // gtnNdvd ?prime_gt1` finishes the proof. The `dvdn_addr` lemma states that if a divides b and c, then a divides b + c. The `dvdn_fact` lemma states that p divides m! if and only if p divides m. The `prime_gt0` lemma states that any prime number is positive. The `gtnNdvd` lemma states that if p > n and p divides n, then p = n. The `prime_gt1` lemma states that any prime number is greater than 1. The proof works by using these lemmas to conclude that if p <= m, then p cannot be a prime number.

Finally, the `Qed` command ends the proof.

The original proof for reference:

```
From Coq Require Import ssreflect ssrfun ssrbool.
From mathcomp Require Import eqtype ssrnat div prime.
(* A nice proof of the infinitude of primes, by Georges Gonthier *)
Lemma prime_above m : {p | m < p & prime p}.
Proof.
have /pdivP[p pr_p p_dv_m1]: 1 < m`! + 1
by rewrite addn1 ltnS fact_gt0.
exists p => //; rewrite ltnNge; apply: contraL p_dv_m1 => p_le_m.
by rewrite dvdn_addr ?dvdn_fact ?prime_gt0 // gtnNdvd ?prime_gt1.
Qed.
```

urgh, that's like a reading of the SSReflect lines with `Check` lookups

I guess for people that don't know ssr syntax it could be helpful?

Oh but indeed I've just realized in coq-lsp I have tons of dynamic info at hand, so I can generate a fairly interesting prompt

Could be a paper there XD

ah, but this is the problem, how do you do transparent/reproducible evals of nontrivial GPTChat applications unless you have an OpenAI co-author?

I guess you get an OpenAI co-author?

you can of course do experiments like this, but this is different from low-level tool evals.

For me ChatGPT helps because I can use it to test an interaction model of writing Coq proofs that I'd like to support well, independently of the engine. And of course, it can be useful to users.

based on the state of this repo, not sure they have many theorem proving people left...

Karl Palmskog said:

based on the state of this repo, not sure they have many theorem proving people left...

Indeed I'm not sure either; some key people in that area left OpenAI

this one also seems abandoned: https://github.com/openai/lean-gym (but probably there are living forks)

Tho the possibility to do studies exists, as `coq-lsp` can almost chat to ChatGPT and can almost run in `github.dev` / `vscode.dev`

so certainly you only need to recruit users, and have them clone a repo.

if all you aim for is user-study success, sure. But this limits scope quite a bit.

I'm not sure what OpenAI is doing; however, I'm not sure either that the states of these repos mean a lot. If they are working on some new stuff, you will see it when it is released.

The explanation is pretty good, but I guess ChatGPT got `contraL` and `ltnS` wrong, and I think the end is proving that p divides m!, so it can't divide m! + 1?
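For reference, the actual mathcomp statements do differ from ChatGPT's glosses; roughly (from memory, worth double-checking with `Check`):

```
From Coq Require Import ssreflect ssrbool.
From mathcomp Require Import eqtype ssrnat div prime.

Check ltnS.     (* (m < n.+1) = (m <= n) -- an equality of booleans,
                   not "n.+1 > n" *)
Check contraL.  (* (c -> ~~ b) -> b -> ~~ c -- boolean contraposition,
                   not "the negation of the original goal" *)
Check gtnNdvd.  (* 0 < n -> n < d -> (d %| n) = false -- a larger number
                   cannot divide a smaller positive one, not "p = n" *)
```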

And I wonder if the explanation overlaps with that on https://jscoq.github.io/wa/, which Google seems to rank pretty high whenever I google these lemmas... OTOH, I don't actually see much in common?

these state-of-the-art SSR proofs are one thing, but I'd like to see GPTChat make any sense whatsoever out of Coq 8.4 era semi-automated proofs... those are pretty accurate to describe as write-only

but I guess this is the main rift between Isabelle users and Coq users: "proofs must be understandable" (math proofs) vs. "proof scripts are doing tedious stuff and there's often not much value in looking at them" (PL proofs)

Karl Palmskog said:

these state-of-the-art SSR proofs are one thing, but I'd like to see GPTChat make any sense whatsoever out of Coq 8.4 era semi-automated proofs... those are pretty accurate to describe as write-only

Indeed here something like `coq-lsp` can query intermediate goal state and put that in the prompt ;)

Emilio Jesús Gallego Arias said:

Most interesting for me is the button I just added "explain this proof"

@Emilio Jesús Gallego Arias

Could we have a button explain this error message?

Skimming the ML part of this Zulip, it looks like there are some projects trying to use the public GPT API to prove mathematical theorems or suggest next steps. Is there anyone working on a public project to do a deeper integration of proof assistants with LLMs, i.e. looking for customized ways to train LLMs for this use case? The OpenAI co-founder himself seems to have a lot of interest in proof assistants, especially improving proof automation for code verification, and he seems to understand FM in good depth: see timestamp 25:30-30:00 of this video https://twitter.com/i/broadcasts/1lPKqbEWzaMGb

https://github.com/lupantech/dl4math

As for papers using the GPT API this paper just dropped: https://arxiv.org/abs/2310.04353

It uses GPT-4 and compares against ProverBot9001 on Coq (Compcert). It also compares against ReProver on Lean (MiniF2F). There is an extensive discussion on the Lean Zulip here: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/COPRA

It makes me a bit more optimistic about the future of AI when I see papers on integrating these chatbots with external databases.

If you want the tool to be able to work with humans in the sense of integrating new theorems and definitions as the human proves them, it needs to be able to learn the meaning of new symbols without expensive training.

I wouldn't want to use a tool if it wasn't able to prove theorems using symbols that I defined.

And it's necessary for the reinforcement learning problem of learning a whole library by advancing through the theorems step by step, I think, which is one of the most interesting and difficult problems. But to do this you need some kind of fine grained control over what the machine is allowed to know (things already proven) and what it's not allowed to use (things later in the data set that depend on this). There are probably simpler hacks to do this, I don't know, but intuitively I would be suspicious of any kind of "reinforcement learning" paper that trained on the whole set of theorems in some way, so that the advanced theorems were already known to the machine, even if they blocked the machine from using these results. A more honest approach imo is that the machine is only allowed to learn symbols as they are introduced and it has access to an external database at any time that it can use to look up the meanings of symbols.

@Patrick Nicodemus I generally agree with you, but there is also a possible application of reinforcement learning where an AI tries and learns from proofs of theorems in any order. That is in the development of a big project. Many projects (e.g. Flyspeck, Liquid Tensor Experiment, Sphere Inversion, etc) first outline the theorems and lemmas to be proved and then start filling in the proofs. A reinforcement AI agent could (say every night) attempt to prove the remaining unproved lemmas (as well as reprove existing lemmas) and learn from any successful proofs.

This is a very similar setup to many RL for theorem-proving papers (e.g. "Learning to Reason in Large Theories without Imitation" and "Reinforcement Learning of Theorem Proving").

@Jason Rute I haven't read the paper yet, but do they explain why they restrict to CompCert? Only because it's the benchmark that ProverBot was using? IIUC Coq should be an attractive platform for LLMs because it has the largest library of formal proofs, but apparently no one is using that.

from what I recall, the original ProverBot9001 paper was indeed restricted to CompCert for evaluation, so I guess they limit to where a comparison is possible

@Bas Spitters Yes, the ProverBot9001 paper only tests on CompCert. There is another unpublished evaluation of ProverBot9001 on the same packages used in the CoqGym benchmark. cc @Alex Sanchez-Stern

as to "largest library of formal proofs": we have on one hand the Coq opam repo and on the other hand Platform. But:

- the opam repo is by nature a moving target and is not guaranteed to build all packages at any time
- the Platform mostly contains generally-useful packages (but does guarantee builds and provides tags)

Unfortunately I don't see this situation changing any time soon. What one can do is curate "extensions" of the Platform releases for machine learning, but this needs to be done continually.

we have a bunch of packages pending to be added to the Platform, but since they are "general utility", probably not enough to attract machine learning interest

Hmm what about Coq community? Would that be a good benchmark set?

quite a few Platform packages are hosted in Coq-community (see :rooster: symbol on website). While some of the remaining packages are high profile (:star: symbol) and would fit in a machine learning benchmark, we also have experimental packages that probably should not be used, and may not be compatible with latest Coq version.

so the best approach would probably still be: take Coq Platform as base, and manually curate additional packages from the Coq opam archive (the highest quality Coq-community ones such as fourcolor are already there)

Ah, for some reason I thought that new releases of Coq were tested against coq-community to make sure those are always compatible.

a subset of the projects in Coq-community are part of Coq's CI and are tested in every commit to Coq, which is yet another distinct set of projects (not all of them packaged in the Coq opam repo): https://github.com/search?q=topic%3Acoq-ci+org%3Acoq-community&type=Repositories

Huh, that link doesn't work for me, it just shows empty search results

I changed the link, try again now

Ah yeah that works. 13 projects that are used in CI from coq community, good to know. That might be a useful benchmark set for one of my projects at some point.

See also https://github.com/coq-community/manifesto/#position-in-the-coq-ecosystem for the difference between coq-community, Coq Platform, and Coq's CI projects.

Thank you for all your feedback. Please feel free to ask any questions related to our work https://arxiv.org/abs/2310.04353

Bas Spitters said:

Jason Rute I haven't read the paper yet, but do they explain why they restrict to CompCert? Only because it's the benchmark that ProverBot was using? IIUC Coq should be an attractive platform for LLMs because it has the largest library of formal proofs, but apparently no one is using that.

We restrict to CompCert because we had a comparison available from Proverbot. Other than that, we have monthly quota restrictions from OpenAI on the amount of usage, and GPT-4 is relatively costly. So we couldn't run on other benchmarks.

@Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this months budget left :-)

@Amitayush Thakur Where did you get the data to do the comparison with ProverBot on Compcert? (For example, the numbers and theorem lists needed to make your plots.) Did you rerun the ProverBot benchmark, or get the data from the ProverBot authors? I'd be interested in accessing this data? (I'd also be interested in the data for ASTactic, DIVA, TacTok, and Passport as well, but I think at least the ASTactic data is just lost forever.)

Bas Spitters said:

Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this months budget left :-)

It is not hard to run; our tool is open and anyone can run it given they have an OpenAI account with access to GPT-4 (https://github.com/trishullab/copra). Some restrictions exist because we use SerAPI and CoqSerAPI, which don't have support beyond Coq 8.15

Jason Rute said:

Amitayush Thakur Where did you get the data to do the comparison with ProverBot on Compcert? (For example, the numbers and theorem lists needed to make your plots.) Did you rerun the ProverBot benchmark, or get the data from the ProverBot authors? I'd be interested in accessing this data? (I'd also be interested in the data for ASTactic, DIVA, TacTok, and Passport as well, but I think at least the ASTactic data is just lost forever.)

One can run Proverbot9001 from their GitHub repository. However, thanks to @Alex Sanchez-Stern (author of Proverbot) who shared the results from the historical runs of Proverbot, we didn't have to re-run Proverbot. The results shown in the Proverbot report are very detailed (https://proverbot9001.ucsd.edu/reports/2019-11-20T18d38d02-0700+cd762eb9e7e6e44153bd766654727a36a3dcad0b/report.html)

Amitayush Thakur said:

Bas Spitters said:

Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this months budget left :-)

It is not hard to run; our tool is open and anyone can run it given they have an OpenAI account with access to GPT-4 (https://github.com/trishullab/copra). Some restrictions exist because we use SerAPI and CoqSerAPI, which don't have support beyond Coq 8.15

@Emilio Jesús Gallego Arias Any insights on this. I know you've been thinking about how Coq should interface with LLMs.

Hi @Bas Spitters , thanks for the ping!

I think porting `CoqSerAPI` to newer Coq versions should be very easy, if not automatic; indeed I understand that they are on 8.15 for practical reasons, and moving to 8.16 doesn't add a lot of value for them, I think, but Alex is the authority on CoqSerAPI. Alex gave me very nice and detailed feedback on the improvements they'd need, so that's pending.

Regarding tooling, indeed the tooling is in a poor state, especially when compared to other systems. I'm unsure how to improve the tooling; on the technical side it is very clear to me what to do, but I feel there is for example large developer "cultural" resistance to unifying Coq developments in a more uniform way so they can be consumed as a dataset, using more advanced tools in the core of the system (say PPX), etc... Moreover, as far as I know there is no project-wide AI strategy; I tried to start a working group some time ago but it didn't pick up, and my student working on the topic left for a startup (crazy turnover lol).

Depending on the kind of interaction you want to do, coq-lsp could make your life very easy, at the cost of writing a plugin, happy to help if there is a more concrete problem.

@Lasse Blaauwbroek also has very nice tooling in tactician, I'm unsure how many hacks tactician is using now, but in principle I took all of Lasse's feedback into account so hopefully he'd have a nicer time.

The main advantage that coq-lsp provides over regular Coq is that you have access to full theories and full documents.

so you can do for example, a theory level plugin, by registering to listen to certain events.

The engine can also do a lot more, etc... for example we recently implemented speculative execution. Still IMVHO very large gains can be made in performance on the Coq side.

But IMHO tooling for LLMs in Coq remains a huge obstacle, and so far I'm very unsure I will devote any more resources to it.

Oh, I was confused by terminology. By CoqSerAPI, you mean the Proverbot9001 Coq-interfacing Python library coq_serapy, it sounds like. As distinct from the opam package coq-serapi, which is the OCaml and command-line sexp interface that Emilio developed. Newer versions of coq_serapy support up through 8.18 I think, so if that's your bottleneck you should be able to update (though it looks like you included the source of an older version in your repo, so it might take some doing).

I checked out the lean Zulip discussion and had some comments there. For those who don't want to go over there to see, it seems like there are a few threats to validity that the authors will include in their next version (we've been corresponding over email):

- Test leakage - the GPT-4 model was trained on github in 2021, including the CompCert and miniF2F datasets. While there doesn't seem to be evidence that the model has memorized any of the test proofs directly, more subtle forms of test leakage have not been discounted.
- Not accounting for cost of queries - The abstract claims that COPRA is "faster" than other tools it compares to, where "faster" means "using fewer queries". However, the cost of a query isn't constant across the tools, and in fact queries in Proverbot9001 appear to run 100-1000x faster on a CPU than the COPRA queries do using the OpenAI datacenter hardware.
- It isn't quite possible to measure pass@k-inferences on Proverbot9001, so they use a proxy to compare whose accuracy can vary.
- The comparison uses a version of Proverbot9001 from 2019 that performs significantly worse than modern Proverbot9001. The newest results are unpublished (well, they're published in a web page), so that's fair, it's just good to know that the Proverbot9001 approach can do a bit better, affecting the comparison.

@Alex Sanchez-Stern Is there a plan to publish the newer ProverBot9001 methodologies and results in a paper? Is ProverBot9001 still being improved and developed? Is there any plan to make it something Coq users can use?

I don't think we're planning on publishing the improvements to base proverbot as they're mostly engineering improvements, which are hard to publish. We're currently building on Proverbot9001 for a few projects at UMass Amherst, UCSD, and UCLA, and as we do that we're constantly making improvements to the base model. In terms of usability, it's open source and works on the latest version of Coq, the two usability issues I see are (1) it doesn't have a plugin interface that can suggest proofs directly in the IDE, you have to run it externally on your project from the command line (it can generate HTML or .v formatted solutions), and (2) setup can be tricky as with any research code, though at this point 10-20 people I know have set it up.

Alex Sanchez-Stern said:

Oh, I was confused by terminology. By CoqSerAPI, you mean the Proverbot9001 Coq-interfacing Python library coq_serapy, it sounds like. As distinct from the opam package coq-serapi, which is the OCaml and command-line sexp interface that Emilio developed. Newer versions of coq_serapy support up through 8.18 I think, so if that's your bottleneck you should be able to update (though it looks like you included the source of an older version in your repo, so it might take some doing).

@Amitayush Thakur does that solve your issue?

Emilio Jesús Gallego Arias said:

Hi Bas Spitters , thanks for the ping!

I think porting

`CoqSerAPI`

to newer Coq versions should be very easy, if not automatic; indeed I understand that they are on 8.15 for practical reasons, and moving to 8.16 doesn't add a lot of value for them, I think, but Alex is the authority on CoqSerAPI. Alex gave me very nice and detailed feedback on the improvements they'd need, so that's pending.

Regarding tooling, indeed the tooling is in a poor state, especially when compared to other systems. I'm unsure how to improve the tooling; on the technical side it is very clear to me what to do, but I feel there is for example large developer "cultural" resistance to unifying Coq developments in a more uniform way so they can be consumed as a dataset, using more advanced tools in the core of the system (say PPX), etc... Moreover, as far as I know there is no project-wide AI strategy; I tried to start a working group some time ago but it didn't pick up, and my student working on the topic left for a startup (crazy turnover lol).

Depending on the kind of interaction you want to do, coq-lsp could make your life very easy, at the cost of writing a plugin, happy to help if there is a more concrete problem.

Thanks @Emilio Jesús Gallego Arias Do you have a write up/roadmap of what needs to be done ?

I'd imagine there's at least some interest from Coq (library) developers.

@Alex Sanchez-Stern If there's sufficient progress that the community should know about, but not a full paper, you could consider submitting it to https://popl24.sigplan.org/home/CoqPL-2024#Call-for-Presentations

Machine Learning ITP dataset (Lean and Agda): https://arxiv.org/abs/2310.16005 by @Andrej Bauer

Unfortunately, Coq is missing.

I provide some commentary on that dataset over at the Machine Learning channel of the Lean zulip. I don't think that data is necessarily that fancy. I think (but I could be mistaken) it is just the terms for each declaration (theorem, definition, axiom, etc) along with some data for recovering dependencies. If so, this information is readily available in a few places in Lean such as the olean files and by querying the Lean environment inside Lean. I assume you can similarly get it for Coq (although Coq has the challenge of 100s of different Coq packages). One could also look to existing datasets like CoqGym for possibly similar data.

I'm not the most impressed with their experiments from what I've seen. It seems they are using quite old techniques, and it isn't immediately clear how applicable their results are to real concerns in machine learning theorem proving. Their link prediction task seems to be a form of premise selection. Nonetheless, I do think it is time we revisit premise selection. Premise selection and argument selection are key to theorem proving, and most papers have something like a premise selection tool inside them. (For example, MagnusHammer is a premise selection tool that largely replaces the external ATP call in SledgeHammer, from what it seems.) It would be really interesting to try to compare many different premise selection methods from across the literature. (This is easier said than done, of course.) There was once a premise selection dataset called HOLStep, but people don't use it anymore, and none of the modern premise selection approaches are represented in the HOLStep results. Maybe it is time to revive the topic.

if their main thing is sexp-serialized terms for declarations, then we collected a bunch of those for MathComp+ecosystem when we did learning to suggest lemma names: https://github.com/EngineeringSoftware/math-comp-corpus

I'm sure the same dataset could be refreshed for recent Coq+MathComp, most projects are now in Coq Platform (and so are the tools for generating sexps)

TLDR: In my opinion, we need to collect very different datasets: the datasets should record how humans actually develop Coq proofs and programs rather than the polished end results. We need to record our struggles with the Coq typechecker and the queries (e.g. Search) we invoke to find the right stuff. I will get to the "why" later, but first, I have 2 suggestions for getting such datasets and they may appeal differently to people with varying levels of privacy concerns.

1) Request the Coq community to not delete the query commands they use to produce their proofs. They can remain in the MR/PR (merge/pull request) until the MR/PR is ready to merge and some git protocol should be followed to ensure that the unclean version with all the queries remains in the history (there are many ways to do this). Since my ML research days (decade ago), I have appreciated how much data is gold. Sadly, most of my colleagues in the theorem-proving community rewrite their git history (not just the final version) to look very clean and thus lose information.

2) Instrument Coq IDEs to record the process of coming up with the right Coq function/proof. While writing a Coq definition/Fixpoint/proofs, this should include not just the final version but also the versions that had typechecker/tactic errors, the error messages, and the edits the IDE users did to fix them. It should also include the queries (Search/About/Check/Print...) the humans did during the process. I know many will have privacy concerns. But at least I would not have minded as long as I can review the final data before it is submitted for training AI models. Some employers may have concerns about leaking secret sauces but I guess most of the academia will not. Those employers may also collect the data anyway but not release it until they decide to open-source the proofs: it would anyway be hard in the long run to sell a proven product and not release the proofs.

WHY?

I guess nobody knows why modern neural networks work so well but it seems most of the breakthrough ideas in the field have come from thinking about and mimicking how humans (brains) work: from the inception to ideas like attention heads. So I think it is worthwhile thinking about how humans write Coq proofs/functions or even essays/papers. We never come up with the final version right away: it is often a long back and forth. To me it seems odd that the current LLM architectures do not have "backspace/left-arrow/delete" in their keyboard: they can only predict the next word even when they seem to already know that they made a mistake. Imagine humans writing a Coq program or an essay in one shot when your keyboard has no backspace/left-arrow/delete buttons: I would be extremely bad at it. For Coq programs/proofs, imagine writing them without being able to issue queries or look at typechecker errors.

A prompting loop can address some limitations but doesn't seem to go very far. For example, a few weeks ago, I tried a manual prompt loop where I fed error messages and tried to get ChatGPT4 to write a correct vec.append function *recursive on the second argument*, something that human beginners in Coq struggle with. It was able to correct some errors but did not get it fully right. I don't think that just feeding it the errors would have sufficed as it wasn't making progress. Then I asked ChatGPT to come up with a Search query for the thing it was struggling with. It came up with the right query and used the result I gave it to finally write the correct vec.append. The full interaction can be found here (my initial prompt was imprecise and it took me 3 prompts to precisely specify what I was asking ChatGPT to do) https://chat.openai.com/share/63cb977c-6022-4ee0-8618-8c6d04cb56b4
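For context, this is roughly the definition being asked for: appending length-indexed vectors by recursion on the *second* argument forces explicit transports along arithmetic equalities, which is exactly what beginners (and LLMs) trip over. An untested sketch using the standard library's `Vector`:

```
From Coq Require Import Vectors.Vector.
Import VectorNotations.

(* The result index n + m does not reduce when recursing on m, so each
   branch needs an explicit transport (eq_rect) along an equality on nat. *)
Fixpoint append2 {A n m} (v : t A n) (w : t A m) {struct w} : t A (n + m) :=
  match w in t _ k return t A (n + k) with
  | [] => eq_rect n (t A) v (n + 0) (plus_n_O n)
  | @cons _ h m' w' =>
      (* v ++ (h :: w') = (v with h pushed onto its end) ++ w',
         transported along S (n + m') = n + S m' *)
      eq_rect (S (n + m')) (t A)
        (append2 (shiftin h v) w')
        (n + S m') (plus_n_Sm n m')
  end.
```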

Later, I tried and failed to get ChatGPT to ask for query commands on its own when needed, rather than me reminding it at the right times in a prompt loop: https://chat.openai.com/share/75178430-7f60-41bd-9e92-0f56c31e36df

I don't think such prompting tricks will go far. I am not a neural network expert, but I think the architecture should be changed a bit to give ChatGPT more power to correct its mistakes and to ask for external help. Currently, the architecture allows only one action: predict the next word(let). It should be allowed to do the following action as well: delete the previous word. Moreover, in the context of Coq/Lean, it should also be allowed to choose the next step to be: evaluate a query (Check/Search/Print..) and use the result

To evaluate the hypothesis above, we would need a lot of new training data that can be collected if my suggestions at the top of this post are incorporated by theorem-proving communities (remember, data is the modern gold). Even then, it may take years to get a large enough dataset. For now, I think somebody can already do a similar experiment to test parts of the hypothesis: get online editing services like Google Docs, and Microsoft Word online, to just add the "delete last word" action in the architecture and train LLMs using human editing histories (hopefully they preserved them) and see if it improves the qualities of essays LLMs write, e.g. reduce hallucination. I neither own the GPUs or data to do this experiment, nor do I know who to ask to get this experiment done, but maybe someone here is well-connected.

@Abhishek Anand there is this dataset that recorded low-level interaction with Coq for some volunteer users: https://github.com/uwplse/analytics-data

Some more training on Lean and Isabelle datasets:

https://arxiv.org/abs/2310.10631

That one also trains on Coq, but only on the raw Coq `.v` files IIRC, whereas it trains on proof-state/tactic pairs as well for Isabelle and Lean.

@Sean Welleck gave a talk recently on Llemma https://m.youtube.com/watch?v=bRHb-MVExJ4&pp=ygUfI21hbGxhYmh1bWluc3RpdHV0ZW9mdGVjaG5vbG9neQ%3D%3D

Last updated: Nov 29 2023 at 19:01 UTC