Hi folks, I asked ChatGPT to produce a group and a monoid in Coq; here's what it came back with XD:
Require Import mathcomp.ssreflect.ssreflect.
Set Implicit Arguments.
Unset Strict Implicit.
Unset Printing Implicit Defensive.
Structure Monoid : Type :=
{
carrier : Type;
op : carrier -> carrier -> carrier;
id : carrier;
associativity : associative op;
identity : left_id id op;
}.
Structure Group : Type :=
{
base : Monoid;
inv : carrier base -> carrier base;
inverse : forall x : carrier base, op (op base) (inv x) x = id base /\ op (op base) x (inv x) = id base;
}.
Coercion MonoidMixin (g : Group) : Monoid := g.(base).
Notation "1" := (id _).
Notation "x * y" := (op _ x y).
Not bad at all!
did you use "SSReflect" "MathComp" prompt engineering? Maybe even Hierarchy Builder...
Yes, I asked it to do it using math-comp style
First time I actually got a good suggestion for syntax out of it XD
Structure Monoid :=
{
carrier : Set;
op : carrier -> carrier -> carrier;
id : carrier;
associativity : forall x y z : carrier, op (op x y) z = op x (op y z);
identity : forall x : carrier, op id x = x /\ op x id = x
}.
Structure Group := Monoid & {
inv : carrier -> carrier;
inverse : forall x : carrier, op (inv x) x = id /\ op x (inv x) = id
}.
TypeScript-style XD
but it looks like something I would love to be able to write in Coq
seems weird that carrier is Set
?
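(Not quite the "&" syntax, but the closest thing in today's plain Coq is probably an inline coercion on the base field; the following is a rough sketch of my own, reusing the Monoid structure from the first snippet. In MathComp proper this is the kind of boilerplate Hierarchy Builder would generate for you.)
(* Rough sketch: declaring base as a coercion (:>) lets a Group be used
   wherever a Monoid is expected, which is roughly what the
   "Monoid & { ... }" syntax above is asking for. *)
Structure Group := {
  base :> Monoid;
  inv : carrier base -> carrier base;
  inverse : forall x : carrier base,
    op base (inv x) x = id base /\ op base x (inv x) = id base
}.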
Now writing some instances:
Definition IntAddMonoid : Monoid :=
{|
carrier := Z;
op := fun x y => x + y;
id := 0;
associativity := Zplus_assoc;
identity := Zplus_0_l;
|}.
Definition IntAddGroup : Group :=
{|
base := IntAddMonoid;
inv := fun x => -x;
inverse := fun x => (Zplus_opp_diag_l x, Zplus_opp_diag_r x);
|}.
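(For the record, these instances don't typecheck as written: among other things, the inverse field expects a conjunction rather than a pair, and ZArith isn't imported. Below is a plain-Coq reconstruction that should go through, using what I believe are the standard ZArith lemma names; it is my own sketch, not ChatGPT's output.)
(* Plain-Coq reconstruction, no mathcomp: the associativity/identity
   fields are spelled out instead of using ssrfun's associative/left_id. *)
From Coq Require Import ZArith.
Open Scope Z_scope.

Structure Monoid : Type := {
  carrier : Type;
  op : carrier -> carrier -> carrier;
  id : carrier;
  associativity : forall x y z : carrier, op x (op y z) = op (op x y) z;
  identity : forall x : carrier, op id x = x
}.

Structure Group : Type := {
  base : Monoid;
  inv : carrier base -> carrier base;
  inverse : forall x : carrier base,
    op base (inv x) x = id base /\ op base x (inv x) = id base
}.

Definition IntAddMonoid : Monoid :=
  {| carrier := Z;
     op := Z.add;
     id := 0;
     associativity := Z.add_assoc;
     identity := Z.add_0_l |}.

Definition IntAddGroup : Group :=
  {| base := IntAddMonoid;
     inv := Z.opp;
     (* left and right inverse laws, packaged as a conjunction *)
     inverse := fun (x : Z) => conj (Z.add_opp_diag_l x) (Z.add_opp_diag_r x) |}.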
Karl Palmskog said:
seems weird that carrier is Set
?
Yes, after all ChatGPT has no clue about what it is doing XD
But I'm adding ChatGPT support to the next coq-lsp release, we will see
Happy to add support for other people's systems by the way
isn't it better to hook into Copilot? nearly same model?
No idea @Karl Palmskog , happy to hook into whatever people think is fine
Most interesting for me is the button I just added "explain this proof"
It is not something meant to be serious, but to let people have fun and get ideas
I liked the "suggestion by tabbing" they did in the Copilot extension for VS Code. I really don't like services with signup, but this one was via GitHub, so one barrier less.
Yeah, def I don't like services with signup either
It seems to me ChatGPT is very good at following orders, but totally clueless about proofs
right, its training data has zero proof states
as for "following orders", I'm not sure the orders are exactly your orders as seen for example here
Interesting, thanks. So far whenever I have suggested it to take a correction on a proof, it has done more or less OK, but indeed it is just my very limited testing for a few weeks; that's why I want to add it to a wider interface
Asking it to explain the infinitude of primes proof in math-comp:
The statement Lemma prime_above m states that for every integer m, there exists a prime number p such that m < p.
The proof starts with the line have /pdivP[p pr_p p_dv_m1]: 1 < m! + 1. This uses the have tactical, which is used to introduce a new subgoal. The subgoal is to prove 1 < m! + 1. The /pdivP[p pr_p p_dv_m1] part says that if the subgoal can be proven, then there exists a prime number p that divides m! + 1, and the p is stored in the variable p, and the proposition that p is prime is stored in the variable pr_p, and the proposition that p divides m! + 1 is stored in the variable p_dv_m1.
The next line by rewrite addn1 ltnS fact_gt0 uses the by tactical to prove the subgoal 1 < m! + 1. The proof works by first adding 1 to m! and then applying the ltnS lemma, which says that n.+1 > n for any integer n. The lemma fact_gt0 states that the factorial of any positive integer is positive.
The next line exists p => //; says that there exists a prime number p such that m < p. The => // part says that the proof of this proposition is trivial, so it can be omitted.
The next line rewrite ltnNge; rewrites m < p to ~ (p <= m).
The next line apply: contraL p_dv_m1 => p_le_m applies the contraL lemma to the current goal, with the assumption p_dv_m1, to get a new goal that is the negation of the original goal. The => p_le_m part introduces a new hypothesis p_le_m, which states that p <= m.
The last line by rewrite dvdn_addr ?dvdn_fact ?prime_gt0 // gtnNdvd ?prime_gt1 finishes the proof. The dvdn_addr lemma states that if a divides b and c, then a divides b + c. The dvdn_fact lemma states that p divides m! if and only if p divides m. The prime_gt0 lemma states that any prime number is positive. The gtnNdvd lemma states that if p > n and p divides n, then p = n. The prime_gt1 lemma states that any prime number is greater than 1. The proof works by using these lemmas to conclude that if p <= m, then p cannot be a prime number.
Finally, the Qed command ends the proof.
The original proof for reference:
From Coq Require Import ssreflect ssrfun ssrbool.
From mathcomp Require Import eqtype ssrnat div prime.
(* A nice proof of the infinitude of primes, by Georges Gonthier *)
Lemma prime_above m : {p | m < p & prime p}.
Proof.
have /pdivP[p pr_p p_dv_m1]: 1 < m`! + 1
by rewrite addn1 ltnS fact_gt0.
exists p => //; rewrite ltnNge; apply: contraL p_dv_m1 => p_le_m.
by rewrite dvdn_addr ?dvdn_fact ?prime_gt0 // gtnNdvd ?prime_gt1.
Qed.
urgh, that's like a reading of the SSReflect lines with Check lookups
I guess for people that don't know ssr syntax it could be helpful?
Oh but indeed I've just realized in coq-lsp I have tons of dynamic info at hand, so I can generate a fairly interesting prompt
Could be a paper there XD
ah, but this is the problem, how do you do transparent/reproducible evals of nontrivial ChatGPT applications unless you have an OpenAI co-author?
I guess you get an OpenAI co-author?
you can of course do experiments like this, but this is different from low-level tool evals.
For me ChatGPT helps because I can use it to test an interaction model of writing Coq proofs that I'd like to support well, independently of the engine. And of course, it can be useful to users.
based on the state of this repo, not sure they have many theorem proving people left...
Karl Palmskog said:
based on the state of this repo, not sure they have many theorem proving people left...
Indeed I'm not sure either; some key people in that area left OpenAI
this one also seems abandoned: https://github.com/openai/lean-gym (but probably there are living forks)
Tho the possibility to do studies exists, as coq-lsp can almost chat to ChatGPT and can almost run in github.dev / vscode.dev
so certainly you only need to recruit users, and have them clone a repo.
if all you aim for is user-study success, sure. But this limits scope quite a bit.
I'm not sure what OpenAI is doing, however I'm not sure either that the state of these repos means a lot. If they are working on some new stuff you will see it when it is released.
The explanation is pretty good, but I guess ChatGPT got contraL and ltnS wrong, and I think the end is proving that p divides m!, so it can't divide m! + 1?
And I wonder if the explanation overlaps with that on https://jscoq.github.io/wa/, which Google seems to rank pretty high whenever I google these lemmas... OTOH, I don't actually see much in common?
these state-of-the-art SSR proofs are one thing, but I'd like to see ChatGPT make any sense whatsoever out of Coq 8.4 era semi-automated proofs... those are pretty accurate to describe as write-only
but I guess this is the main rift between Isabelle users and Coq users: "proofs must be understandable" (math proofs) vs. "proof scripts are doing tedious stuff and there's often not much value in looking at them" (PL proofs)
Karl Palmskog said:
these state-of-the-art SSR proofs are one thing, but I'd like to see ChatGPT make any sense whatsoever out of Coq 8.4 era semi-automated proofs... those are pretty accurate to describe as write-only
Indeed, here something like coq-lsp can query the intermediate goal state and put that in the prompt ;)
Emilio Jesús Gallego Arias said:
Most interesting for me is the button I just added "explain this proof"
@Emilio Jesús Gallego Arias
Could we have a button "explain this error message"?
Skimming the ML part of this Zulip, it looks like there are some projects trying to use the public GPT API to prove mathematical theorems or suggest next steps. Is there anyone working on a public project to do a deeper integration of proof assistants with LLMs, i.e. looking for customized ways to train LLMs for this use case? It seems the OpenAI co-founder himself has a lot of interest in proof assistants, especially improving proof automation for code verification, and he seems to understand FM in good depth: see timestamp 25:30-30:00 of this video https://twitter.com/i/broadcasts/1lPKqbEWzaMGb
https://github.com/lupantech/dl4math
As for papers using the GPT API this paper just dropped: https://arxiv.org/abs/2310.04353
It uses GPT-4 and compares against ProverBot9001 on Coq (Compcert). It also compares against ReProver on Lean (MiniF2F). There is an extensive discussion on the Lean Zulip here: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/COPRA
It makes me a bit more optimistic about the future of AI when I see papers on integrating these chatbots with external databases.
If you want the tool to be able to work with humans in the sense of integrating new theorems and definitions as the human proves them, it needs to be able to learn the meaning of new symbols without expensive training.
I wouldn't want to use a tool if it wasn't able to prove theorems using symbols that I defined.
And it's necessary for the reinforcement learning problem of learning a whole library by advancing through the theorems step by step, I think, which is one of the most interesting and difficult problems. But to do this you need some kind of fine grained control over what the machine is allowed to know (things already proven) and what it's not allowed to use (things later in the data set that depend on this). There are probably simpler hacks to do this, I don't know, but intuitively I would be suspicious of any kind of "reinforcement learning" paper that trained on the whole set of theorems in some way, so that the advanced theorems were already known to the machine, even if they blocked the machine from using these results. A more honest approach imo is that the machine is only allowed to learn symbols as they are introduced and it has access to an external database at any time that it can use to look up the meanings of symbols.
@Patrick Nicodemus I generally agree with you, but there is also a possible application of reinforcement learning where an AI tries and learns from proofs of theorems in any order. That is in the development of a big project. Many projects (e.g. Flyspeck, Liquid Tensor Experiment, Sphere Inversion, etc) first outline the theorems and lemmas to be proved and then start filling in the proofs. A reinforcement AI agent could (say every night) attempt to prove the remaining unproved lemmas (as well as reprove existing lemmas) and learn from any successful proofs.
This is a very similar setup to many RL for theorem-proving papers (e.g. "Learning to Reason in Large Theories without Imitation" and "Reinforcement Learning of Theorem Proving").
@Jason Rute I haven't read the paper yet, but do they explain why they restrict to CompCert? Only because it's the benchmark that ProverBot was using? IIUC Coq should be an attractive platform for LLMs because it has the largest library of formal proofs, but apparently no one is using that.
from what I recall, the original ProverBot9001 paper was indeed restricted to CompCert for evaluation, so I guess they limit to where a comparison is possible
@Bas Spitters Yes, the ProverBot9001 paper only tests on CompCert. There is another unpublished evaluation of ProverBot9001 on the same packages used in the CoqGym benchmark. cc @Alex Sanchez-Stern
as to "largest library of formal proofs": we have on one hand the Coq opam repo and on the other hand Platform. But:
Unfortunately I don't see this situation changing any time soon. What one can do is curate "extensions" of the Platform releases for machine learning, but this needs to be done continually.
we have a bunch of packages pending to be added to the Platform, but since they are "general utility", probably not enough to attract machine learning interest
Hmm what about Coq-community? Would that be a good benchmark set?
quite a few Platform packages are hosted in Coq-community (see :rooster: symbol on website). While some of the remaining packages are high profile (:star: symbol) and would fit in a machine learning benchmark, we also have experimental packages that probably should not be used, and may not be compatible with latest Coq version.
so the best approach would probably still be: take Coq Platform as base, and manually curate additional packages from the Coq opam archive (the highest quality Coq-community ones such as fourcolor are already there)
Ah, for some reason I thought that new releases of Coq were tested against coq-community to make sure those are always compatible.
a subset of the projects in Coq-community are part of Coq's CI and are tested in every commit to Coq, which is yet another distinct set of projects (not all of them packaged in the Coq opam repo): https://github.com/search?q=topic%3Acoq-ci+org%3Acoq-community&type=Repositories
Huh, that link doesn't work for me, it just shows empty search results
I changed the link, try again now
Ah yeah that works. 13 projects from Coq-community that are used in CI, good to know. That might be a useful benchmark set for one of my projects at some point.
See also https://github.com/coq-community/manifesto/#position-in-the-coq-ecosystem for the difference between coq-community, Coq Platform, and Coq's CI projects.
Thank you for all your feedback. Please feel free to ask any questions related to our work https://arxiv.org/abs/2310.04353
Bas Spitters said:
Jason Rute I haven't read the paper yet, but do they explain why they restrict to CompCert? Only because it's the benchmark that ProverBot was using? IIUC Coq should be an attractive platform for LLMs because it has the largest library of formal proofs, but apparently no one is using that.
We restrict to CompCert because we had a comparison available from Proverbot. Other than that, we have monthly quota restrictions from OpenAI on the amount of usage, and GPT-4 is relatively costly. So we couldn't run on other benchmarks.
@Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this month's budget left :-)
@Amitayush Thakur Where did you get the data to do the comparison with ProverBot on Compcert? (For example, the numbers and theorem lists needed to make your plots.) Did you rerun the ProverBot benchmark, or get the data from the ProverBot authors? I'd be interested in accessing this data? (I'd also be interested in the data for ASTactic, DIVA, TacTok, and Passport as well, but I think at least the ASTactic data is just lost forever.)
Bas Spitters said:
Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this month's budget left :-)
It is not hard to run; our tool is open and anyone can run it given they have an OpenAI account with access to GPT-4 (https://github.com/trishullab/copra). Some restrictions exist because we use SerAPI and CoqSerAPI, which don't have support beyond Coq 8.15
Jason Rute said:
Amitayush Thakur Where did you get the data to do the comparison with ProverBot on Compcert? (For example, the numbers and theorem lists needed to make your plots.) Did you rerun the ProverBot benchmark, or get the data from the ProverBot authors? I'd be interested in accessing this data? (I'd also be interested in the data for ASTactic, DIVA, TacTok, and Passport as well, but I think at least the ASTactic data is just lost forever.)
One can run Proverbot9001 from their GitHub repository. However, thanks to @Alex Sanchez-Stern (author of Proverbot) who shared the results from the historical runs of Proverbot, we didn't have to re-run Proverbot. The results shown in the Proverbot report are very detailed (https://proverbot9001.ucsd.edu/reports/2019-11-20T18d38d02-0700+cd762eb9e7e6e44153bd766654727a36a3dcad0b/report.html)
Amitayush Thakur said:
Bas Spitters said:
Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this month's budget left :-)
It is not hard to run; our tool is open and anyone can run it given they have an OpenAI account with access to GPT-4 (https://github.com/trishullab/copra). Some restrictions exist because we use SerAPI and CoqSerAPI, which don't have support beyond Coq 8.15
@Emilio Jesús Gallego Arias Any insights on this. I know you've been thinking about how Coq should interface with LLMs.
Hi @Bas Spitters , thanks for the ping!
I think porting CoqSerAPI to newer Coq versions should be very easy, if not automatic; indeed I understand that they are on 8.15 for practical reasons, moving to 8.16 doesn't add a lot of value for them I think, but Alex is the authority on CoqSerAPI. Alex gave me very nice and detailed feedback on the improvements they'd need, so that's pending.
Regarding tooling, indeed the tooling is in a poor state, especially when compared to other systems. I'm unsure how to improve the tooling; on the technical side it is very clear to me what to do, but I feel there is, for example, large developer "cultural" resistance to unifying Coq developments in a more uniform way so they can be consumed as a dataset, to using more advanced tools in the core of the system (say PPX), etc... Moreover, as far as I know there is no project-wide AI strategy; I tried to do a working group some time ago but it didn't pick up, and my student working on the topic left for a startup (crazy turnover lol).
Depending on the kind of interaction you want to do, coq-lsp could make your life very easy, at the cost of writing a plugin, happy to help if there is a more concrete problem.
@Lasse Blaauwbroek also has very nice tooling in tactician, I'm unsure how many hacks tactician is using now, but in principle I took all of Lasse's feedback into account so hopefully he'd have a nicer time.
The main advantage that coq-lsp provides over regular Coq is that you have access to full theories and full documents.
so you can, for example, write a theory-level plugin by registering to listen to certain events.
The engine can also do a lot more, etc... for example we recently implemented speculative execution. Still IMVHO very large gains can be made in performance on the Coq side.
But IMHO tooling for LLMs in Coq remains a huge obstacle, and so far I'm very unsure I will devote any more resources to it.
Oh, I was confused by terminology. By CoqSerAPI, you mean the Proverbot9001 Coq-interfacing Python library coq_serapy, it sounds like. As distinct from the opam package coq-serapi, which is the OCaml and command-line sexp interface that Emilio developed. Newer versions of coq_serapy support up through 8.18 I think, so if that's your bottleneck you should be able to update (though it looks like you included the source of an older version in your repo, so it might take some doing).
I checked out the lean Zulip discussion and had some comments there. For those who don't want to go over there to see, it seems like there are a few threats to validity that the authors will include in their next version (we've been corresponding over email):
@Alex Sanchez-Stern Is there a plan to publish the newer ProverBot9001 methodologies and results in a paper? Is ProverBot9001 still being improved and developed? Is there any plan to make it something Coq users can use?
I don't think we're planning on publishing the improvements to base proverbot as they're mostly engineering improvements, which are hard to publish. We're currently building on Proverbot9001 for a few projects at UMass Amherst, UCSD, and UCLA, and as we do that we're constantly making improvements to the base model. In terms of usability, it's open source and works on the latest version of Coq, the two usability issues I see are (1) it doesn't have a plugin interface that can suggest proofs directly in the IDE, you have to run it externally on your project from the command line (it can generate HTML or .v formatted solutions), and (2) setup can be tricky as with any research code, though at this point 10-20 people I know have set it up.
Alex Sanchez-Stern said:
Oh, I was confused by terminology. By CoqSerAPI, you mean the Proverbot9001 Coq-interfacing Python library coq_serapy, it sounds like. As distinct from the opam package coq-serapi, which is the OCaml and command-line sexp interface that Emilio developed. Newer versions of coq_serapy support up through 8.18 I think, so if that's your bottleneck you should be able to update (though it looks like you included the source of an older version in your repo, so it might take some doing).
@Amitayush Thakur does that solve your issue?
Emilio Jesús Gallego Arias said:
Hi Bas Spitters , thanks for the ping!
I think porting CoqSerAPI to newer Coq versions should be very easy, if not automatic; indeed I understand that they are on 8.15 for practical reasons, moving to 8.16 doesn't add a lot of value for them I think, but Alex is the authority on CoqSerAPI. Alex gave me very nice and detailed feedback on the improvements they'd need, so that's pending.
Regarding tooling, indeed the tooling is in a poor state, especially when compared to other systems. I'm unsure how to improve the tooling; on the technical side it is very clear to me what to do, but I feel there is, for example, large developer "cultural" resistance to unifying Coq developments in a more uniform way so they can be consumed as a dataset, to using more advanced tools in the core of the system (say PPX), etc... Moreover, as far as I know there is no project-wide AI strategy; I tried to do a working group some time ago but it didn't pick up, and my student working on the topic left for a startup (crazy turnover lol).
Depending on the kind of interaction you want to do, coq-lsp could make your life very easy, at the cost of writing a plugin, happy to help if there is a more concrete problem.
Thanks @Emilio Jesús Gallego Arias. Do you have a write-up/roadmap of what needs to be done?
I'd imagine there's at least some interest from Coq (library) developers.
@Alex Sanchez-Stern If there's sufficient progress that the community should know about, but not a full paper, you could consider submitting it to https://popl24.sigplan.org/home/CoqPL-2024#Call-for-Presentations
Machine Learning ITP dataset (Lean and Agda) https://arxiv.org/abs/2310.16005 by @Andrej Bauer
Unfortunately, Coq is missing.
I provide some commentary on that dataset over at the Machine Learning channel of the Lean zulip. I don't think that data is necessarily that fancy. I think (but I could be mistaken) it is just the terms for each declaration (theorem, definition, axiom, etc) along with some data for recovering dependencies. If so, this information is readily available in a few places in Lean such as the olean files and by querying the Lean environment inside Lean. I assume you can similarly get it for Coq (although Coq has the challenge of 100s of different Coq packages). One could also look to existing datasets like CoqGym for possibly similar data.
I'm not the most impressed with their experiments from what I've seen. It seems they are using quite old techniques, and it isn't immediately clear how applicable their results are to real concerns in machine learning theorem proving. Their link prediction task seems to be a form of premise selection. Nonetheless, I do think it is time we revisit premise selection. Premise selection and argument selection are key to theorem proving, and most papers have something like a premise selection tool inside them. (For example, MagnusHammer is a premise selection tool that largely replaces the external ATP call in Sledgehammer, from what it seems.) It would be really interesting to try to compare many different premise selection methods from across the literature. (This is easier said than done, of course.) There was once a premise selection dataset called HOLStep, but people don't use it anymore, and none of the modern premise selection approaches are represented in the HOLStep results. Maybe it is time to revive the topic.
if their main thing is sexp-serialized terms for declarations, then we collected a bunch of those for MathComp and its ecosystem when we did learning to suggest lemma names: https://github.com/EngineeringSoftware/math-comp-corpus
I'm sure the same dataset could be refreshed for recent Coq+MathComp, most projects are now in Coq Platform (and so are the tools for generating sexps)
TLDR: In my opinion, we need to collect very different datasets: the datasets should record how humans actually develop Coq proofs and programs rather than the polished end results. We need to record our struggles with the Coq typechecker and the queries (e.g. Search) we invoke to find the right stuff. I will get to the "why" later, but first, I have 2 suggestions for getting such datasets and they may appeal differently to people with varying levels of privacy concerns.
1) Request the Coq community to not delete the query commands they use to produce their proofs. They can remain in the MR/PR (merge/pull request) until the MR/PR is ready to merge and some git protocol should be followed to ensure that the unclean version with all the queries remains in the history (there are many ways to do this). Since my ML research days (decade ago), I have appreciated how much data is gold. Sadly, most of my colleagues in the theorem-proving community rewrite their git history (not just the final version) to look very clean and thus lose information.
2) Instrument Coq IDEs to record the process of coming up with the right Coq function/proof. While writing a Coq definition/Fixpoint/proofs, this should include not just the final version but also the versions that had typechecker/tactic errors, the error messages, and the edits the IDE users did to fix them. It should also include the queries (Search/About/Check/Print...) the humans did during the process. I know many will have privacy concerns. But at least I would not have minded as long as I can review the final data before it is submitted for training AI models. Some employers may have concerns about leaking secret sauces but I guess most of the academia will not. Those employers may also collect the data anyway but not release it until they decide to open-source the proofs: it would anyway be hard in the long run to sell a proven product and not release the proofs.
WHY?
I guess nobody knows why modern neural networks work so well but it seems most of the breakthrough ideas in the field have come from thinking about and mimicking how humans (brains) work: from the inception to ideas like attention heads. So I think it is worthwhile thinking about how humans write Coq proofs/functions or even essays/papers. We never come up with the final version right away: it is often a long back and forth. To me it seems odd that the current LLM architectures do not have "backspace/left-arrow/delete" in their keyboard: they can only predict the next word even when they seem to already know that they made a mistake. Imagine humans writing a Coq program or an essay in one shot when your keyboard has no backspace/left-arrow/delete buttons: I would be extremely bad at it. For Coq programs/proofs, imagine writing them without being able to issue queries or look at typechecker errors.
A prompting loop can address some limitations but doesn't seem to go very far. For example, a few weeks ago, I tried a manual prompt loop where I fed error messages and tried to get ChatGPT4 to write a correct vec.append function recursive on the second argument, something that human beginners in Coq struggle with. It was able to correct some errors but did not get it fully right. I don't think that just feeding it the errors would have sufficed as it wasn't making progress. Then I asked ChatGPT to come up with a Search query for the thing it was struggling with. It came up with the right query and used the result I gave it to finally write the correct vec.append. The full interaction can be found here (my initial prompt was imprecise and it took me 3 prompts to precisely specify what I was asking ChatGPT to do) https://chat.openai.com/share/63cb977c-6022-4ee0-8618-8c6d04cb56b4
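(To make the exercise concrete, here is a sketch of my own, not from the chat: one way to write an append on length-indexed vectors that recurses on the second argument. The length index changes from S (m + k) to m + S k at each step, so explicit casts are needed, which is exactly the part that trips up both beginners and ChatGPT.)
From Coq Require Import Vectors.Vector.

(* Transport a vector along an equality between length indices. *)
Definition vcast {A : Type} {m n : nat} (e : m = n) (v : Vector.t A m) : Vector.t A n :=
  match e in _ = k return Vector.t A k with eq_refl => v end.

(* Append by recursion on the SECOND argument. *)
Fixpoint vapp2 {A : Type} {m : nat} (v : Vector.t A m)
               {n : nat} (w : Vector.t A n) {struct w} : Vector.t A (m + n) :=
  match w in Vector.t _ k return Vector.t A (m + k) with
  | Vector.nil _ =>
      (* m + 0 is not definitionally m, so cast with plus_n_O : n = n + 0 *)
      vcast (plus_n_O m) v
  | Vector.cons _ h k t =>
      (* move h to the end of v, recurse on the tail, then fix up the index
         with plus_n_Sm : S (n + m) = n + S m *)
      vcast (plus_n_Sm m k) (vapp2 (Vector.shiftin h v) t)
  end.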
Later, I tried and failed to get ChatGPT to ask for query commands on its own when needed, rather than me reminding it at the right times in a prompt loop: https://chat.openai.com/share/75178430-7f60-41bd-9e92-0f56c31e36df
I don't think such prompting tricks will go far. I am not a neural network expert, but I think the architecture should be changed a bit to give ChatGPT more power to correct its mistakes and to ask for external help. Currently, the architecture allows only one action: predict the next word(let). It should be allowed to do the following action as well: delete the previous word. Moreover, in the context of Coq/Lean, it should also be allowed to choose the next step to be: evaluate a query (Check/Search/Print..) and use the result
To evaluate the hypothesis above, we would need a lot of new training data that can be collected if my suggestions at the top of this post are incorporated by theorem-proving communities (remember, data is the modern gold). Even then, it may take years to get a large enough dataset. For now, I think somebody can already do a similar experiment to test parts of the hypothesis: get online editing services like Google Docs and Microsoft Word online to just add the "delete last word" action in the architecture and train LLMs using human editing histories (hopefully they preserved them) and see if it improves the quality of the essays LLMs write, e.g. reduces hallucination. I neither own the GPUs nor the data to do this experiment, nor do I know who to ask to get this experiment done, but maybe someone here is well-connected.
@Abhishek Anand there is this dataset that recorded low-level interaction with Coq for some volunteer users: https://github.com/uwplse/analytics-data
Some more training on Lean and Isabelle datasets:
https://arxiv.org/abs/2310.10631
That one also trains on Coq, but only on the raw Coq .v files IIRC, whereas it trains on proof-state/tactic pairs as well for Isabelle and Lean.
@Sean Welleck gave a talk recently on Llemma https://m.youtube.com/watch?v=bRHb-MVExJ4&pp=ygUfI21hbGxhYmh1bWluc3RpdHV0ZW9mdGVjaG5vbG9neQ%3D%3D
has anyone used OpenAI's fine-tuning API to make ChatGPT better at problems related to Coq proving/programming? I am thinking about using fine tuning to teach it to do queries instead of hallucinating, but still uncertain about how to go about it.
@Abhishek Anand the instructions here do not work for me. Perhaps it needs a paid version
https://openai.com/blog/introducing-gpts
I think it'd be nicer for the community if someone did fine-tuning for Coq on top of Facebook's LLaMA(2) models, then at least researchers could run the model directly
Karl Palmskog said:
I think it'd be nicer for the community if someone did fine-tuning for Coq on top of Facebook's LLaMA(2) models, then at least researchers could run the model directly
The Llemma model?
in that direction, sure, but I believe Llemma was tuned on lots of stuff that is unlikely to help solving Coq-specific tasks ("a mixture of scientific papers, web data containing mathematics, and mathematical code")
as I think we established in the Coq community survey, a large chunk of Coq (theory, practice) overlaps with programming rather than pure math
It was built on top of Code-LLaMa so there is also a lot of programming code underneath, and it was fine-tuned on Coq files (among other ITP files). Of course, if Coq is the main goal, there might be a better Coq-specific training dataset, which could even depend on the specific Coq task. But at some point, the cost of fine-tuning for each specific task is too much. I don't know what you all have in mind, but it might be worth trying Llemma first before building a Coq-specific LLaMA model.
For reference, training Llemma 7B was 23k A100-hours. Training Llemma 34B was 47k A100-hours. Of course, there may be more efficient methods like LoRA, etc. to fine-tune a model.
Bas Spitters said:
Abhishek Anand the instructions here do not work for me. Perhaps it needs a paid version
https://openai.com/blog/introducing-gpts
Sorry for not being precise earlier: I meant this API https://platform.openai.com/docs/guides/fine-tuning
(custom GPTs seem to be a different thing https://community.openai.com/t/custom-gpts-vs-fine-tuning-whats-the-difference/477738/4)
Thanks Karl and Jason for the suggestions about open-source models.
Bas Spitters said:
Thanks Emilio Jesús Gallego Arias. Do you have a write-up/roadmap of what needs to be done?
I'd imagine there's at least some interest from Coq (library) developers.
I have plenty of notes, and I tried to do a machine learning working group (https://github.com/coq/coq/wiki/Coq-Topic-Working-Group-Document-Data-and-Machine-Learning-Platform) and we had a couple of meetings.
But in the end, coordination was too hard, and I am very short on time. But it'd be nice to maybe try to resurrect the working group?
Abhishek Anand said:
has anyone used OpenAI's fine-tuning API to make ChatGPT better at problems related to Coq proving/programming? I am thinking about using fine tuning to teach it to do queries instead of hallucinating, but still uncertain about how to go about it.
one interesting (IMO) experiment that should be easy to do is to fine tune ChatGPT or other models on the data collected by the REPLICA work (https://github.com/uwplse/analytics-data).
A custom Llama model for autoformalization of Lean4 statements. It does better on the MiniF2F benchmark, which does not include Coq. They seem to be unaware of the work by @Jason Rute and @Lasse Blaauwbroek
https://arxiv.org/pdf/2407.03203
Thanks Bas. This paper, while similar to our work in its goal (prove theorems automatically), takes a very different path using LLMs to generate the whole proof in one go (with 128 attempts). In that regard, it is very similar to the DeepSeek-Prover paper, both in the whole proof generation and in getting better data through auto-formalization. (DeepSeek-Prover does much better, but I think they also did more training and used more computation in the end.) I still need to read more about this paper, especially the part about Natural Language Guidance. I can't tell if it is similar to the natural language guidance in Draft-Sketch-Prove and its successors. If I read more of the paper, I might summarize it later on the Lean Zulip (in the "Machine Learning for Theorem Proving" stream).
There is a nice discussion about the DeepSeek-Prover paper on the Lean Zulip at https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/DeepSeek-Prover .
Also, as you said, since MiniF2F isn't in Coq and I think both of these papers are targeting competition-style math problems, rather than real-world theorem proving, it is hard to directly compare their work to ours. That is the Achilles heel of this field.
Thanks @Jason Rute. Is my recollection correct that informalization/autoformalization of Coq proofs is still mostly open? I had a quick peek here https://github.com/lupantech/dl4math, but that project seems to have stalled.
I'm not sure if anyone is working on it, but I also think the ideas would be the exact same as in Lean or Isabelle. The challenge would be just to extend it to Coq.
There might be an advantage of having a tactic language closer to NL (like Isabelle and Lean). We had Czar in Coq, but it was discontinued.
I guess we'll have to wait and see...
I'm not sure if anyone is yet using auto-formalization to do actual real-world formalization in Lean, Coq, or Isabelle. It seems it is mostly used to test AI systems, construct training data for other AI systems, and convert informal proof sketches into formal proof sketches which can be filled in with automation. Autoformalization of definitions and new concepts would be very different from auto-formalization of theorem statements with standard mathematical objects. We only really have a grasp on the latter right now.
I just found an undergraduate thesis about using LLMs in Coq, including a form of proof repair that improves results. https://dash.harvard.edu/handle/1/37379274 . (The author doesn't appear to be on this Zulip, but their advisor @Nada Amin appears to be.) They appear to test on software foundations. Their GPT-4 model with "lemma extraction" can solve 79.42% of theorems while CoqHammer (cvc4 backend with 20s timeout) solves 78.73%. Good work for an undergrad thesis! Their "lemma extraction" algorithm in section 4.2.1 appears to be novel and it seems to be a sophisticated take on how to repair a bad proof from a language model. Namely, you find the place where the error occurs, and try to get the language model to generate a proof for that goal (converting it into a theorem). So it seems like it is a form of tree search, but taking big steps, only branching on error states, which I think is a good way to make tree search more amenable to LLMs. It's much more sophisticated than what several other papers do where they just generate 128 proofs independently. (I think their "lemma extraction" method is inspired by the proof-repair component of Baldur by @Emily First and @Talia Ringer, which is itself a sophisticated take on whole proof generation with LLMs and proof repair.)
I'm not sure I fully understand their GPT-4 baseline. Do they generate one proof? Many proofs? Because if they are comparing GPT-4 with just one call to GPT-4 vs their "lemma extraction" method which has many calls to GPT-4, it might not be a fair comparison. I'd like to see at least a similar amount of computing used for the baseline. (They might do this, I can't find the details.)
As for how it compares to other AI for Coq projects, like Graph2Tac, ProverBot9001, and Tactician, it is hard to know. I think testing on software foundations is quite limited. For one, it is popular, right? So there may be a lot of solutions already on the Internet in GPT-4's training data. And it seems like a fairly easy benchmark with CoqHammer already getting ~79%. I would love a comparison on a more interesting benchmark. (This is really the biggest problem in this field, namely a lack of good benchmarks, especially ones that haven't already been compromised by LLM pretraining.)
I wonder what a "Bachelor with Honors" thesis is - some sort of in-depth study after a bachelor's to get an orientation for a later master's study?
In some US colleges, to get honors with your bachelor's degree, you have to write a thesis. It is like a mini master's thesis. Some particularly good students do impressive original research.
I see - I thought a bachelor always requires a thesis (in Germany this is so afaik). Indeed impressive work!
This indeed looks nice. However, they only do Logical Foundations, not the whole Software Foundations series. It would be interesting to benchmark the other books too...
They don't compare to Tactician and Graph2Tac. I'm curious how well they do on LF.
A couple of years ago, one of my students implemented RoosterSpec, https://github.com/mikkelmilo/rooster-spec, a Coq implementation of QuickSpec. It is known that such intermediate lemmas help automation, e.g. for Vampire:
https://link.springer.com/chapter/10.1007/978-3-031-63498-7_13
Something similar might also work for LLMs.
There is also Moogle to search Lean libraries using natural language:
https://morph.so/blog/the-personal-ai-proof-engineer/
https://www.moogle.ai/
I wonder whether Coq has the datasets to allow training such a tool.
A master's student of mine just did a master's thesis about a search engine for Coq, and will talk about it at AITP. Not sure when the thesis will be available online though.
Great @Talia Ringer, is the engine available?
I just forwarded this to the student, hopefully he can chime in
Another undergrad thesis on LLMs for Coq, this one by Pierce Malony at Princeton (not in this Zulip): http://arks.princeton.edu/ark:/88435/dsp01w3763b12c I can’t tell if it is freely available anywhere. I’d be interested to learn more about this project (but also worry about benchmarking LLMs like GPT-4 on CoqGym since there is a lot of possibility of data leakage).
Oh, the paper is here: https://github.com/piercemaloney/senior-thesis/blob/main/written_final_report.pdf
There are a number of interesting ideas here, including putting a lot of context in the prompt, using A* by using GPT-4 as an evaluator to estimate the score of a given proof state, and more. As for the evaluations, while great for an undergraduate thesis, it is hard to know what to conclude here. The results look good, but it is hard to know how it compares to other AI for theorem proving, or even AI for Coq models:
For these reasons alone, I'm very reluctant to say with any certainty that this approach will generalize to new theorems. But at the same time, it might. Lately, several papers have shown that fine-tuned LLMs do fairly well at generating ITP proofs.
And here is another paper published in a journal: https://dl.acm.org/doi/abs/10.1145/3671016.3674823 (I’m not on my work computer so I can’t access it. I find it funny it calls COPRA “state of the art”. :smile: It also has the problem of testing LLM-based solutions on CompCert.)
Bas Spitters said:
There is also Moogle to search Lean libraries using natural language:
https://morph.so/blog/the-personal-ai-proof-engineer/
https://www.moogle.ai/
I wonder whether Coq has the datasets to allow training such a tool.
There are now two more semantic search projects for Lean that are a bit more open. You might be able to borrow what they are doing.
@Talia Ringer any news from your student?