Stream: Machine learning and automation

Topic: Chat GPT


view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:29):

Hi folks, I asked ChatGPT to produce a group and a monoid in Coq; this is what I got back XD:

Require Import mathcomp.ssreflect.ssreflect.

Set Implicit Arguments.
Unset Strict Implicit.
Unset Printing Implicit Defensive.

Structure Monoid : Type :=
  {
    carrier : Type;
    op : carrier -> carrier -> carrier;
    id : carrier;
    associativity : associative op;
    identity : left_id id op;
  }.

Structure Group : Type :=
  {
    base : Monoid;
    inv : carrier base -> carrier base;
    inverse : forall x : carrier base, op (op base) (inv x) x = id base /\ op (op base) x (inv x) = id base;
  }.

Coercion MonoidMixin (g : Group) : Monoid := g.(base).

Notation "1" := (id _).
Notation "x * y" := (op _ x y).

Not bad at all!
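
(One caveat, hedged: associative and left_id are defined in ssrfun, which that single ssreflect import probably does not bring in, so the snippet as given presumably needs the usual MathComp prelude instead:)

(* Hedged fix for the imports only; the rest of the snippet is left as ChatGPT wrote it. *)
From mathcomp Require Import ssreflect ssrfun ssrbool.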

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:30):

did you use "SSReflect" "MathComp" prompt engineering? Maybe even Hierarchy Builder...

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:30):

Yes, I asked it to do it using math-comp style

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:30):

First time I actually got what is a good suggestion for syntax XD

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:30):

Structure Monoid :=
  {
    carrier : Set;
    op : carrier -> carrier -> carrier;
    id : carrier;
    associativity : forall x y z : carrier, op (op x y) z = op x (op y z);
    identity : forall x : carrier, op id x = x /\ op x id = x
  }.

Structure Group := Monoid & {
    inv : carrier -> carrier;
    inverse : forall x : carrier, op (inv x) x = id /\ op x (inv x) = id
  }.

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:30):

Typescript-style XD

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:31):

but this looks like something I would love to write in Coq

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:31):

seems weird that carrier is Set?

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:31):

Now writing some instances:

Definition IntAddMonoid : Monoid :=
  {|
    carrier := Z;
    op := fun x y => x + y;
    id := 0;
    associativity := Zplus_assoc;
    identity := Zplus_0_l;
  |}.

Definition IntAddGroup : Group :=
  {|
    base := IntAddMonoid;
    inv := fun x => -x;
    inverse := fun x => (Zplus_opp_diag_l x, Zplus_opp_diag_r x);
  |}.
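
(A hedged side note: the inverse field above is stated with /\, and proofs of a conjunction are built with conj, not with pair syntax, so the tuple above presumably wouldn't typecheck; the lemma names also may not exist under those spellings. A minimal self-contained illustration of the intended proof term, assuming the standard ZArith names Z.add_opp_diag_l and Z.add_opp_diag_r:)

Require Import ZArith.
Open Scope Z_scope.

(* A /\ B lives in Prop and is built with conj; (x, y) builds a pair in a product type. *)
Definition opp_cancels (x : Z) : - x + x = 0 /\ x + - x = 0 :=
  conj (Z.add_opp_diag_l x) (Z.add_opp_diag_r x).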

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:32):

Karl Palmskog said:

seems weird that carrier is Set?

Yes, after all ChatGPT has no clue about what it is doing XD

But I'm adding ChatGPT support to the next coq-lsp release; we will see

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:32):

Happy to add support for other people's systems by the way

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:32):

isn't it better to hook into Copilot? nearly same model?

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:33):

No idea @Karl Palmskog, happy to hook into whatever people think is fine

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:33):

Most interesting for me is the "explain this proof" button I just added

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:33):

It is not something meant to be serious, but to let people have fun and get ideas

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:33):

I liked the "suggestion by tabbing" they did in the Copilot extension for VS Code. I really don't like services with signup, but this one was via GitHub, so one less barrier.

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:34):

Yeah, def I don't like services with signup either

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:34):

It seems to me ChatGPT is very good at following orders, but totally clueless about proofs

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:35):

right, its training data has zero proof states
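
(For illustration, a proof state is the mid-proof goal display that never appears in the final .v file; e.g., the second subgoal after intros n. induction n. when proving forall n : nat, n + 0 = n looks roughly like:)

  n : nat
  IHn : n + 0 = n
  ============================
  S n + 0 = S n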

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:36):

as for "following orders", I'm not sure the orders are exactly your orders as seen for example here

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:37):

Interesting, thanks. So far, whenever I have suggested a correction on a proof, it has done more or less OK. But indeed, that is just my very limited testing over a few weeks; that's why I want to add it to a wider interface

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:39):

Asking it to explain the infinitude of primes proof in math-comp:

The statement Lemma prime_above m states that for every integer m, there exists a prime number p such that m < p.

The proof starts with the line have /pdivP[p pr_p p_dv_m1]: 1 < m! + 1. This uses the have tactical, which is used to introduce a new subgoal. The subgoal is to prove 1 < m! + 1. The /pdivP[p pr_p p_dv_m1] part says that if the subgoal can be proven, then there exists a prime number p that divides m! + 1, and the p is stored in the variable p, and the proposition that p is prime is stored in the variable pr_p, and the proposition that p divides m! + 1 is stored in the variable p_dv_m1.

The next line by rewrite addn1 ltnS fact_gt0 uses the by tactical to prove the subgoal 1 < m! + 1. The proof works by first adding 1 to m! and then applying the ltnS lemma, which says that n.+1 > n for any integer n. The lemma fact_gt0 states that the factorial of any positive integer is positive.

The next line exists p => //; says that there exists a prime number p such that m < p. The => // part says that the proof of this proposition is trivial, so it can be omitted.

The next line rewrite ltnNge; rewrites m < p to ~ (p <= m).

The next line apply: contraL p_dv_m1 => p_le_m applies the contraL lemma to the current goal, with the assumption p_dv_m1, to get a new goal that is the negation of the original goal. The => p_le_m part introduces a new hypothesis p_le_m, which states that p <= m.

The last line by rewrite dvdn_addr ?dvdn_fact ?prime_gt0 // gtnNdvd ?prime_gt1 finishes the proof. The dvdn_addr lemma states that if a divides b and c, then a divides b + c. The dvdn_fact lemma states that p divides m! if and only if p divides m. The prime_gt0 lemma states that any prime number is positive. The gtnNdvd lemma states that if p > n and p divides n, then p = n. The prime_gt1 lemma states that any prime number is greater than 1. The proof works by using these lemmas to conclude that if p <= m, then p cannot be a prime number.

Finally, the Qed command ends the proof.

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:40):

The original proof for reference:

From Coq Require Import ssreflect ssrfun ssrbool.
From mathcomp Require Import eqtype ssrnat div prime.

(* A nice proof of the infinitude of primes, by Georges Gonthier *)
Lemma prime_above m : {p | m < p & prime p}.
Proof.
have /pdivP[p pr_p p_dv_m1]: 1 < m`! + 1
  by rewrite addn1 ltnS fact_gt0.
exists p => //; rewrite ltnNge; apply: contraL p_dv_m1 => p_le_m.
by rewrite dvdn_addr ?dvdn_fact ?prime_gt0 // gtnNdvd ?prime_gt1.
Qed.

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:40):

urgh, that's like a reading of the SSReflect lines with Check lookups

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:46):

I guess for people that don't know ssr syntax it could be helpful?

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:47):

Oh, but indeed I've just realized that in coq-lsp I have tons of dynamic info at hand, so I can generate a fairly interesting prompt

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:47):

Could be a paper there XD

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:51):

ah, but this is the problem: how do you do transparent/reproducible evals of nontrivial ChatGPT applications unless you have an OpenAI co-author?

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:54):

I guess you get an OpenAI co-author?

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:54):

you can of course do experiments like this, but this is different from low-level tool evals.

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:55):

For me ChatGPT helps because I can use it to test an interaction model of writing Coq proofs that I'd like to support well, independently of the engine. And of course, it can be useful to users.

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:56):

based on the state of this repo, not sure they have many theorem proving people left...

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:57):

Karl Palmskog said:

based on the state of this repo, not sure they have many theorem proving people left...

Indeed I'm not sure either; some key people in that area left OpenAI

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:58):

this one also seems abandoned: https://github.com/openai/lean-gym (but probably there are living forks)

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 13:59):

Though the possibility to do studies exists: coq-lsp can almost chat to ChatGPT and can almost run in github.dev / vscode.dev, so certainly you only need to recruit users and have them clone a repo.

view this post on Zulip Karl Palmskog (Feb 05 2023 at 13:59):

if all you aim for is user-study success, sure. But this limits the scope quite a bit.

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 14:00):

I'm not sure what OpenAI is doing; however, I'm not sure either that the states of these repos mean a lot. If they are working on some new stuff, you will see it when it is released.

view this post on Zulip Paolo Giarrusso (Feb 05 2023 at 15:16):

The explanation is pretty good, but I guess ChatGPT got contraL and ltnS wrong, and I think the end is proving that p divides m!, so it can't divide m! + 1?
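
(For reference, the actual statements, written from memory and hedged; easy to confirm with Check:)

From mathcomp Require Import all_ssreflect.

Check ltnS.    (* forall m n : nat, (m < n.+1) = (m <= n) -- a rewrite rule, not "n.+1 > n" *)
Check contraL. (* forall c b : bool, (c -> ~~ b) -> b -> ~~ c *)
Check gtnNdvd. (* forall n d : nat, 0 < n -> n < d -> (d %| n) = false *)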

view this post on Zulip Paolo Giarrusso (Feb 05 2023 at 15:17):

And I wonder if the explanation overlaps with that on https://jscoq.github.io/wa/, which Google seems to rank pretty high whenever I google these lemmas... OTOH, I don't actually see much in common?

view this post on Zulip Karl Palmskog (Feb 05 2023 at 15:37):

these state-of-the-art SSR proofs are one thing, but I'd like to see ChatGPT make any sense whatsoever out of Coq 8.4 era semi-automated proofs... those are pretty accurately described as write-only

view this post on Zulip Karl Palmskog (Feb 05 2023 at 15:51):

but I guess this is the main rift between Isabelle users and Coq users: "proofs must be understandable" (math proofs) vs. "proof scripts are doing tedious stuff and there's often not much value in looking at them" (PL proofs)

view this post on Zulip Emilio Jesús Gallego Arias (Feb 05 2023 at 21:09):

Karl Palmskog said:

these state-of-the-art SSR proofs are one thing, but I'd like to see ChatGPT make any sense whatsoever out of Coq 8.4 era semi-automated proofs... those are pretty accurately described as write-only

Indeed, here something like coq-lsp can query the intermediate goal state and put that in the prompt ;)

view this post on Zulip Bas Spitters (Jul 18 2023 at 06:30):

Emilio Jesús Gallego Arias said:

Most interesting for me is the "explain this proof" button I just added

@Emilio Jesús Gallego Arias
Could we have an "explain this error message" button?

view this post on Zulip Abhishek Anand (Oct 10 2023 at 00:11):

Skimming the ML part of this Zulip, it looks like there are some projects trying to use the public GPT API to prove mathematical theorems or suggest next steps. Is there anyone working on a public project to do a deeper integration of proof assistants with LLMs, i.e. looking for customized ways to train LLMs for this use case? The OpenAI co-founder himself seems to have a lot of interest in proof assistants, especially improving proof automation for code verification, and he seems to understand FM in good depth: see timestamp 25:30-30:00 of this video https://twitter.com/i/broadcasts/1lPKqbEWzaMGb

view this post on Zulip Bas Spitters (Oct 10 2023 at 09:30):

https://github.com/lupantech/dl4math

view this post on Zulip Jason Rute (Oct 11 2023 at 12:45):

As for papers using the GPT API, this one just dropped: https://arxiv.org/abs/2310.04353

It uses GPT-4 and compares against ProverBot9001 on Coq (CompCert). It also compares against ReProver on Lean (MiniF2F). There is an extensive discussion on the Lean Zulip here: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/COPRA

view this post on Zulip Patrick Nicodemus (Oct 11 2023 at 15:58):

It makes me a bit more optimistic about the future of AI when I see papers on integrating these chatbots with external databases.

view this post on Zulip Patrick Nicodemus (Oct 11 2023 at 16:01):

If you want the tool to be able to work with humans in the sense of integrating new theorems and definitions as the human proves them, it needs to be able to learn the meaning of new symbols without expensive training.
I wouldn't want to use a tool if it wasn't able to prove theorems using symbols that I defined.

view this post on Zulip Patrick Nicodemus (Oct 11 2023 at 16:06):

And it's necessary for the reinforcement learning problem of learning a whole library by advancing through the theorems step by step, I think, which is one of the most interesting and difficult problems. But to do this you need some kind of fine-grained control over what the machine is allowed to know (things already proven) and what it's not allowed to use (things later in the data set that depend on this). There are probably simpler hacks to do this, I don't know, but intuitively I would be suspicious of any kind of "reinforcement learning" paper that trained on the whole set of theorems in some way, so that the advanced theorems were already known to the machine, even if they blocked the machine from using these results. A more honest approach, imo, is that the machine is only allowed to learn symbols as they are introduced, and it has access to an external database at any time that it can use to look up the meanings of symbols.

view this post on Zulip Jason Rute (Oct 11 2023 at 22:56):

@Patrick Nicodemus I generally agree with you, but there is also a possible application of reinforcement learning where an AI tries and learns from proofs of theorems in any order: namely, in the development of a big project. Many projects (e.g. Flyspeck, Liquid Tensor Experiment, Sphere Inversion, etc.) first outline the theorems and lemmas to be proved and then start filling in the proofs. A reinforcement learning agent could (say, every night) attempt to prove the remaining unproved lemmas (as well as reprove existing lemmas) and learn from any successful proofs.

view this post on Zulip Jason Rute (Oct 11 2023 at 22:58):

This is a very similar setup to many RL for theorem-proving papers (e.g. "Learning to Reason in Large Theories without Imitation" and "Reinforcement Learning of Theorem Proving").

view this post on Zulip Bas Spitters (Oct 12 2023 at 08:53):

@Jason Rute I haven't read the paper yet, but do they explain why they restrict to CompCert? Only because it's the benchmark that ProverBot was using? IIUC Coq should be an attractive platform for LLMs because it has the largest library of formal proofs, but apparently no one is using that.

view this post on Zulip Karl Palmskog (Oct 12 2023 at 10:30):

from what I recall, the original ProverBot9001 paper was indeed restricted to CompCert for evaluation, so I guess they limit themselves to where a comparison is possible

view this post on Zulip Jason Rute (Oct 12 2023 at 11:14):

@Bas Spitters Yes, the ProverBot9001 paper only tests on CompCert. There is another unpublished evaluation of ProverBot9001 on the same packages used in the CoqGym benchmark. cc @Alex Sanchez-Stern

view this post on Zulip Karl Palmskog (Oct 12 2023 at 11:21):

as to "largest library of formal proofs": we have on one hand the Coq opam repo and on the other hand Platform. But:

Unfortunately I don't see this situation changing any time soon. What one can do is curate "extensions" of the Platform releases for machine learning, but this needs to be done continually.

view this post on Zulip Karl Palmskog (Oct 12 2023 at 11:24):

we have a bunch of packages pending to be added to the Platform, but since they are "general utility", probably not enough to attract machine learning interest

view this post on Zulip Alex Sanchez-Stern (Oct 12 2023 at 19:41):

Hmm, what about Coq-community? Would that be a good benchmark set?

view this post on Zulip Karl Palmskog (Oct 12 2023 at 19:48):

quite a few Platform packages are hosted in Coq-community (see the :rooster: symbol on the website). While some of the remaining packages are high profile (:star: symbol) and would fit in a machine learning benchmark, we also have experimental packages that probably should not be used and may not be compatible with the latest Coq version.

view this post on Zulip Karl Palmskog (Oct 12 2023 at 19:49):

so the best approach would probably still be: take the Coq Platform as a base, and manually curate additional packages from the Coq opam archive (the highest-quality Coq-community ones, such as fourcolor, are already there)

view this post on Zulip Alex Sanchez-Stern (Oct 12 2023 at 20:13):

Ah, for some reason I thought that new releases of Coq were tested against coq-community to make sure those are always compatible.

view this post on Zulip Karl Palmskog (Oct 12 2023 at 20:17):

a subset of the projects in Coq-community are part of Coq's CI and are tested in every commit to Coq, which is yet another distinct set of projects (not all of them packaged in the Coq opam repo): https://github.com/search?q=topic%3Acoq-ci+org%3Acoq-community&type=Repositories

view this post on Zulip Alex Sanchez-Stern (Oct 12 2023 at 20:23):

Huh, that link doesn't work for me, it just shows empty search results

view this post on Zulip Karl Palmskog (Oct 12 2023 at 20:25):

I changed the link, try again now

view this post on Zulip Alex Sanchez-Stern (Oct 12 2023 at 20:34):

Ah yeah, that works. 13 projects from Coq-community that are used in CI, good to know. That might be a useful benchmark set for one of my projects at some point.

view this post on Zulip Théo Zimmermann (Oct 13 2023 at 08:20):

See also https://github.com/coq-community/manifesto/#position-in-the-coq-ecosystem for the difference between coq-community, Coq Platform, and Coq's CI projects.

view this post on Zulip Amitayush Thakur (Oct 13 2023 at 13:23):

Thank you for all your feedback. Please feel free to ask any questions related to our work: https://arxiv.org/abs/2310.04353

view this post on Zulip Amitayush Thakur (Oct 13 2023 at 13:26):

Bas Spitters said:

Jason Rute I haven't read the paper yet, but do they explain why they restrict to CompCert? Only because it's the benchmark that ProverBot was using? IIUC Coq should be an attractive platform for LLMs because it has the largest library of formal proofs, but apparently no one is using that.

We restrict to CompCert because we had a comparison available from Proverbot. Other than that, we have monthly quota restrictions from OpenAI on the amount of usage, and GPT-4 is relatively costly, so we couldn't run on other benchmarks.

view this post on Zulip Bas Spitters (Oct 13 2023 at 13:39):

@Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this month's budget left :-)

view this post on Zulip Jason Rute (Oct 13 2023 at 13:50):

@Amitayush Thakur Where did you get the data to do the comparison with ProverBot on CompCert? (For example, the numbers and theorem lists needed to make your plots.) Did you rerun the ProverBot benchmark, or get the data from the ProverBot authors? I'd be interested in accessing this data. (I'd also be interested in the data for ASTactic, DIVA, TacTok, and Passport as well, but I think at least the ASTactic data is just lost forever.)

view this post on Zulip Amitayush Thakur (Oct 13 2023 at 14:20):

Bas Spitters said:

Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this month's budget left :-)

It is not hard to run; our tool is open and anyone can run it, given they have an OpenAI account with access to GPT-4 (https://github.com/trishullab/copra). Some restrictions exist because we use SerAPI and CoqSerAPI, which don't have support beyond Coq 8.15

view this post on Zulip Amitayush Thakur (Oct 13 2023 at 14:25):

Jason Rute said:

Amitayush Thakur Where did you get the data to do the comparison with ProverBot on CompCert? (For example, the numbers and theorem lists needed to make your plots.) Did you rerun the ProverBot benchmark, or get the data from the ProverBot authors? I'd be interested in accessing this data. (I'd also be interested in the data for ASTactic, DIVA, TacTok, and Passport as well, but I think at least the ASTactic data is just lost forever.)

One can run Proverbot9001 from their GitHub repository. However, thanks to @Alex Sanchez-Stern (the author of Proverbot), who shared the results from the historical runs of Proverbot, we didn't have to re-run it. The results shown in the Proverbot report are very detailed (https://proverbot9001.ucsd.edu/reports/2019-11-20T18d38d02-0700+cd762eb9e7e6e44153bd766654727a36a3dcad0b/report.html)

view this post on Zulip Bas Spitters (Oct 16 2023 at 09:08):

Amitayush Thakur said:

Bas Spitters said:

Amitayush Thakur Thanks! How hard is it to run on new benchmark suites? math-comp comes to mind. Maybe you still have a bit of this month's budget left :-)

It is not hard to run; our tool is open and anyone can run it, given they have an OpenAI account with access to GPT-4 (https://github.com/trishullab/copra). Some restrictions exist because we use SerAPI and CoqSerAPI, which don't have support beyond Coq 8.15

@Emilio Jesús Gallego Arias Any insights on this? I know you've been thinking about how Coq should interface with LLMs.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 17 2023 at 13:40):

Hi @Bas Spitters, thanks for the ping!

I think porting CoqSerAPI to newer Coq versions should be very easy, if not automatic; indeed, I understand that they are on 8.15 for practical reasons, and moving to 8.16 doesn't add a lot of value for them, I think, but Alex is the authority on CoqSerAPI. Alex gave me very nice and detailed feedback on the improvements they'd need, so that's pending.

Regarding tooling, indeed the tooling is in a poor state, especially when compared to other systems. I'm unsure how to improve the tooling; on the technical side it is very clear to me what to do, but I feel there is, for example, large developer "cultural" resistance to unifying Coq developments in a more uniform way so they can be consumed as a dataset, to using more advanced tools in the core of the system (say PPX), etc. Moreover, as far as I know there is no project-wide AI strategy; I tried to do a working group a while ago but it didn't pick up, and my student working on the topic left for a startup (crazy turnover lol).

Depending on the kind of interaction you want to do, coq-lsp could make your life very easy, at the cost of writing a plugin; happy to help if there is a more concrete problem.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 17 2023 at 13:42):

@Lasse Blaauwbroek also has very nice tooling in Tactician; I'm unsure how many hacks Tactician is using now, but in principle I took all of Lasse's feedback into account, so hopefully he'd have a nicer time.

The main advantage that coq-lsp provides over regular Coq is that you have access to full theories and full documents.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 17 2023 at 13:43):

so you can do, for example, a theory-level plugin, by registering to listen to certain events.

The engine can also do a lot more; for example, we recently implemented speculative execution. Still, IMVHO very large gains can be made in performance on the Coq side.

view this post on Zulip Emilio Jesús Gallego Arias (Oct 17 2023 at 13:43):

But IMHO tooling for LLMs in Coq remains a huge obstacle, and so far I'm very unsure I will devote any more resources to it.

view this post on Zulip Alex Sanchez-Stern (Oct 17 2023 at 17:36):

Oh, I was confused by terminology. By CoqSerAPI, it sounds like you mean coq_serapy, the Proverbot9001 Coq-interfacing Python library, as distinct from the opam package coq-serapi, which is the OCaml and command-line sexp interface that Emilio developed. Newer versions of coq_serapy support up through 8.18, I think, so if that's your bottleneck you should be able to update (though it looks like you included the source of an older version in your repo, so it might take some doing).

view this post on Zulip Alex Sanchez-Stern (Oct 17 2023 at 20:45):

I checked out the Lean Zulip discussion and had some comments there. For those who don't want to go over there to see, it seems like there are a few threats to validity that the authors will address in their next version (we've been corresponding over email):

  1. Test leakage - the GPT-4 model was trained on GitHub in 2021, including the CompCert and miniF2F datasets. While there doesn't seem to be evidence that the model has memorized any of the test proofs directly, more subtle forms of test leakage have not been discounted.
  2. Not accounting for cost of queries - The abstract claims that COPRA is "faster" than other tools it compares to, where "faster" means "using fewer queries". However, the cost of a query isn't constant across the tools, and in fact queries in Proverbot9001 appear to run 100-1000x faster on a CPU than the COPRA queries do using the OpenAI datacenter hardware.
  3. It isn't quite possible to measure pass@k-inferences on Proverbot9001, so they use a proxy for comparison, whose accuracy can vary.
  4. The comparison uses a version of Proverbot9001 from 2019 that performs significantly worse than modern Proverbot9001. The newest results are unpublished (well, they're published in a web page), so that's fair, it's just good to know that the Proverbot9001 approach can do a bit better, affecting the comparison.

view this post on Zulip Jason Rute (Oct 18 2023 at 16:17):

@Alex Sanchez-Stern Is there a plan to publish the newer ProverBot9001 methodologies and results in a paper? Is ProverBot9001 still being improved and developed? Is there any plan to make it something Coq users can use?

view this post on Zulip Alex Sanchez-Stern (Oct 18 2023 at 17:22):

I don't think we're planning on publishing the improvements to base Proverbot, as they're mostly engineering improvements, which are hard to publish. We're currently building on Proverbot9001 for a few projects at UMass Amherst, UCSD, and UCLA, and as we do that we're constantly making improvements to the base model. In terms of usability, it's open source and works on the latest version of Coq. The two usability issues I see are (1) it doesn't have a plugin interface that can suggest proofs directly in the IDE (you have to run it externally on your project from the command line; it can generate HTML or .v formatted solutions), and (2) setup can be tricky, as with any research code, though at this point 10-20 people I know have set it up.

view this post on Zulip Bas Spitters (Oct 18 2023 at 18:26):

Alex Sanchez-Stern said:

Oh, I was confused by terminology. By CoqSerAPI, it sounds like you mean coq_serapy, the Proverbot9001 Coq-interfacing Python library, as distinct from the opam package coq-serapi, which is the OCaml and command-line sexp interface that Emilio developed. Newer versions of coq_serapy support up through 8.18, I think, so if that's your bottleneck you should be able to update (though it looks like you included the source of an older version in your repo, so it might take some doing).

@Amitayush Thakur does that solve your issue?

view this post on Zulip Bas Spitters (Oct 18 2023 at 18:28):

Emilio Jesús Gallego Arias said:

Hi Bas Spitters, thanks for the ping!

I think porting CoqSerAPI to newer Coq versions should be very easy, if not automatic; indeed, I understand that they are on 8.15 for practical reasons, and moving to 8.16 doesn't add a lot of value for them, I think, but Alex is the authority on CoqSerAPI. Alex gave me very nice and detailed feedback on the improvements they'd need, so that's pending.

Regarding tooling, indeed the tooling is in a poor state, especially when compared to other systems. I'm unsure how to improve the tooling; on the technical side it is very clear to me what to do, but I feel there is, for example, large developer "cultural" resistance to unifying Coq developments in a more uniform way so they can be consumed as a dataset, to using more advanced tools in the core of the system (say PPX), etc. Moreover, as far as I know there is no project-wide AI strategy; I tried to do a working group a while ago but it didn't pick up, and my student working on the topic left for a startup (crazy turnover lol).

Depending on the kind of interaction you want to do, coq-lsp could make your life very easy, at the cost of writing a plugin; happy to help if there is a more concrete problem.

Thanks, @Emilio Jesús Gallego Arias. Do you have a write-up/roadmap of what needs to be done?
I'd imagine there's at least some interest from Coq (library) developers.

view this post on Zulip Bas Spitters (Oct 18 2023 at 18:36):

@Alex Sanchez-Stern If there's sufficient progress that the community should know about, but not a full paper, you could consider submitting it to https://popl24.sigplan.org/home/CoqPL-2024#Call-for-Presentations

view this post on Zulip Bas Spitters (Nov 03 2023 at 09:45):

Machine Learning ITP dataset (Lean and Agda) by @Andrej Bauer: https://arxiv.org/abs/2310.16005
Unfortunately, Coq is missing.

view this post on Zulip Jason Rute (Nov 03 2023 at 15:15):

I provide some commentary on that dataset over at the Machine Learning channel of the Lean Zulip. I don't think that data is necessarily that fancy. I think (but I could be mistaken) it is just the terms for each declaration (theorem, definition, axiom, etc.) along with some data for recovering dependencies. If so, this information is readily available in a few places in Lean, such as the olean files and by querying the Lean environment inside Lean. I assume you can similarly get it for Coq (although Coq has the challenge of 100s of different Coq packages). One could also look to existing datasets like CoqGym for possibly similar data.

view this post on Zulip Jason Rute (Nov 03 2023 at 15:16):

I'm not the most impressed with their experiments from what I've seen. It seems they are using quite old techniques, and it isn't immediately clear how applicable their results are to real concerns in machine learning theorem proving. Their link prediction task seems to be a form of premise selection. Nonetheless, I do think it is time we revisit premise selection. Premise selection and argument selection are key to theorem proving, and most papers have something like a premise selection tool inside them. (For example, MagnusHammer is a premise selection tool that largely replaces the external ATP call in Sledgehammer, from what it seems.) It would be really interesting to try to compare many different premise selection methods from across the literature. (This is easier said than done, of course.) There was once a premise selection dataset called HOLStep, but people don't use it anymore, and none of the modern premise selection approaches are represented in the HOLStep results. Maybe it is time to revive the topic.

view this post on Zulip Karl Palmskog (Nov 03 2023 at 15:32):

if their main thing is serialized (sexp) terms for declarations, then we collected a bunch of those for MathComp and its ecosystem when we did learning to suggest lemma names: https://github.com/EngineeringSoftware/math-comp-corpus

I'm sure the same dataset could be refreshed for recent Coq+MathComp; most projects are now in the Coq Platform (and so are the tools for generating sexps)

view this post on Zulip Abhishek Anand (Nov 06 2023 at 01:45):

TLDR: In my opinion, we need to collect very different datasets: the datasets should record how humans actually develop Coq proofs and programs rather than the polished end results. We need to record our struggles with the Coq typechecker and the queries (e.g. Search) we invoke to find the right stuff. I will get to the "why" later, but first I have 2 suggestions for getting such datasets; they may appeal differently to people with varying levels of privacy concerns.
1) Request the Coq community to not delete the query commands they use to produce their proofs. They can remain in the MR/PR (merge/pull request) until the MR/PR is ready to merge, and some git protocol should be followed to ensure that the unclean version with all the queries remains in the history (there are many ways to do this). Since my ML research days (a decade ago), I have appreciated how much data is gold. Sadly, most of my colleagues in the theorem-proving community rewrite their git history (not just the final version) to look very clean and thus lose information.
2) Instrument Coq IDEs to record the process of coming up with the right Coq function/proof. While writing a Coq definition/Fixpoint/proof, this should include not just the final version but also the versions that had typechecker/tactic errors, the error messages, and the edits the IDE users did to fix them. It should also include the queries (Search/About/Check/Print...) the humans did during the process. I know many will have privacy concerns, but at least I would not have minded, as long as I could review the final data before it is submitted for training AI models. Some employers may have concerns about leaking secret sauces, but I guess most of academia will not. Those employers may also collect the data anyway but not release it until they decide to open-source the proofs: it would anyway be hard in the long run to sell a proven product and not release the proofs.
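
(For concreteness, the query vocabulary meant here, in a hedged toy form:)

From Coq Require Import PeanoNat.

Search (_ + 0 = _).   (* find lemmas whose statements match a pattern *)
About Nat.add_0_r.    (* show a constant's statement and metadata *)
Check 2 + 2.          (* show the type of a term *)
Print Nat.add.        (* show a definition's body *)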

WHY?
I guess nobody knows why modern neural networks work so well, but it seems most of the breakthrough ideas in the field have come from thinking about and mimicking how humans (brains) work: from the inception to ideas like attention heads. So I think it is worthwhile thinking about how humans write Coq proofs/functions, or even essays/papers. We never come up with the final version right away: it is often a long back and forth. To me it seems odd that the current LLM architectures do not have "backspace/left-arrow/delete" on their keyboard: they can only predict the next word, even when they seem to already know that they made a mistake. Imagine humans writing a Coq program or an essay in one shot when your keyboard has no backspace/left-arrow/delete buttons: I would be extremely bad at it. For Coq programs/proofs, imagine writing them without being able to issue queries or look at typechecker errors.

A prompting loop can address some limitations but doesn't seem to go very far. For example, a few weeks ago I tried a manual prompt loop where I fed error messages and tried to get ChatGPT-4 to write a correct vec.append function recursive on the second argument, something that human beginners in Coq struggle with. It was able to correct some errors but did not get it fully right. I don't think that just feeding it the errors would have sufficed, as it wasn't making progress. Then I asked ChatGPT to come up with a Search query for the thing it was struggling with. It came up with the right query and used the result I gave it to finally write the correct vec.append. The full interaction can be found here (my initial prompt was imprecise and it took me 3 prompts to precisely specify what I was asking ChatGPT to do): https://chat.openai.com/share/63cb977c-6022-4ee0-8618-8c6d04cb56b4
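
(For concreteness, a hedged sketch, not ChatGPT's output, of one correct way to write such a function: append on length-indexed vectors by structural recursion on the second argument. The difficulty is that the index n + m does not reduce on m, so each branch needs a cast; the lemma names are the stdlib's Nat module, assumed:)

From Coq Require Import Arith Vector.

Fixpoint append2 {A n m} (v : Vector.t A n) (w : Vector.t A m) {struct w}
  : Vector.t A (n + m) :=
  match w in Vector.t _ k return Vector.t A (n + k) with
  | Vector.nil =>
      (* need n = n + 0 to cast v into the expected index *)
      Vector.cast v (eq_sym (Nat.add_0_r n))
  | Vector.cons h k' t =>
      (* push h onto the end of v, recurse on t; need S (n + k') = n + S k' *)
      Vector.cast (append2 (Vector.shiftin h v) t)
                  (eq_sym (Nat.add_succ_r n k'))
  end.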

Later, I tried and failed to get ChatGPT to ask for query commands on its own when needed, rather than me reminding it at the right times in a prompt loop: https://chat.openai.com/share/75178430-7f60-41bd-9e92-0f56c31e36df

I don't think such prompting tricks will go far. I am not a neural network expert, but I think the architecture should be changed a bit to give ChatGPT more power to correct its mistakes and to ask for external help. Currently, the architecture allows only one action: predict the next word(let). It should be allowed to do the following action as well: delete the previous word. Moreover, in the context of Coq/Lean, it should also be allowed to choose the next step to be: evaluate a query (Check/Search/Print...) and use the result.

To evaluate the hypothesis above, we would need a lot of new training data, which can be collected if my suggestions at the top of this post are incorporated by theorem-proving communities (remember, data is the modern gold). Even then, it may take years to get a large enough dataset. For now, I think somebody could already do a similar experiment to test parts of the hypothesis: get online editing services like Google Docs and Microsoft Word Online to just add the "delete last word" action to the architecture and train LLMs using human editing histories (hopefully they preserved them), and see if it improves the quality of the essays LLMs write, e.g. reduces hallucination. I neither own the GPUs nor the data to do this experiment, nor do I know whom to ask to get this experiment done, but maybe someone here is well-connected.

view this post on Zulip Karl Palmskog (Nov 06 2023 at 14:54):

@Abhishek Anand there is this dataset that recorded low-level interaction with Coq for some volunteer users: https://github.com/uwplse/analytics-data

view this post on Zulip Bas Spitters (Nov 09 2023 at 10:42):

Some more training on Lean and Isabelle datasets:
https://arxiv.org/abs/2310.10631

view this post on Zulip Jason Rute (Nov 09 2023 at 21:50):

That one also trains on Coq, but only on the raw Coq .v files IIRC, whereas it trains on proof-state/tactic pairs as well for Isabelle and Lean.

view this post on Zulip Jason Rute (Nov 09 2023 at 21:53):

@Sean Welleck gave a talk recently on Llemma https://m.youtube.com/watch?v=bRHb-MVExJ4&pp=ygUfI21hbGxhYmh1bWluc3RpdHV0ZW9mdGVjaG5vbG9neQ%3D%3D

view this post on Zulip Abhishek Anand (Dec 01 2023 at 14:14):

has anyone used OpenAI's fine-tuning API to make ChatGPT better at problems related to Coq proving/programming? I am thinking about using fine-tuning to teach it to do queries instead of hallucinating, but I'm still uncertain about how to go about it.

view this post on Zulip Bas Spitters (Dec 03 2023 at 16:23):

@Abhishek Anand the instructions here do not work for me. Perhaps it needs a paid version
https://openai.com/blog/introducing-gpts

view this post on Zulip Karl Palmskog (Dec 03 2023 at 16:26):

I think it'd be nicer for the community if someone did fine-tuning for Coq on top of Facebook's LLaMA(2) models; then at least researchers could run the model directly

view this post on Zulip Jason Rute (Dec 03 2023 at 16:42):

Karl Palmskog said:

I think it'd be nicer for the community if someone did fine-tuning for Coq on top of Facebook's LLaMA(2) models; then at least researchers could run the model directly

The Llemma model?

view this post on Zulip Karl Palmskog (Dec 03 2023 at 17:28):

in that direction, sure, but I believe Llemma was tuned on lots of stuff that is unlikely to help solving Coq-specific tasks ("a mixture of scientific papers, web data containing mathematics, and mathematical code")

view this post on Zulip Karl Palmskog (Dec 03 2023 at 17:31):

as I think we established in the Coq community survey, a large chunk of Coq (theory, practice) overlaps with programming rather than pure math

view this post on Zulip Jason Rute (Dec 03 2023 at 17:38):

It was built on top of Code-LLaMa so there is also a lot of programming code underneath, and it was fine-tuned on Coq files (among other ITP files). Of course, if Coq is the main goal, there might be a better Coq-specific training dataset, which could even depend on the specific Coq task. But at some point, the cost of fine-tuning for each specific task is too much. I don't know what you all have in mind, but it might be worth trying Llemma first before building a Coq-specific LLaMA model.

view this post on Zulip Jason Rute (Dec 03 2023 at 17:46):

For reference, training Llemma 7B was 23k A100-hours. Training Llemma 34B was 47k A100-hours. Of course, there may be more efficient methods like LoRA, etc. to fine-tune a model.

view this post on Zulip Abhishek Anand (Dec 03 2023 at 23:18):

Bas Spitters said:

Abhishek Anand the instructions here do not work for me. Perhaps it needs a paid version
https://openai.com/blog/introducing-gpts

Sorry for not being precise earlier: I meant this API https://platform.openai.com/docs/guides/fine-tuning
(custom GPTs seem to be a different thing https://community.openai.com/t/custom-gpts-vs-fine-tuning-whats-the-difference/477738/4)

Thanks Karl and Jason for the suggestions about open-source models.

view this post on Zulip Emilio Jesús Gallego Arias (Dec 05 2023 at 18:22):

Bas Spitters said:

Thanks, Emilio Jesús Gallego Arias. Do you have a write-up/roadmap of what needs to be done?
I'd imagine there's at least some interest from Coq (library) developers.

I have plenty of notes, and I tried to do a machine learning working group (https://github.com/coq/coq/wiki/Coq-Topic-Working-Group-Document-Data-and-Machine-Learning-Platform) and we had a couple of meetings.

In the end, coordination was too hard, and I am very short on time. But it'd be nice to maybe try to resurrect the working group?

view this post on Zulip Abhishek Anand (Dec 06 2023 at 23:47):

Abhishek Anand said:

has anyone used OpenAI's fine-tuning API to make ChatGPT better at problems related to Coq proving/programming? I am thinking about using fine-tuning to teach it to do queries instead of hallucinating, but I'm still uncertain about how to go about it.

one interesting (IMO) experiment that should be easy to do is to fine-tune ChatGPT or other models on the data collected by the REPLICA work (https://github.com/uwplse/analytics-data).

view this post on Zulip Bas Spitters (Jul 13 2024 at 09:01):

A custom Llama model for autoformalization of Lean4 statements. It does better on the MiniF2F benchmark, which does not include Coq. They seem to be unaware of the work by @Jason Rute and @Lasse Blaauwbroek
https://arxiv.org/pdf/2407.03203

view this post on Zulip Jason Rute (Jul 13 2024 at 12:41):

Thanks Bas. This paper, while similar to our work in its goal (prove theorems automatically), takes a very different path using LLMs to generate the whole proof in one go (with 128 attempts). In that regard, it is very similar to the DeepSeek-Prover paper, both in the whole proof generation and in getting better data through auto-formalization. (DeepSeek-Prover does much better, but I think they also did more training and used more computation in the end.) I still need to read more about this paper, especially the part about Natural Language Guidance. I can't tell if it is similar to the natural language guidance in Draft-Sketch-Prove and its successors. If I read more of the paper, I might summarize it later on the Lean Zulip (in the "Machine Learning for Theorem Proving" stream).

view this post on Zulip Jason Rute (Jul 13 2024 at 12:49):

There is a nice discussion about the DeepSeek-Prover paper on the Lean Zulip at https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/DeepSeek-Prover .

view this post on Zulip Jason Rute (Jul 13 2024 at 12:52):

Also, as you said, since MiniF2F isn't in Coq and I think both of these papers are targeting competition-style math problems, rather than real-world theorem proving, it is hard to directly compare their work to ours. That is the Achilles heel of this field.

view this post on Zulip Bas Spitters (Jul 13 2024 at 12:52):

Thanks, @Jason Rute. Is my recollection correct that informalization/autoformalization of Coq proofs is still mostly open? I had a quick peek here https://github.com/lupantech/dl4math, but that project seems to have stalled.

view this post on Zulip Jason Rute (Jul 13 2024 at 12:54):

I'm not sure if anyone is working on it, but I also think the ideas would be the exact same as in Lean or Isabelle. The challenge would be just to extend it to Coq.

view this post on Zulip Bas Spitters (Jul 13 2024 at 12:59):

There might be an advantage to having a tactic language closer to NL (like Isabelle and Lean). We had Czar in Coq, but it was discontinued.
I guess we'll have to wait and see...

view this post on Zulip Jason Rute (Jul 13 2024 at 13:01):

I'm not sure if anyone is yet using auto-formalization to do actual real-world formalization in Lean, Coq, or Isabelle. It seems it is mostly used to test AI systems, construct training data for other AI systems, and convert informal proof sketches into formal proof sketches which can be filled in with automation. Autoformalization of definitions and new concepts would be very different from auto-formalization of theorem statements with standard mathematical objects. We only really have a grasp on the latter right now.

view this post on Zulip Jason Rute (Jul 16 2024 at 11:09):

I just found an undergraduate thesis about using LLMs in Coq, including a form of proof repair that improves results: https://dash.harvard.edu/handle/1/37379274 (The author doesn't appear to be on this Zulip, but their advisor @Nada Amin appears to be.) They appear to test on Software Foundations. Their GPT-4 model with "lemma extraction" can solve 79.42% of theorems, while CoqHammer (cvc4 backend with a 20s timeout) solves 78.73%. Good work for an undergrad thesis! Their "lemma extraction" algorithm in section 4.2.1 appears to be novel, and it seems to be a sophisticated take on how to repair a bad proof from a language model. Namely, you find the place where the error occurs and try to get the language model to generate a proof for that goal (converting it into a theorem). So it seems like it is a form of tree search, but taking big steps, only branching on error states, which I think is a good way to make tree search more amenable to LLMs. It's much more sophisticated than what several other papers do, where they just generate 128 proofs independently. (I think their "lemma extraction" method is inspired by the proof-repair component of Baldur by @Emily First and @Talia Ringer, which is itself a sophisticated take on whole proof generation with LLMs and proof repair.)
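
(A hypothetical toy illustration of that extraction step, hedged, not the thesis's actual code: the stuck goal is lifted out as a standalone lemma, proved separately, then used to repair the original proof.)

From Coq Require Import Arith.

(* Hypothetical names; the stuck goal n + 0 = n becomes its own lemma. *)
Lemma extracted_goal (n : nat) : n + 0 = n.
Proof. apply Nat.add_0_r. Qed.

(* The original theorem is then repaired using the extracted lemma. *)
Lemma repaired (n m : nat) : n + 0 + m = n + m.
Proof. now rewrite extracted_goal. Qed.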

view this post on Zulip Jason Rute (Jul 16 2024 at 11:10):

I'm not sure I fully understand their GPT-4 baseline. Do they generate one proof? Many proofs? Because if they are comparing a single call to GPT-4 as the baseline vs. their "lemma extraction" method, which makes many calls to GPT-4, it might not be a fair comparison. I'd like to see at least a similar amount of compute used for the baseline. (They might do this; I can't find the details.)

view this post on Zulip Jason Rute (Jul 16 2024 at 11:10):

As for how it compares to other AI for Coq projects, like Graph2Tac, ProverBot9001, and Tactician, it is hard to know. I think testing on Software Foundations is quite limited. For one, it is popular, right? So there may be a lot of solutions already on the Internet in GPT-4's training data. And it seems like a fairly easy benchmark, with CoqHammer already getting ~79%. I would love a comparison on a more interesting benchmark. (This is really the biggest problem in this field, namely a lack of good benchmarks, especially ones that haven't already been compromised by LLM pretraining.)

view this post on Zulip Michael Soegtrop (Jul 16 2024 at 11:31):

I wonder what a "Bachelor with Honors" thesis is - some sort of in-depth study after a bachelor's to get an orientation for a later master's study?

view this post on Zulip Jason Rute (Jul 16 2024 at 12:22):

In some US colleges, to get honors with your bachelor's degree, you have to write a thesis. It is like a mini master's thesis. Some particularly good students do impressive original research.

view this post on Zulip Michael Soegtrop (Jul 16 2024 at 14:17):

I see - I thought a bachelor's always requires a thesis (in Germany this is the case, afaik). Indeed impressive work!

view this post on Zulip Bas Spitters (Jul 16 2024 at 15:49):

This indeed looks nice. However, they only do Logical Foundations, not the whole Software Foundations series. It would be interesting to benchmark the other books too...
They don't compare to Tactician and Graph2Tac. I'm curious how well those do on LF.

view this post on Zulip Bas Spitters (Jul 16 2024 at 15:53):

A couple of years ago, one of my students implemented RoosterSpec, a Coq implementation of QuickSpec: https://github.com/mikkelmilo/rooster-spec
It is known that such intermediate lemmas help automation, e.g. for Vampire:
https://link.springer.com/chapter/10.1007/978-3-031-63498-7_13
Something similar might also work for LLMs.
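
(A toy illustration of the kind of intermediate lemma such a tool might conjecture from random testing and then prove, hedged, not actual RoosterSpec output:)

From Coq Require Import List.

(* An equational law over rev and ++ that testing could suggest;
   here it happens to be a known stdlib lemma. *)
Lemma conjectured_rev_app (A : Type) (l l' : list A) :
  rev (l ++ l') = rev l' ++ rev l.
Proof. apply rev_app_distr. Qed.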

view this post on Zulip Bas Spitters (Jul 17 2024 at 16:22):

There is also Moogle for searching Lean libraries using natural language:
https://morph.so/blog/the-personal-ai-proof-engineer/
https://www.moogle.ai/

I wonder whether Coq has the datasets to allow training such a tool.

view this post on Zulip Talia Ringer (Jul 18 2024 at 23:24):

A master's student of mine just did a thesis about a search engine for Coq, and will talk about it at AITP. Not sure when the thesis will be available online though.

view this post on Zulip Bas Spitters (Jul 19 2024 at 07:25):

Great, @Talia Ringer! Is the engine available?

view this post on Zulip Talia Ringer (Jul 19 2024 at 23:52):

I just forwarded this to the student, hopefully he can chime in

view this post on Zulip Jason Rute (Jul 20 2024 at 14:21):

Another undergrad thesis on LLMs for Coq, this one by Pierce Malony at Princeton (not on this Zulip): http://arks.princeton.edu/ark:/88435/dsp01w3763b12c I can't tell if it is freely available anywhere. I'd be interested to learn more about this project (but I also worry about benchmarking LLMs like GPT-4 on CoqGym, since there is a lot of possibility of data leakage).

view this post on Zulip Jason Rute (Jul 20 2024 at 14:25):

Oh, the paper is here: https://github.com/piercemaloney/senior-thesis/blob/main/written_final_report.pdf

view this post on Zulip Jason Rute (Jul 20 2024 at 15:28):

There are a number of interesting ideas here, including putting a lot of context in the prompt, using A* search with GPT-4 as an evaluator to estimate the score of a given proof state, and more. As for the evaluation, while great for an undergraduate thesis, it is hard to know what to conclude here. The results look good, but it is hard to know how it compares to other AI for theorem proving models, or even AI for Coq models:

  1. The number of test theorems is fairly small.
  2. I wonder how expensive this is to run, given that it is a search (A*) with many calls to a LLM. It might not be too bad given that it is, I think, only 25 queries per proof attempt (which, if I'm correct, is orders of magnitude fewer queries than in Graph2Tac), but that can still add up.
  3. The test theorems are already probably in the training data for the models. For Llemma in particular, one can even manually check how much of the test theorems got into the training data since the training data is public and includes a significant amount of Coq code. (I haven't myself checked. Also, even if it isn't in the Llemma training data, it could be in the previous code-LLaMa training data.)

view this post on Zulip Jason Rute (Jul 20 2024 at 15:30):

For these reasons alone, I'm very reluctant to say with any certainty that this approach will generalize to new theorems. But at the same time, it might. Lately, several papers have shown that fine-tuned LLMs do fairly well at generating ITP proofs.

view this post on Zulip Jason Rute (Jul 20 2024 at 21:02):

And here is another paper published in a journal: https://dl.acm.org/doi/abs/10.1145/3671016.3674823 (I'm not on my work computer, so I can't access it.) I find it funny it calls COPRA "state of the art". :smile: It also has the problem of testing LLM-based solutions on CompCert.

view this post on Zulip Jason Rute (Jul 31 2024 at 12:41):

Bas Spitters said:

There is also Moogle for searching Lean libraries using natural language:
https://morph.so/blog/the-personal-ai-proof-engineer/
https://www.moogle.ai/

I wonder whether Coq has the datasets to allow training such a tool.

There are now two more semantic search projects for Lean that are a bit more open. You might be able to borrow what they are doing.

view this post on Zulip Bas Spitters (Aug 14 2024 at 08:13):

@Talia Ringer any news from your student?
