Stream: Machine learning and automation

Topic: CoqPilot


view this post on Zulip Bas Spitters (Nov 20 2023 at 10:29):

Just spotted this one. Has anyone tried it?
https://marketplace.visualstudio.com/items?itemName=JetBrains-Research.coqpilot

view this post on Zulip Michael Soegtrop (Nov 20 2023 at 11:08):

Unfortunately, for my day job I cannot use such systems, since they would upload data to servers - which we don't want. Are there models one can run locally? How much GPU RAM do such models require? Testing it on toy examples is likely not helpful for evaluating it.

view this post on Zulip Karl Palmskog (Nov 20 2023 at 11:10):

if they only support OpenAI models (which is claimed in the description), then no local models are currently supported. I guess we wait for someone to extend it to a downloadable model.

view this post on Zulip Karl Palmskog (Nov 20 2023 at 11:12):

for example, Facebook's "open access" LLMs (LLaMA) have variants that fit in 8 GB of GPU RAM, but for reasonable performance, 16 GB or even 24 GB is probably needed to use LLaMA variants with more parameters.

view this post on Zulip Michael Soegtrop (Nov 20 2023 at 11:18):

24 GB wouldn't be a problem. I was more afraid that one would need an 8-GPU high-end machine or the like.

view this post on Zulip Karl Palmskog (Nov 20 2023 at 11:20):

I think the biggest released Facebook LLM can be run with 48 GB RAM. But as expected, it performs much worse than the models behind an API, at least on some tasks. Nevertheless, reasonable for use by paranoid power users, who might even do their own additional training on top of it.

view this post on Zulip Michael Soegtrop (Nov 20 2023 at 11:42):

What I don't quite get: do these models generally understand Coq as-is, or does one have to use a specially trained model?

view this post on Zulip Michael Soegtrop (Nov 20 2023 at 11:43):

Is it the case that with a smallish model Coq is basically supported but unusable, while with the big server models the Coq support is at least not completely unreasonable?

view this post on Zulip Karl Palmskog (Nov 20 2023 at 11:45):

I haven't done research on LLMs, but the general idea is that the model has a general knowledge of lots of languages which can be accessed through prompt engineering: "write a gcd function in Coq/SSReflect". Additional training could refine the responses to fit what Coq users expect, such as giving .v code examples with comments.

view this post on Zulip Karl Palmskog (Nov 20 2023 at 11:46):

the main drawback of current non-tuned LLMs for Coq in my view is that they have never "seen" proof states, since this was not available in their general training data - they have only seen .v files and other Vernacular fragments

view this post on Zulip Karl Palmskog (Nov 20 2023 at 11:47):

we discovered using traditional neural learning techniques that even Coq code formatting by proof engineers typically takes proof state into account in a nontrivial way

view this post on Zulip Karl Palmskog (Nov 20 2023 at 12:01):

bottom line, I think the smaller models (7B parameters, <= 8 GB RAM) are going to do Coq Gallina code suggestion quite well, and proofs quite badly. The situation with proofs might improve a bit if you move to 65B parameter models, and probably the Gallina writing might be a lot better, or at least have potential to be better with question-response tuning.

view this post on Zulip Karl Palmskog (Nov 20 2023 at 12:02):

but people more familiar with LLM research might be able to contradict me...

view this post on Zulip Jason Rute (Nov 20 2023 at 12:08):

@Michael Soegtrop Tactician runs locally (and adapts instantly to your projects). Proverbot9001 and all the CoqGym family of models (ASTactic, TacTok, DIVA, Passport) also run locally, but they are research software and I don't know how easy they are to use.

view this post on Zulip Karl Palmskog (Nov 20 2023 at 12:10):

but I think the question here was specifically about models usable with CoqPilot? I wonder how many of ProverBot, etc., would support this workflow (coq-lsp is cutting-edge Coq)

view this post on Zulip Jason Rute (Nov 20 2023 at 12:18):

As for models which have seen proof states, the recent Llemma model has seen Lean and Isabelle proof states (but not Coq). It isn't really clear if GPT-4 has seen proof states. Daniel Selsam works at OpenAI and used to be a Lean developer. Also, OpenAI used to do theorem proving research on Lean. Also, I have one GitHub project that has Lean proof states in the hover text of an HTML file. Also, if they captured CoqGym (not sure how easy the dataset is to ingest), they might also get Coq proof states. I guess the point is no one knows.

view this post on Zulip Karl Palmskog (Nov 20 2023 at 12:40):

OK, let me rephrase defensively: the mistakes/inadequacies I observe in Coq code generated with current LLMs (Facebook LLaMA, ChatGPT, Copilot, etc.) are consistent with the models not being aware of Coq proof states, and also consistent with what we found lacking in custom neural models that were not aware of proof states (but only of Coq syntax, tokens, AST)

view this post on Zulip Michael Soegtrop (Nov 20 2023 at 12:51):

@Karl Palmskog @Jason Gross: thank you for the explanations and for sharing your insights - very helpful!

view this post on Zulip Jason Rute (Nov 20 2023 at 12:51):

As for this project, I wonder how it works. I also wonder if they will produce a paper or share the details otherwise. My guess is (like others have suggested) it doesn’t work so well, but that it is hard to tell since there aren’t good benchmarks (and it requires access to OpenAI’s models). By the way Copra is another paper using GPT-4 for proofs in Lean and Coq. I think the big thing this project provides is a nice user interface.

view this post on Zulip Bas Spitters (Nov 20 2023 at 13:10):

Copra: https://arxiv.org/abs/2310.04353

view this post on Zulip Bas Spitters (Nov 20 2023 at 13:12):

@Anton Podkopaev is one of the developers of Coqpilot. I hope he'll be able to answer the questions above.

view this post on Zulip Jason Rute (Nov 20 2023 at 13:16):

My big question for @Anton Podkopaev is the following: Is it just more or less outputting a proof as a code completion task, or is it doing a proof search? Also, it is just proof completion right now, correct? There aren't other coding tasks yet, right?

view this post on Zulip Karl Palmskog (Nov 20 2023 at 13:32):

it seems tempting to suggest that "proper" training for AI systems to do Coq proofs would involve low-level learning of primitives for manipulating the proof context, such as the nice set of primitives here.

At least I've always felt skeptical about the "Ltac completion" approach, not least because that's not how real proof engineers do it.

view this post on Zulip Anton Podkopaev (Nov 20 2023 at 13:38):

Hi @Jason Rute, for now CoqPilot does the following. Once a user invokes the corresponding action, CoqPilot takes the proof state before each admit in the file and sends a request to ChatGPT to fill it. ChatGPT then sends us n potential solutions for each hole (30 is the default, but it can be changed in the settings); we typecheck them, and if a solution is accepted, we insert it in place of the admit.
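A minimal Python sketch of this generate-and-typecheck loop; the helper names `ask_llm` and `typechecks` are hypothetical stand-ins for the ChatGPT request and the Coq typechecker call, not CoqPilot's actual API:

```python
def fill_admits(holes, ask_llm, typechecks, n_candidates=30):
    """For each admit hole, request n candidate proofs and keep the
    first one that typechecks; otherwise leave the admit in place.

    `holes` maps a hole id to its proof state; `ask_llm(state, n)`
    returns candidate proof scripts; `typechecks(state, proof)`
    reports whether Coq accepts the candidate for that goal.
    """
    results = {}
    for hole_id, proof_state in holes.items():
        for candidate in ask_llm(proof_state, n=n_candidates):
            if typechecks(proof_state, candidate):
                results[hole_id] = candidate  # replaces the admit
                break
        else:
            results[hole_id] = "admit."  # no candidate typechecked
    return results
```

Note that typechecking each candidate makes the generation step safe: a wrong candidate is simply discarded, never inserted.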

view this post on Zulip Anton Podkopaev (Nov 20 2023 at 13:40):

We don't try to do regular completion.

We are thinking about supporting local LLMs instead of remote ChatGPT calls.

view this post on Zulip Karl Palmskog (Nov 20 2023 at 13:46):

@Anton Podkopaev are there examples somewhere of the data sent to ChatGPT for some nontrivial Coq file? It would be interesting to feed one manually to different LLMs just to get a general idea of how they do

view this post on Zulip Michael Soegtrop (Nov 20 2023 at 13:49):

@Anton Podkopaev : I wonder if you tell ChatGPT in some way which of the 30 solutions was successful. If not, how does it improve on this?

view this post on Zulip Jason Rute (Nov 20 2023 at 13:55):

@Anton Podkopaev The one-shot experiments in the Copra paper, as well as the personal experience of anyone who has tried to use ChatGPT to fill in a proof in Coq (or Lean), suggest this approach isn't that strong (especially compared to all the other work in this field). Is there a reason you think this is a value add to users? Is the user interface the real selling feature here? Do you find it is actually very useful in practice? Or is this just a stepping stone to better approaches?

view this post on Zulip Jason Rute (Nov 20 2023 at 14:00):

Although, as I've often said the true test is the end users, so if people find value, that is good enough, and it is quite commendable you have an easy-to-use tool.

view this post on Zulip Anton Podkopaev (Nov 20 2023 at 14:00):

@Jason Rute, for now, CoqPilot is an MVP. We plan to implement and compare other strategies, but I believe that this is a nice interface.

view this post on Zulip Abhishek Anand (Nov 20 2023 at 14:00):

Anton Podkopaev said:

Hi Jason Rute , for now, CoqPilot does the following thing. Once a user invokes the corresponding action, CoqPilot takes proof states before each admit in the file and sends the request to ChatGPT to fill it. Then, ChatGPT sends us n potential solutions for each hole (30 is default one, but might be changed in settings), then we typecheck them and if a solution is accepted, we insert it instead of admit.

you may get better results by sending the error messages back to ChatGPT asking for a fix, and also encouraging it to ask queries instead of hallucinating lemma names:
https://chat.openai.com/share/63cb977c-6022-4ee0-8618-8c6d04cb56b4
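A rough sketch of the error-feedback loop Abhishek suggests; `ask_llm` and `check` are hypothetical stand-ins for the model call and the Coq typechecker:

```python
def prove_with_feedback(goal, ask_llm, check, max_rounds=3):
    """Ask the model for a proof; on failure, send the Coq error
    message back as the next user turn and retry.

    `ask_llm(history)` returns a proof script for a chat history;
    `check(goal, proof)` returns (ok, error_message).
    """
    history = [{"role": "user", "content": goal}]
    for _ in range(max_rounds):
        proof = ask_llm(history)
        ok, error = check(goal, proof)
        if ok:
            return proof
        # Feed the failure back so the model can repair its attempt.
        history.append({"role": "assistant", "content": proof})
        history.append({"role": "user",
                        "content": f"That failed with: {error}. Please fix it."})
    return None  # give up after max_rounds attempts
```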

view this post on Zulip Anton Podkopaev (Nov 20 2023 at 14:02):

@Michael Soegtrop currently, we are not sending any feedback to ChatGPT, but we'll definitely experiment with these strategies.

view this post on Zulip Jason Rute (Nov 20 2023 at 14:06):

I would love it if work like this also was able to publish performance on benchmarks so that it could be compared to other work. I understand comparing automated theorem provers for ITPs (especially Coq) is a huge pain, but it would help a lot.

view this post on Zulip Anton Podkopaev (Nov 20 2023 at 14:08):

Yes, I agree, we definitely plan to have benchmarks eventually :)

view this post on Zulip Anton Podkopaev (Nov 20 2023 at 14:09):

BTW, we are happy to collaborate on the tool, if anyone is interested )

view this post on Zulip Bas Spitters (Nov 20 2023 at 14:22):

Have you considered autoformalization?
Would custom GPTs be helpful here? I haven't looked at how hard they are to build.
https://openai.com/blog/introducing-gpts

view this post on Zulip Andrei Kozyrev (Nov 20 2023 at 14:27):

Karl Palmskog wrote:

Anton Podkopaev are there examples of data sent to ChatGPT for some nontrivial Coq file somewhere? Would be interesting to feed one manually to different LLMs just to get a general idea of how they do

Hi @Karl Palmskog! For now, the main purpose of sending data beyond the theorem we are proving is to fix the format of the output for the LLM, i.e. GPT. The problem is that we want the GPT response to contain only Coq code in a specific format, e.g. Proof. ${proof} Qed. To achieve this, we send a simple system prompt and then a history of messages in the following way:
user: Theorem test : P.
assistant: Proof. ... Qed.
...

This history is taken from the file and contains only theorems, and only those that have valid proofs.
This also gives GPT an idea of the theorems we have in context and of proofs of potentially similar goals.

All in all, we wanted to create a framework as modular around the LLM as possible and would love to experiment on other models than GPT.
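The message history Andrei describes can be sketched as follows; the function name and its parameters are illustrative, not CoqPilot's actual code:

```python
def build_messages(system_prompt, proved_theorems, target_statement):
    """Build a ChatGPT-style message list: each previously proved
    theorem from the file becomes a user/assistant example pair
    (statement in, proof out), and the target theorem statement
    is the final user message.

    `proved_theorems` is a list of (statement, proof) pairs.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for statement, proof in proved_theorems:
        messages.append({"role": "user", "content": statement})
        messages.append({"role": "assistant", "content": proof})
    messages.append({"role": "user", "content": target_statement})
    return messages
```

The few-shot pairs both pin down the output format (Proof. ... Qed.) and show the model proofs of potentially similar goals from the same file.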

view this post on Zulip Andrei Kozyrev (Nov 20 2023 at 14:29):

@Bas Spitters, yes, we have looked into it. Might be helpful. There are many ideas and ways to improve. Right now we have a plain baseline. Thank you for the ideas!

view this post on Zulip Karl Palmskog (Nov 20 2023 at 14:50):

thanks, Andrei. Then I guess the most challenging limitation of open-access LLMs will be the size of the context, which depends on the size of the Coq file. Without fine-tuning, I think some models max out at a couple of thousand tokens of context...

view this post on Zulip Andrei Kozyrev (Nov 20 2023 at 16:38):

Yes, you are right. Unfortunately this is especially the case for, e.g., GPT-3.5: only around 10-12 small-to-average-size theorems fit within the limit. So if we exceed it, we either randomly choose theorems until the limit is reached, or choose those located near the theorem to be solved.
For GPT-4, however, the limit is several times bigger, and from what we have observed so far, out-of-the-box results are better, but the model is also several times more expensive.
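One of the fallbacks Andrei mentions - choosing theorems near the target until the token budget is exhausted - can be sketched like this (`n_tokens` is a hypothetical token-counting function):

```python
def select_context(theorems, target_index, budget, n_tokens):
    """Greedily pick theorems nearest (by file position) to the
    target theorem until the token budget is exhausted.

    `theorems` is the list of proved theorems in file order;
    `n_tokens(t)` estimates the token cost of including theorem t.
    Returns the selected theorems, in their original file order.
    """
    # Visit indices by distance from the target theorem.
    order = sorted(range(len(theorems)),
                   key=lambda i: abs(i - target_index))
    chosen, used = [], 0
    for i in order:
        if i == target_index:
            continue  # never include the theorem being proved
        cost = n_tokens(theorems[i])
        if used + cost > budget:
            break  # budget exhausted
        chosen.append(i)
        used += cost
    return [theorems[i] for i in sorted(chosen)]
```

A smarter alternative, as Karl suggests below in the thread, would be to replace the distance heuristic with premise selection.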

view this post on Zulip Karl Palmskog (Nov 20 2023 at 16:41):

hmm, instead of choosing randomly, you might want to run premise selection as in CoqHammer? https://github.com/lukaszcz/coqhammer/tree/coq8.18/src/predict (this predict executable will be available to anyone that installs the Coq Platform)

view this post on Zulip Andrei Kozyrev (Nov 20 2023 at 16:42):

Abhishek Anand wrote:

Anton Podkopaev said:

Hi Jason Rute , for now, CoqPilot does the following thing. Once a user invokes the corresponding action, CoqPilot takes proof states before each admit in the file and sends the request to ChatGPT to fill it. Then, ChatGPT sends us n potential solutions for each hole (30 is default one, but might be changed in settings), then we typecheck them and if a solution is accepted, we insert it instead of admit.

you may get better results by sending the error messages back to chatgpt asking for a fix, and also encouraging it to ask queries (e.g. Search) instead of hallucinating lemma names:
https://chat.openai.com/share/63cb977c-6022-4ee0-8618-8c6d04cb56b4

Hi @Abhishek Anand! Yes, we also feel it can improve performance. Hopefully this could be added in milestone 2.0.0.

view this post on Zulip Andrei Kozyrev (Nov 20 2023 at 16:46):

Karl Palmskog wrote:

hmm, instead of choosing randomly, you might want to run premise selection as in CoqHammer? https://github.com/lukaszcz/coqhammer/tree/coq8.18/src/predict (this predict executable will be available to anyone that installs the Coq Platform)

Thanks, Karl! I am not very familiar with how CoqHammer works, but I will dive into this!

view this post on Zulip Abhishek Anand (Dec 01 2023 at 18:23):

Does someone know what features of emacs+proof-general+company-coq CoqPilot is missing? Does CoqPilot have jump-to-definition, basic syntax highlighting, and autocompletion smarter than just suggesting words in the current document (e.g. company-coq queries Coq for all constants visible at the current position, along with their minimally-but-sufficiently qualified names)? I am thinking about instrumenting some currently popular open-source Coq IDE with features to record how humans do proofs, like the REPLICA work. I currently use company-coq, but it seems most Coq programmers are switching or have already switched to VsCoq.

view this post on Zulip Karl Palmskog (Dec 01 2023 at 19:14):

@Abhishek Anand according to the 2022 Coq survey (hundreds of responses), the community was then split fairly evenly between CoqIDE, Emacs+ProofGeneral, and VsCoq: https://coq.discourse.group/t/coq-community-survey-2022-results-part-ii/1746#coq-user-interfaces-7

view this post on Zulip Karl Palmskog (Dec 01 2023 at 19:16):

to my knowledge (I didn't try it), since CoqPilot is a general VS Code extension, it can be used together with both VsCoq2 and Coq-LSP, which I believe have the usual syntax highlighting and jump-to-definition stuff...

view this post on Zulip Karl Palmskog (Dec 01 2023 at 19:19):

the main difference from 2022 to today editor-wise is that it has been announced that CoqIDE is going to be removed from Coq's repository: https://github.com/coq/ceps/blob/master/text/068-coqide-split.md

view this post on Zulip Paolo Giarrusso (Dec 01 2023 at 21:53):

Wait, jump-to-definition in vscode for Coq? I didn't see it in either extension. Coq-lsp has an issue for it: https://github.com/ejgallego/coq-lsp/issues/317

view this post on Zulip Anton Podkopaev (Dec 04 2023 at 10:55):

@Abhishek Anand

Does someone know what features of emacs+proof-general+company-coq CoqPilot is missing?

Yes, @Karl Palmskog is right: CoqPilot is not trying to be a replacement for the coq-lsp/VsCoq extensions and is supposed to be used with them. Under the hood, we use coq-lsp for processing documents, but that doesn't prevent you from using VsCoq as your main Coq-support plugin in VS Code.

view this post on Zulip Emilio Jesús Gallego Arias (Dec 04 2023 at 17:51):

Hi folks, great work! I'm very interested in helping your extension run smoothly. I've created a coordination issue https://github.com/ejgallego/coq-lsp/issues/628 , but I'm also very happy to keep chatting on Zulip (maybe we could have a CoqPilot stream) or by Zoom cc: @Anton Podkopaev @Andrei Kozyrev

I have a few ideas already:

view this post on Zulip Emilio Jesús Gallego Arias (Dec 04 2023 at 17:53):

@Abhishek Anand coq-lsp does support jump-to-definition (for now only within the same file) and some forms of completion. More advanced completion is in a PR, as well as project-wide build and symbol lookup; there is still some work to do in both cases, but it can be prioritized if needed.

view this post on Zulip Emilio Jesús Gallego Arias (Dec 04 2023 at 17:54):

Support for telemetry so far is only in the engine, but things are in place so it should be easy to log quite a few events.

view this post on Zulip Anton Podkopaev (Dec 05 2023 at 12:10):

Yes, thank you @Emilio Jesús Gallego Arias!

view this post on Zulip Emilio Jesús Gallego Arias (Dec 05 2023 at 18:24):

By the way @Anton Podkopaev @Andrei Kozyrev , you may be interested in the working group for Coq + UI that @Benjamin Pierce has been running lately (c.f. https://coq.zulipchat.com/#narrow/stream/237661-User-interfaces-devs-.26-users/topic/Topics.20for.20next.20Coq.20UI.20discussion.3F )

If you are interested, it seems to me that a presentation of CoqPilot there could be a good fit

view this post on Zulip Bas Spitters (Dec 07 2023 at 15:41):

I'm wondering whether techniques like deep learning with compiler feedback could also help us to have LLMs produce Coq code that is more likely to compile:
https://arxiv.org/abs/2305.18341
https://dl.acm.org/doi/pdf/10.1145/3576915.3623175

view this post on Zulip Bas Spitters (Dec 23 2023 at 16:23):

LeanCopilot looks nice. How much of this functionality do we have available in CoqPilot already, perhaps by combining it with Tactician?
https://github.com/lean-dojo/LeanCopilot

view this post on Zulip Abhishek Anand (Dec 23 2023 at 16:31):

IIUC, CoqPilot does not do backtracking tree search and just gets LLMs to produce the whole proof in one shot. LeanCopilot asks LLMs to generate the next proof step and backtracks over the top few choices. I think CoqPilot should plug in COPRA (https://arxiv.org/pdf/2310.04353.pdf), which does backtracking and also feeds error messages back to the LLM (ChatGPT).
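The per-step backtracking search Abhishek contrasts with one-shot generation can be sketched as a depth-first search over tactic suggestions (all helper functions here are hypothetical):

```python
def dfs_search(state, suggest, apply_tactic, is_done,
               depth=0, max_depth=8):
    """Depth-first proof search with backtracking: at each proof
    state, try the model's top-k tactic suggestions in order; on
    failure (or a dead end deeper in the tree), fall back to the
    next suggestion.

    `suggest(state)` returns candidate tactics; `apply_tactic`
    returns the new state or None if Coq rejects the tactic;
    `is_done(state)` tests whether the goal is closed.
    Returns the list of tactics forming a proof, or None.
    """
    if is_done(state):
        return []
    if depth >= max_depth:
        return None
    for tactic in suggest(state):
        new_state = apply_tactic(state, tactic)
        if new_state is None:
            continue  # tactic failed; try the next suggestion
        rest = dfs_search(new_state, suggest, apply_tactic, is_done,
                          depth + 1, max_depth)
        if rest is not None:
            return [tactic] + rest  # found a complete proof
    return None  # dead end: backtrack to the caller
```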

view this post on Zulip Abhishek Anand (Dec 23 2023 at 16:32):

I think tweaking COPRA to do tree of thoughts should give another boost https://arxiv.org/abs/2305.10601

view this post on Zulip Jason Rute (Dec 23 2023 at 22:23):

Despite both using language models, CoqPilot and LeanCopilot use very different amounts of compute. LeanCopilot runs on a CPU. But at the same time, it is not clear it is better than Coq tools like CoqHammer, Tactician, or ProverBot9001 (when run on a CPU).

view this post on Zulip Jason Rute (Dec 23 2023 at 22:23):

Tactician (which is for Coq) and LeanCopilot also have very similar interfaces. Maybe people who are familiar with both Coq and Lean should try them out on similar problems and compare.

view this post on Zulip Karl Palmskog (Dec 23 2023 at 22:26):

probably good to keep in mind that the CoqHammer author deliberately did not support automatic proofs by induction (like HOL4's Metis and recent similar tools for Lean)

view this post on Zulip Karl Palmskog (Dec 23 2023 at 22:35):

(do machine learning evals generally take into account which logic fragment a goal requires?)

view this post on Zulip Jason Rute (Dec 23 2023 at 22:57):

No, evaluations don't typically take this into account, but at the same time the big advantage of machine learning is that it isn't constrained to a logic fragment. It is certainly true that CoqHammer works on problems the ML provers don't, so they can be complementary. Some tools like Thor smartly combine ML and hammer approaches. Others like MagnusHammer replace most of the ATP parts of the hammer with ML lemma selection.

view this post on Zulip Bas Spitters (Apr 04 2024 at 12:19):

I see CoqPilot has reached 2.0.0. Has anyone used it?
https://marketplace.visualstudio.com/items/JetBrains-Research.coqpilot/changelog

As an aside, they mention the limited context window of ChatGPT. Apparently X's Grok 1.5 provides a much larger window, but it needs a paid account. I'm not aware of any open models with a large context window.

view this post on Zulip Jason Rute (Apr 09 2024 at 10:48):

@Bas Spitters I agree with you that using tools like this is the best test. I think there are two ways to rank tools like this. (1) User experience and (2) ability to close the goals you want automatically closed.

My guess is that the current best tool for user experience is GitHub Copilot. I don't use it (because of work reasons), but the Lean people I talk to who do say it is better than the ITP-specific tools they have tried, and it does a lot more than close goals. Also, they like the user experience overall.

As for the ability to close goals, I find it very suspicious that CoqPilot hasn't benchmarked this. I think that means either it is really expensive to run, or it isn't very good at closing goals. From our Graph2Tac paper, Tactician's default k-NN seems to be the leader right now at closing goals, but it is even better if combined with Graph2Tac (and CoqHammer).

Of course, if some Coq users are adventurous enough to try all the tools (Github Copilot, CoqPilot, Tactictian, Graph2Tac, ProverBot9001 via Proofster, CoqHammer, and others) and rank them both in their user experience and ability to close goals, I think it would be a great community service.

view this post on Zulip Bas Spitters (Apr 09 2024 at 11:47):

@Anton Podkopaev has any such benchmarking been done for CoqPilot?

view this post on Zulip Bas Spitters (Apr 09 2024 at 11:56):

@Jason Rute Thanks!
I understand leanpilot depends on the OpenAI codex model.
I believe https://huggingface.co/blog/starcoder tries to be an open alternative.

What I understand from your Graph2Tac paper is that there are no real obstacles to building a leanpilot for Coq.

view this post on Zulip Jason Rute (Apr 09 2024 at 12:39):

@Bas Spitters I'm not sure what you mean by "leanpilot". I think you are conflating a number of projects. If you mean LeanCopilot, it uses a trained model based on ByT5, if I recall. It is a similar model to ReProver in LeanDojo, but it can run locally on your machine (no API needed). I've heard mixed reviews about whether it is actually helpful. There are a bunch of older models you might also be thinking of. LeanChat was a chat interface for Lean which used Codex to formalize informal math into Lean 3. Codex is no longer available, and I think ChatGPT replaces LeanChat fine. There was also Lean gptf, which was based on the PACT paper. It used an OpenAI API for tactic suggestions (but not for completing proofs). That API is now turned off. There is a similar Facebook API model built into the Lean 3 infoview, based on the HyperTree Proof Search paper. It again offered just tactic suggestions. I don't know if it still works, but Lean 3 isn't used anymore, and that team at Facebook (now Meta) is now mostly at Mistral AI. As for a "leanpilot" for Coq: yes, there are no barriers, and there are a number of interesting approaches one could take. I would say Tactician is already a "Coq copilot". Of course, all the tools I've mentioned are research projects and don't really have a perfect user experience (easy to install, runs in the background, auto-completion, etc.), nor has there been much effort to make sure users actually get value from these tools.

view this post on Zulip Anton Podkopaev (Apr 09 2024 at 12:41):

@Bas Spitters we are still working on benchmarking.

view this post on Zulip Jason Rute (Sep 11 2024 at 21:52):

I've started to play with CoqPilot. While it has many rough edges, it does seem to be a good step in the right direction! But I want to ask about the evaluations in the reports. @Anton Podkopaev Do you have any reason to think there isn't data leakage on the test theorems? Do you think GPT-4 could have been trained on the test theorem proofs from IMM? This could completely invalidate your results.

view this post on Zulip Jason Rute (Sep 11 2024 at 21:54):

(I'm referring to this report: https://aitp-conference.org/2024/abstract/AITP_2024_paper_15.pdf.)

view this post on Zulip Anton Podkopaev (Sep 12 2024 at 09:00):

Hi @Jason Rute !
We have a slightly bigger evaluation in our ASE paper. Yes, IMM might have been used for training the LLMs we used; we don't know.
I wouldn't say that it completely invalidates our results, though, but yes, it is a problem we plan to address in our next evaluations. Recently we developed a new benchmarking system which will make bigger evaluations easier in the framework.
Currently, we are considering testing not on whole lemmas but on their substatements, which should partially address the contamination issue.

view this post on Zulip Nikita Khramov (Sep 12 2024 at 11:47):

Hello @Jason Rute ! First of all, thank you for your feedback. We appreciate you taking the time to try out CoqPilot. Regarding the rough edges you mentioned, we’re eager to improve and refine the project. Could you please share more specifics?

Do you think fine-tuning the underlying models or incorporating techniques like STaR could help address the rough edges you encountered? Or would you recommend a different approach, such as focusing on better prompt engineering or enhancing the overall system architecture?

Any detailed feedback will help us prioritize improvements in the upcoming iterations.

view this post on Zulip Jason Rute (Sep 16 2024 at 11:27):

Sorry, I still have to respond to the questions above. But also, on the Lean Zulip, @Bas Spitters correctly pointed out it would be good to redo the benchmarks using the new o1 model (if you can afford to do so).


Last updated: Oct 13 2024 at 01:02 UTC