Just spotted this one. Has anyone tried it?
https://marketplace.visualstudio.com/items?itemName=JetBrains-Research.coqpilot
Unfortunately, for my day job I cannot use such systems, since they would upload stuff to servers, which we don't want. Are there models one can run locally? How much GPU RAM do such models require? Testing it on toy stuff is likely not helpful for evaluating it.
if they only support OpenAI models (which is claimed in the description), then no local models are currently supported. I guess we wait for someone to extend it to a downloadable model.
for example, Facebook's "open access" LLMs (LLaMA) have variants that fit in 8 GB of GPU RAM, but for reasonable performance, 16 GB or even 24 GB of GPU RAM is probably needed to use the LLaMA variants with more parameters.
24 GB wouldn't be a problem. I was more afraid that one needs a machine with 8 high-end GPUs or the like.
I think the biggest released Facebook LLM can be run with 48 GB RAM. But as expected, it performs much worse than the models behind an API, at least on some tasks. Nevertheless, reasonable for use by paranoid power users, who might even do their own additional training on top of it.
What I don't quite get: do these models generally understand Coq as is, or does one have to use a specially trained model?
Is it so that with a smallish model Coq is basically supported but unusable, and with the big server models the Coq support is at least not completely unreasonable?
I haven't done research on LLMs, but the general idea is that the model has a general knowledge of lots of languages which can be accessed through prompt engineering: "write a gcd function in Coq/SSReflect". Additional training could refine the responses to fit what Coq users expect, such as giving .v code examples with comments
the main drawback of current non-tuned LLMs for Coq in my view is that they have never "seen" proof states, since this was not available in their general training data - they have only seen .v files and other Vernacular fragments
we discovered, using traditional neural learning techniques, that even Coq code formatting by proof engineers typically takes the proof state into account in a nontrivial way
bottom line, I think the smaller models (7B parameters, <= 8 GB RAM) are going to do Coq Gallina code suggestion quite well, and proofs quite badly. The situation with proofs might improve a bit if you move to 65B parameter models, and probably the Gallina writing might be a lot better, or at least have potential to be better with question-response tuning.
but people more familiar with LLM research might be able to contradict me...
@Michael Soegtrop Tactician runs locally (and adapts instantly to your projects). Proverbot9001 and all the CoqGym family of models (ASTactic, TacTok, DIVA, Passport) also run locally, but they are research software and I don't know how easy they are to use.
but I think the question here was specifically about models for applying CoqPilot? I wonder how many of ProverBot, etc., would support this workflow (coq-lsp is cutting edge Coq)
As for models which have seen proof states, the recent Llemma model has seen Lean and Isabelle proof states (but not Coq). It isn't really clear if GPT-4 has seen proof states. Daniel Selsam works at OpenAI and used to be a Lean developer. Also, OpenAI used to do theorem proving research on Lean. Also, I have one GitHub project that has Lean proof states in the hover text of an HTML file. Also, if they captured CoqGym (not sure how easy that dataset is to ingest), they might also have gotten Coq proof states. I guess the point is no one knows.
OK, let me rephrase defensively: the mistakes/inadequacies I observe in Coq code generated with current LLMs (Facebook LLaMA, ChatGPT, Copilot, etc.) are consistent with the models not being aware of Coq proof states, and also consistent with what we found lacking in custom neural models that were not aware of proof states (but only of Coq syntax, tokens, AST)
@Karl Palmskog @Jason Gross : thank you for the explanations and sharing your insights - very helpful!
As for this project, I wonder how it works. I also wonder if they will produce a paper or share the details otherwise. My guess is (like others have suggested) it doesn’t work so well, but that it is hard to tell since there aren’t good benchmarks (and it requires access to OpenAI’s models). By the way Copra is another paper using GPT-4 for proofs in Lean and Coq. I think the big thing this project provides is a nice user interface.
Copra: https://arxiv.org/abs/2310.04353
@Anton Podkopaev is one of the developers of Coqpilot. I hope he'll be able to answer the questions above.
My big question for @Anton Podkopaev is the following: Is it just more or less outputting a proof as a code completion task, or is it doing a proof search? Also, it is just proof completion right now, correct? There aren't other coding tasks yet, right?
it seems tempting to suggest that "proper" training for AI systems to do Coq proofs would involve low-level learning of primitives for manipulating the proof context, such as the nice set of primitives here.
At least I've always felt skeptical about the "Ltac completion" approach, not least because that's not how real proof engineers do it.
Hi @Jason Rute, for now CoqPilot does the following. Once a user invokes the corresponding action, CoqPilot takes the proof state before each admit in the file and sends a request to ChatGPT to fill it. ChatGPT then sends us n potential solutions for each hole (30 is the default, but it can be changed in the settings); we typecheck them, and if a solution is accepted, we insert it in place of the admit.
We don't try to do regular completion.
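Roughly, the flow looks like this (just a minimal TypeScript sketch to illustrate the description above; findAdmits, requestCompletions, and typechecks are placeholder names, not our actual internal API):

```typescript
// Minimal sketch of the hole-filling flow described above (placeholder names,
// not the real CoqPilot internals).
interface Position { line: number; character: number; }
interface Hole { position: Position; proofState: string; }
interface CoqDocument { replace(at: Position, proof: string): void; }

declare function findAdmits(doc: CoqDocument): Hole[];
declare function requestCompletions(goal: string, n: number): Promise<string[]>;
declare function typechecks(doc: CoqDocument, at: Position, proof: string): Promise<boolean>;

async function fillAdmits(doc: CoqDocument, n = 30): Promise<void> {
  for (const hole of findAdmits(doc)) {             // proof state before each admit
    const candidates = await requestCompletions(hole.proofState, n);
    for (const proof of candidates) {
      if (await typechecks(doc, hole.position, proof)) {
        doc.replace(hole.position, proof);           // insert the first accepted proof
        break;
      }
    }
  }
}
```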
We are thinking about supporting local LLMs instead of remote ChatGPT calls.
@Anton Podkopaev are there examples of data sent to ChatGPT for some nontrivial Coq file somewhere? Would be interesting to feed one manually to different LLMs just to get a general idea of how they do
@Anton Podkopaev : I wonder if you tell ChatGPT in some way which of the 30 solutions was successful. If not, how does it improve on this?
@Anton Podkopaev The one-shot experiments in the Copra paper, as well as the personal experience of anyone who has tried to use ChatGPT to fill in a proof in Coq (or Lean), suggest this approach isn't that strong (especially compared to all the other work in this field). Is there a reason you think this is a value add to users? Is the user interface the real selling feature here? Do you find it is actually very useful in practice? Or is this just a stepping stone to better approaches?
Although, as I've often said the true test is the end users, so if people find value, that is good enough, and it is quite commendable you have an easy-to-use tool.
@Jason Rute, for now, CoqPilot is an MVP. We plan to implement and compare other strategies, but I believe that this is a nice interface.
Anton Podkopaev said:
Hi Jason Rute, for now CoqPilot does the following. Once a user invokes the corresponding action, CoqPilot takes the proof state before each admit in the file and sends a request to ChatGPT to fill it. ChatGPT then sends us n potential solutions for each hole (30 is the default, but it can be changed in the settings); we typecheck them, and if a solution is accepted, we insert it in place of the admit.
you may get better results by sending the error messages back to ChatGPT asking for a fix, and also encouraging it to ask queries (e.g. Search) instead of hallucinating lemma names:
https://chat.openai.com/share/63cb977c-6022-4ee0-8618-8c6d04cb56b4
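something like this loop, for instance (a rough sketch only; askModel and tryProof are hypothetical helpers, not an existing API):

```typescript
// Hypothetical sketch of an error-feedback loop: if a candidate proof fails to
// typecheck, feed the Coq error message back to the model and ask for a fix.
interface Message { role: "system" | "user" | "assistant"; content: string; }
declare function askModel(messages: Message[]): Promise<string>;
declare function tryProof(goal: string, proof: string): Promise<{ ok: boolean; error?: string }>;

async function proveWithFeedback(goal: string, maxRounds = 3): Promise<string | null> {
  const messages: Message[] = [
    { role: "user", content: `Prove the following Coq goal:\n${goal}` },
  ];
  for (let round = 0; round < maxRounds; round++) {
    const proof = await askModel(messages);
    const result = await tryProof(goal, proof);
    if (result.ok) return proof;                     // accepted by the typechecker
    // Otherwise, send the error back and ask for a corrected proof.
    messages.push({ role: "assistant", content: proof });
    messages.push({ role: "user", content: `That proof failed with:\n${result.error}\nPlease fix it.` });
  }
  return null;                                       // give up after maxRounds attempts
}
```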
@Michael Soegtrop currently, we are not sending any feedback to ChatGPT, but we'll definitely experiment with these strategies.
I would love it if work like this also was able to publish performance on benchmarks so that it could be compared to other work. I understand comparing automated theorem provers for ITPs (especially Coq) is a huge pain, but it would help a lot.
Yes, I agree, we definitely plan to have benchmarks eventually :)
BTW, we are happy to collaborate on the tool, if anyone is interested )
Have you considered autoformalization?
Would custom GPTs be helpful here? I haven't looked at how hard they are to build.
https://openai.com/blog/introducing-gpts
Karl Palmskog wrote:
Anton Podkopaev are there examples of data sent to ChatGPT for some nontrivial Coq file somewhere? Would be interesting to feed one manually to different LLMs just to get a general idea of how they do
Hi @Karl Palmskog! For now, the main purpose of sending data other than the theorem we are proving is to fix the format of the output for the LLM, i.e. GPT. The problem is that we want the GPT response to only contain Coq code in a specific format, e.g. Proof. ${proof} Qed. To achieve this, we send a simple system prompt and then a history of messages in the following way:
user: Theorem test : P.
assistant: Proof. ... Qed.
...
This history is taken from the file and contains only theorems, and only those that have valid proofs.
This also gives GPT an idea of the theorems we have in context and of proofs of potentially similar goals.
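So the request we assemble looks roughly like this (an illustrative sketch only; the actual system prompt and helper names differ):

```typescript
// Illustrative sketch of the prompt construction described above; the exact
// system prompt and types used by CoqPilot are different.
interface Message { role: "system" | "user" | "assistant"; content: string; }
interface ProvedTheorem { statement: string; proof: string; }

function buildChatHistory(proved: ProvedTheorem[], target: string): Message[] {
  const messages: Message[] = [
    // Pin down the output format: only Coq code of the form Proof. ... Qed.
    { role: "system",
      content: "Reply only with Coq code of the form: Proof. ${proof} Qed." },
  ];
  // Few-shot examples: theorems from the same file that already have valid proofs.
  for (const thm of proved) {
    messages.push({ role: "user", content: thm.statement });
    messages.push({ role: "assistant", content: thm.proof });
  }
  // Finally, the theorem whose hole we want to fill.
  messages.push({ role: "user", content: target });
  return messages;
}
```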
All in all, we wanted to create a framework that is as modular around the LLM as possible, and we would love to experiment with models other than GPT.
@Bas Spitters, yes, we have looked into it. It might be helpful. There are many ideas and ways to improve. Right now we have a plain baseline. Thank you for the ideas!
thanks, Andrei. Then I guess the most challenging limitation in open-access LLM models will be the size of the context, which depends on the size of the Coq file. Without fine-tuning, I think some models max out at a couple of thousand tokens of context...
Yes, you are right; unfortunately this is especially the case for GPT-3.5, for example, where only around 10-12 small-to-average-size theorems fit within the limit. So if we exceed it, we randomly choose theorems until the limit is reached, or choose those which are around the theorem to be solved.
For GPT-4, however, the limit is several times bigger, and as far as we have observed, out-of-the-box results are better, but the model is also several times more expensive.
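In other words, the selection is roughly like this (a simplified sketch; countTokens stands in for whatever tokenizer the target model uses):

```typescript
// Simplified sketch of fitting few-shot theorems into the model's context window.
interface ProvedTheorem { statement: string; proof: string; }
declare function countTokens(text: string): number;

function selectWithinBudget(theorems: ProvedTheorem[], budget: number): ProvedTheorem[] {
  // Shuffle (or, alternatively, rank by proximity to the target theorem) and
  // greedily add theorems until the token budget for the prompt is exhausted.
  const shuffled = [...theorems].sort(() => Math.random() - 0.5);
  const chosen: ProvedTheorem[] = [];
  let used = 0;
  for (const thm of shuffled) {
    const cost = countTokens(thm.statement) + countTokens(thm.proof);
    if (used + cost > budget) continue;
    chosen.push(thm);
    used += cost;
  }
  return chosen;
}
```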
hmm, instead of choosing randomly, you might want to run premise selection as in CoqHammer? https://github.com/lukaszcz/coqhammer/tree/coq8.18/src/predict (this predict executable will be available to anyone that installs the Coq Platform)
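just to illustrate the idea (this is not what CoqHammer's predict actually does, which I believe is k-NN/naive Bayes over features), even a crude lexical ranking would already beat choosing at random:

```typescript
// Crude illustrative premise selection: rank the file's proved theorems by
// lexical (Jaccard) similarity to the current goal and keep the top k.
// CoqHammer's predict is far more sophisticated; this is only a stand-in.
interface ProvedTheorem { statement: string; proof: string; }

function tokenize(s: string): Set<string> {
  return new Set(s.split(/\W+/).filter(t => t.length > 0));
}

function similarity(goal: string, thm: ProvedTheorem): number {
  const a = tokenize(goal);
  const b = tokenize(thm.statement);
  const inter = [...a].filter(t => b.has(t)).length;
  return inter === 0 ? 0 : inter / (a.size + b.size - inter);  // Jaccard index
}

function selectPremises(goal: string, theorems: ProvedTheorem[], k: number): ProvedTheorem[] {
  return [...theorems]
    .sort((x, y) => similarity(goal, y) - similarity(goal, x))
    .slice(0, k);
}
```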
Abhishek Anand wrote:
Anton Podkopaev said:
Hi Jason Rute, for now CoqPilot does the following. Once a user invokes the corresponding action, CoqPilot takes the proof state before each admit in the file and sends a request to ChatGPT to fill it. ChatGPT then sends us n potential solutions for each hole (30 is the default, but it can be changed in the settings); we typecheck them, and if a solution is accepted, we insert it in place of the admit.
you may get better results by sending the error messages back to ChatGPT asking for a fix, and also encouraging it to ask queries (e.g. Search) instead of hallucinating lemma names:
https://chat.openai.com/share/63cb977c-6022-4ee0-8618-8c6d04cb56b4
Hi @Abhishek Anand! Yes, we also feel it can improve performance. Hopefully this could be added in milestone 2.0.0.
Karl Palmskog wrote:
hmm, instead of choosing randomly, you might want to run premise selection as in CoqHammer? https://github.com/lukaszcz/coqhammer/tree/coq8.18/src/predict (this predict executable will be available to anyone that installs the Coq Platform)
Thanks, Karl! I am not very familiar with how CoqHammer works, but I will dive into this!
Does someone know what features of emacs+proof-general+company-coq CoqPilot is missing? Does CoqPilot have jump to definition, basic syntax highlighting, some autocomplete smarter than just suggesting words in the current document (e.g. company-coq queries Coq for all constants visible at the current position, along with their minimally+sufficiently qualified names)? I am thinking about instrumenting some currently popular open-source Coq IDE with features to record how humans do proofs, like the REPLICA work. I currently use company-coq, but it seems most Coq programmers are switching or have already switched to VsCoq.
@Abhishek Anand according to the 2022 Coq survey (hundreds of responses), the community was then split fairly evenly between CoqIDE, Emacs+ProofGeneral, and VsCoq: https://coq.discourse.group/t/coq-community-survey-2022-results-part-ii/1746#coq-user-interfaces-7
to my knowledge (I didn't try it), since CoqPilot is a general VS Code extension, it can be used together with both VsCoq2 and Coq-LSP, which I believe have the usual syntax highlighting and jump-to-definition stuff...
the main difference from 2022 to today editor-wise is that it has been announced that CoqIDE is going to be removed from Coq's repository: https://github.com/coq/ceps/blob/master/text/068-coqide-split.md
Wait, jump-to-definition in vscode for Coq? I didn't see it in either extension. Coq-lsp has an issue for it: https://github.com/ejgallego/coq-lsp/issues/317
@Abhishek Anand
Does someone know what features of emacs+proof-general+company-coq CoqPilot is missing?
Yes, @Karl Palmskog is right--CoqPilot is not trying to be a replacement for the coq-lsp/VsCoq extensions and is supposed to be used with them. Under the hood, we use coq-lsp for processing documents, but that doesn't prevent you from using VsCoq as your main Coq-support plugin in VS Code.
Hi folks, great work! I'm very interested in helping your extension run smoothly. I've created a coordination issue https://github.com/ejgallego/coq-lsp/issues/628, but I'm also very happy to keep chatting on Zulip (maybe we could have a CoqPilot stream) or over Zoom cc: @Anton Podkopaev @Andrei Kozyrev
I have a few ideas already:
- build on top of the coq-lsp/VSCode extension; ViZX and graph-editor already do that, and it works for them
- extend the coq-lsp API and protocol quite a bit so you can directly get the info you are requesting in FlecheDocUtils, and provide a stable interface to you (the current approach is too low level and thus sensitive to Coq versions); moreover, Fleche internally has all the info you are reconstructing.
@Abhishek Anand coq-lsp does support jump to definition (only within the same file for now) and some forms of completion. More advanced completion is in a PR, as well as project-wide build and symbol lookup; there is still some work to do in both cases, but it can be prioritized if needed.
Support for telemetry so far is only in the engine, but things are in place so it should be easy to log quite a few events.
Yes, thank you @Emilio Jesús Gallego Arias!
By the way @Anton Podkopaev @Andrei Kozyrev , you may be interested in the working group for Coq + UI that @Benjamin Pierce has been running lately (c.f. https://coq.zulipchat.com/#narrow/stream/237661-User-interfaces-devs-.26-users/topic/Topics.20for.20next.20Coq.20UI.20discussion.3F )
If you are interested, it seems to me that a presentation of CoqPilot could be interesting
I'm wondering whether techniques like deep learning with compiler feedback could also help us to have LLMs produce Coq code that is more likely to compile:
https://arxiv.org/abs/2305.18341
https://dl.acm.org/doi/pdf/10.1145/3576915.3623175
LeanPilot looks nice. How much of this functionality do we have available in CoqPilot already, perhaps by combining it with Tactician?
https://github.com/lean-dojo/LeanCopilot
IIUC: CoqPilot does not do backtracking tree search and just gets LLMs to produce the whole proof in one shot. LeanPilot asks LLMs to generate the next proof step and does backtracking based on the top few choices. I think CoqPilot should plug in COPRA (https://arxiv.org/pdf/2310.04353.pdf), which does backtracking and also feeds error messages back to LLMs (ChatGPT).
I think tweaking COPRA to do tree of thoughts should give another boost https://arxiv.org/abs/2305.10601
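for comparison, the stepwise search is roughly this (a hedged sketch; suggestTactics and applyTactic are placeholders for the model and the proof engine, not any existing API):

```typescript
// Hedged sketch of backtracking proof search over per-step tactic suggestions,
// as opposed to asking for the whole proof in one shot.
declare function suggestTactics(state: string, k: number): Promise<string[]>;
// Returns the resulting proof state, "" if all goals are closed, or null if the tactic fails.
declare function applyTactic(state: string, tactic: string): Promise<string | null>;

async function search(state: string, depth: number): Promise<string[] | null> {
  if (depth === 0) return null;
  for (const tac of await suggestTactics(state, 5)) {    // try the top few suggestions
    const next = await applyTactic(state, tac);
    if (next === null) continue;                          // tactic failed: backtrack
    if (next === "") return [tac];                        // proof complete
    const rest = await search(next, depth - 1);
    if (rest !== null) return [tac, ...rest];             // found a proof below this node
  }
  return null;                                            // nothing worked within the depth limit
}
```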
Despite using language models, CoqPilot and LeanPilot use very different amounts of compute. LeanPilot runs on a CPU. But at the same time, it is not clear it is better than Coq tools like CoqHammer, Tactician or ProverBot9001 (when run on a CPU).
Tactician (which is for Coq) and LeanPilot also have very similar interfaces. Maybe people who are familiar with Coq and Lean should try them both out on similar problems and compare.
probably good to keep in mind that the CoqHammer author deliberately did not support automatic proofs by induction (like HOL4's Metis and recent similar tools for Lean)
(do machine learning evals generally take into account which logic fragment a goal requires?)
No, evaluations don't typically take this into account, but at the same time the big advantage of machine learning is that it isn't constrained to a logic fragment. It is certainly true that CoqHammer works on problems the ML provers don't, so they can be complementary. Some tools like Thor smartly combine ML and Hammer approaches. Others like MagnusHammer replace most of the ATP parts of the hammer with ML lemma selection.
I see CoqPilot has reached 2.0.0. Has anyone used it?
https://marketplace.visualstudio.com/items/JetBrains-Research.coqpilot/changelog
As an aside, they mention the limited context window of ChatGPT. Apparently X's Grok 1.5 provides a much larger window, but it needs a paid account. I'm not aware of any open models with a large context window.
@Bas Spitters I agree with you that using tools like this is the best test. I think there are two ways to rank tools like this. (1) User experience and (2) ability to close the goals you want automatically closed.
My guess is that the current best tool for user experience is GitHub Copilot. I don't use it (because of work reasons), but the Lean people I talk to who do say it is better than the ITP-specific tools they have tried, and it does a lot more than close goals. Also, they like the user experience overall.
As for the ability to close goals, I find it very suspicious that CoqPilot hasn't benchmarked this. I think that means either it is really expensive to run, or it isn't very good at closing goals. From our Graph2Tac paper, Tactician's default k-NN seems to be the leader right now at closing goals, but it is even better if combined with Graph2Tac (and CoqHammer).
Of course, if some Coq users are adventurous enough to try all the tools (GitHub Copilot, CoqPilot, Tactician, Graph2Tac, ProverBot9001 via Proofster, CoqHammer, and others) and rank them both on their user experience and their ability to close goals, I think it would be a great community service.
@Anton Podkopaev has any such benchmarking been done for CoqPilot?
@Jason Rute Thanks!
I understand leanpilot depends on the OpenAI codex model.
I believe https://huggingface.co/blog/starcoder tries to be an open alternative.
What I understand from your Graph2Tac paper is that there are no real obstacles to building a leanpilot-like tool for Coq.
@Bas Spitters I'm not sure what you mean by "leanpilot". I think you are conflating a number of projects. If you mean LeanCopilot, that uses a trained model based on ByT5, if I recall. It is a similar model to ReProver in LeanDojo, but it can run locally on your machine (no API needed). I've heard mixed reviews about whether it is actually helpful. There are a bunch of older models you might also be thinking of. LeanChat was a chat interface for Lean which used Codex to formalize informal math to Lean 3. Codex is no longer available, and I think ChatGPT replaces LeanChat fine. There was also Lean gptf, which was based on the PACT paper. It used an API to OpenAI for tactic suggestions (but not for completing proofs). That API is now turned off. There is a similar Facebook API model built into the Lean 3 infoview. It was based on the Hypertree Proof Search paper. It again was just tactic suggestions. I don't know if it still works, but Lean 3 isn't used anymore, and that team at Facebook (now Meta) is now mostly at Mistral AI. As far as a "leanpilot" for Coq, yes, there are no barriers, and there are a number of interesting approaches one could take. I would say Tactician is already a "Coq copilot". Of course, everything I've mentioned are research projects and don't really have the perfect user experience (easy to install, runs in the background, auto-completion, etc.), nor has there been much effort to make sure users actually get value from these tools.
@Bas Spitters we are still working on benchmarking.
I've started to play with CoqPilot. While it has many rough edges, it does seem to be a good step in the right direction! But I want to ask about the evaluations in the reports. @Anton Podkopaev Do you have any reason to think there isn't data leakage on the test theorems? Do you think GPT-4 could have been trained on the test theorem proofs from IMM? This could completely invalidate your results.
(I'm referring to this report: https://aitp-conference.org/2024/abstract/AITP_2024_paper_15.pdf.)
Hi @Jason Rute !
We have a slightly bigger evaluation in our ASE paper. Yes, IMM might have been used for training the LLMs we use; we don't know.
I wouldn't say that it completely invalidates our results, though, but yes, it is a problem we plan to address in our next evaluations. Recently we developed a new benchmarking system which will make bigger evaluations within the framework easier.
Currently, we are considering testing not on whole lemmas but on their sub-statements, which should partially address the contamination issue.
Hello @Jason Rute ! First of all, thank you for your feedback. We appreciate you taking the time to try out CoqPilot. Regarding the rough edges you mentioned, we’re eager to improve and refine the project. Could you please share more specifics?
Do you think fine-tuning the underlying models or incorporating techniques like STaR could help address the rough edges you encountered? Or would you recommend a different approach, such as focusing on better prompt engineering or enhancing the overall system architecture?
Any detailed feedback will help us prioritize improvements in the upcoming iterations.
Sorry, I still have to respond to the questions above. But also, on the Lean Zulip, @Bas Spitters correctly pointed out it would be good to redo the benchmarks using the new o1 model (if you can afford to do so).