Are there any benchmarks or evaluations for automating program verification?
https://arxiv.org/abs/2310.04870
I haven’t read it yet
If you do, I’d love a summary.
Also, what in particular are you looking for? Program generation + verification? Because there are lots of benchmarks on program generation. And lots of benchmarks on theorem proving. But few if any which combine them.
I was wondering just about the program verification part. I've seen a bunch of benchmarks for theorem proving, but they seem to be closer to math autoformalization and further from program verification.
This was inspired by seeing a presentation on https://www.openphilanthropy.org/rfp-llm-benchmarks/ and hearing that they'd be open to / interested in proposals for benchmarking LLMs on program verification.
But what kind of program verification? Do you have a program and a spec? Do you just have a program and you need to generate the spec? Or is it just theorem proving, where you are given a theorem in, say, Coq (maybe a CompCert theorem) and have to complete the proof? There are tons of works doing the last thing.
There is a benchmark for ITP called MiniF2F. It is unfortunately not in Coq, just Metamath, Isabelle, and Lean. It contains formal statements of high school math competition theorems to prove. I could see value in creating a PV version of that where the theorems are more similar to those you find in program verification. I have also thought about building a dataset for Coq/Lean/Agda where you must both synthesize a function and prove stuff about it. Maybe this would be more practical in C++ or Java, but I don't know how proving stuff about C++ or Java programs works.
I also thought someone told me about a competition where one must generate programs according to a specification and those are automatically checked with an ATP. I could be misunderstanding. I can’t find it by Googling.
(But this is the other direction from what you are proposing.)
And back to MiniF2F, there are tons of benchmarks where the goal is to reprove theorems in Coq, Isabelle, Lean, etc. just from the theorem statement. One in Coq is the CoqGym benchmark. (CoqHammer, Tactician, and ProverBot have their own similar benchmarks as well.)
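Concretely, an item in such a benchmark is usually just a theorem statement with the proof script removed, something like this (a small self-contained illustration, not an actual CoqGym entry):

```coq
Require Import List.

(* The prover being benchmarked must produce the script between
   Proof and Qed; only the statement is given. *)
Theorem rev_app_distr_item : forall (A : Type) (l l' : list A),
  rev (l ++ l') = rev l' ++ rev l.
Proof.
  (* proof script to be synthesized *)
Admitted.
```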
@Jason Gross I reread your response about math autoformalization. I'm a bit confused. Is the emphasis on math or autoformalization? If the latter, I think I make the point above that a number of benchmarks, like miniF2F, aren't really about autoformalization (converting natural language into formal language), but instead about automated theorem proving in some ITP language. If the emphasis is on math, I'm not really convinced that the techniques needed to solve a theorem in CompCert are that different from a theorem in MathComp. But the Coq theorem provers have some program verification stuff in their test suites. CoqGym is a mix of packages. The original ProverBot9001 paper was trained and tested on CompCert, and the recent COPRA paper using GPT-4 tested on the same CompCert theorems to compare to ProverBot9001.
Also besides automatic theorem proving in Coq, there is also work on AI for proof repair in Coq and work on AI for lemma naming in Coq.
But overall I’d be curious what you would like a program verification benchmark to look like.
(Sorry, I’m talking too much, but) the Lean folks often complain that the current benchmarks and projects don’t accurately represent their work. Doing well on MiniF2F doesn’t mean an ATP will be useful for formalizing math in Lean. I’ve long told them that they should build their own benchmarks which better capture their goals and desires in an AI assistant. I think the same could go for program verification (which I know less about).
one low-hanging fruit for a program verification benchmark is probably the POPLMark challenge. There are tons of different solutions floating around for each problem.
for Coq/Agda/Lean, it would probably make sense to have a "sigma type" benchmark, where both a program and its correctness proof are supposed to be produced at the same time (as an inhabitant of a given sigma type)
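For concreteness, a benchmark item could literally be a sigma type like the one below, with the task being to produce an inhabitant (a made-up sorting example; sort_task and the choice of spec are purely illustrative):

```coq
Require Import List Sorted Permutation.

(* The task: produce a term of this type, i.e. a sorting function
   together with a proof that its output is sorted and a permutation
   of the input. *)
Definition sort_task : Type :=
  { sort : list nat -> list nat
  | forall l, Sorted le (sort l) /\ Permutation l (sort l) }.
```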
@Karl Palmskog can you say more what this POPLMark challenge is? I tried to read a bit on that site and it was all very vague sounding.
POPLMark was about formalizing programming language metatheory, that is, essentially proving properties about all programs in a language. In a typed language, two classical properties are preservation (an evaluation step takes a well-typed program to a well-typed program, at the same type) and progress (a well-typed program is either a value or can take an evaluation step).
When the language uses complex binding of variables, these properties can be tricky to prove, and POPLMark specifically wanted people to prove properties about languages with binders.
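As a rough sketch of what those two statements look like, with the language itself left abstract (term, ty, value, step, and has_type are placeholders, not from any particular POPLMark solution):

```coq
Section Metatheory.
  Variables term ty : Type.
  Variable value : term -> Prop.
  Variable step : term -> term -> Prop.
  Variable has_type : term -> ty -> Prop.

  (* Preservation: an evaluation step keeps the program well typed
     at the same type. *)
  Definition preservation : Prop :=
    forall t t' T, has_type t T -> step t t' -> has_type t' T.

  (* Progress: a well-typed program is a value or can take a step. *)
  Definition progress : Prop :=
    forall t T, has_type t T -> value t \/ exists t', step t t'.
End Metatheory.
```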
here is one solution to a subset of POPLMark problems in MathComp, maybe that can make it more concrete: https://github.com/math-comp/POPLmark
I believe each POPLMark problem can at a high level be divided into the following tasks: defining the syntax of the language (including its binders), defining the inductive relations (typing, reduction, and so on), and proving the metatheoretic properties themselves.
So if turned into an AI benchmark, one might want to consider those separately.
That sounds like a good, although ambitious, set of benchmarks.
for training/evaluation, I think a lot of the manually-written POPLMark solutions could be split up according to the three tasks. But the interesting aspect is that the formulation of both the syntax and the inductive relations is critical for getting the proofs through Coq/Lean/etc. Even small differences in binder representation can make proofs much easier or much harder.
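As a toy illustration of the binder point, here is the same untyped lambda-calculus syntax under two common representations (made-up definitions, not taken from any POPLMark solution); substitution and the metatheory proofs look very different depending on which one you pick:

```coq
(* Named binders: substitution must avoid capture, and proofs tend to
   need alpha-equivalence reasoning. *)
Inductive term_named : Type :=
| NVar (x : nat)                      (* variable name *)
| NLam (x : nat) (body : term_named)  (* \x. body *)
| NApp (f a : term_named).

(* de Bruijn indices: no names and substitution is structural, but
   lemmas about index shifting/lifting are needed instead. *)
Inductive term_db : Type :=
| DVar (i : nat)                      (* distance to the enclosing binder *)
| DLam (body : term_db)
| DApp (f a : term_db).
```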
I'd like to see a benchmark containing pairs of C/C++/Rust functions and their specs, e.g. in separation logic (e.g. Iris). Andrew Appel's VST work can be a good source as most of it is already open source. At Bedrock Systems, we have a large set of functions with specs and proofs, and AFAIK we have plans to open source them eventually.
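As a deliberately simplified stand-in for one such pair (the "spec" here is a plain Gallina lemma over a functional model, not a real Iris/VST separation-logic triple, and all names are made up for illustration):

```coq
(*  C side of the pair:
      unsigned clamp(unsigned x, unsigned hi) { return x < hi ? x : hi; }  *)

Require Import Arith Lia.

(* Functional model of the C function. *)
Definition clamp_model (x hi : nat) : nat := if x <? hi then x else hi.

(* Spec side of the pair, stated over the model. *)
Lemma clamp_model_spec : forall x hi,
  clamp_model x hi <= hi /\ clamp_model x hi <= x.
Proof.
  intros x hi; unfold clamp_model; destruct (x <? hi) eqn:H; simpl.
  - apply Nat.ltb_lt in H; lia.
  - apply Nat.ltb_ge in H; lia.
Qed.
```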
(Along with benchmarks, it would be great if people proving such specs could use some (to-be-developed) IDE plugin to record more info for training, e.g. how users edit specs while developing proofs (e.g. when users realize the spec is unprovable), how users edit specs/definitions on errors, how users query Coq while writing specs/proofs.)
I think the IDE you want is already developed, that's where the analytics data repo came from. Someone would have to take over collecting the data now that that study has ended, but the collection code is all there. https://github.com/uwplse/coq-change-analytics https://www.alexsanchezstern.com/papers/replica.pdf
I have seen that work and it would be great to build upon: I would even add the option of recording word-level edits that happen in between calls to coqtop, but maybe it won't provide much additional information.
I'm thinking of something like what you're describing @Abhishek Anand . I think the most ambitious form of this benchmark would be something that scales from the relatively straightforward (a large suite of C/C++/Rust functions+specs, the goal being to synthesize the spec from the function source code + documentation, and to synthesize a proof from the spec + function source code + documentation) up to super-ambitious (given scaffolding to explore, compile, and run a large but in-theory well-delimited project, e.g., the source of sha512sum, gcc, clang, coqchk; and access to the proof assistant, be able to make significant progress towards formally verifying the code base, either autonomously or in assisting a human)
I'm interested in benchmarks especially that are geared (at the medium-to-difficult end, at least) at realistic tasks that could actually speed up the efforts of formal verification engineers.
I think things like POPLMark are interesting, but feel much more like research than engineering, and I'm more interested in benchmarks that test the ability to deploy, scale, and automate existing accepted solutions in new-but-not-fundamentally-novel domains. e.g., I'm less interested in benchmarks of the ability to automatically come up with especially tricky loop invariants than in the ability to automate all the work around the loop invariant, so that, e.g., a human could be shown just the functions or loops where automatic spec/invariant discovery failed, could come up with the insight or understanding, and then automate the process of testing out that loop invariant or spec or w/e. (This seems compatible with the Lemur paper you linked @Jason Rute , btw)
Maybe this is too ambitious though?
Jason Rute said:
If you do, I’d love a summary.
Aren't summaries what LLMs are supposed to be good at :-)
I think ChatGPT could probably summarize a paper at least as well as, say, someone who isn't really reading the paper carefully and has strong preconceived ideas about what the paper actually says.
Feel free to give it a try and post it here.
Ah, doing stroke-level recording would require much deeper instrumentation, and would likely have a harder time finding users due to privacy concerns. To me it seems like you wouldn't need it, because between calls to coqtop the person writing the code has no extra feedback or implicit state, making it much like the text generation tasks that LLMs already work really well at. But I could see wanting to explore that if you're willing to put in the engineering time.
Alex Sanchez-Stern said:
To me it seems like you wouldn't need it, because between calls to coqtop the person writing the code has no extra feedback or implicit state, making it much like the text generation tasks that LLM's already work really well at.
I think there is some useful state in what happens even between communications with coqtop. For example, IMO the biggest reason ChatGPT is so bad at dependently typed programming is that it keeps hallucinating library items instead of doing queries (e.g. Search) like an actual human. When I write a nontrivial Gallina function, I often have to do many queries even before I try to coqtop-check the function I am writing. Typically, those queries go just above the function I am writing. I think it is important to teach LLMs, e.g. via fine-tuning, when to ask what queries while writing Gallina functions, and valuable information would be lost if we just look at the coqtop communication.
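A toy picture of that workflow, with the queries sitting just above the (made-up) function being written:

```coq
Require Import List Arith.
Import ListNotations.

(* Queries one might run mid-definition, before ever sending the
   finished function to coqtop for checking: *)
Search (length (_ ++ _)).   (* lemmas about the length of an append *)
Search (_ * (_ + _)).       (* distributivity lemmas *)

(* The function being written in the meantime: *)
Fixpoint weighted_sum (l : list nat) : nat :=
  match l with
  | [] => 0
  | x :: xs => x + 2 * weighted_sum xs
  end.
```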
That said, I don't see much value in going all the way to keystroke-level recording. Because LLMs produce output a word at a time, I don't think keystroke-level tracking would provide more useful info for training than tracking user actions at the word level.
Oh, hmm I think we might have a difference in understanding of how coqtop works. In my understanding, queries like Search, Locate, Check, etc that the user might write to get more information before writing a function are also sent through CoqTop. For instance, if a REPLICA user started writing a function, then ran some search commands before it, cancelled them, and used that information to finish writing the function, REPLICA would see them run each query command before sending the function, and record exactly which queries they made.
yes, those queries go through coqtop in Proof General, since there's nothing else to do. Fancier IDE protocols might add side channels
(fancier, as the three protocols used respectively by coqide, vscoq 2, and coq-lsp)
Right, but in those cases you would just instrument the whole protocol handler that would include those side channels, similar to how REPLICA instruments the whole protocol handler of coqtop.
Alex Sanchez-Stern said:
Oh, hmm I think we might have a difference in understanding of how coqtop works. In my understanding, queries like Search, Locate, Check, etc that the user might write to get more information before writing a function are also sent through CoqTop. For instance, if a REPLICA user started writing a function, then ran some search commands before it, cancelled them, and used that information to finish writing the function, REPLICA would see them run each query command before sending the function, and record exactly which queries they made.
no, I think we are on the same page on how coqtop works. I should have explained it better. yes, REPLICA would see the queries but not see the context (partially written function) in which those queries were made. REPLICA would record a bunch of queries and then a full function definition. For small functions, this may suffice. For larger functions, I think the context of the partially written function would be useful. I want to train LLMs to ask for queries while writing parts of functions, just like humans do.
Indeed, REPLICA used an STM hook that has quite a few limitations. I took this into consideration when designing coq-lsp/Flèche so you can instrument at a much lower level.
Note that a quite useful thing w.r.t. instrumenting coq-lsp is that it can be deployed in a web version quite easily, which is very convenient for running 0-install user experiments.