Stream: Machine learning and automation

Topic: benchmarks for program verification?


view this post on Zulip Jason Gross (Nov 17 2023 at 01:59):

Are there any benchmarks or evaluations for automating program verification?

view this post on Zulip Jason Rute (Nov 17 2023 at 02:18):

https://arxiv.org/abs/2310.04870

view this post on Zulip Jason Rute (Nov 17 2023 at 02:20):

I haven’t read it yet

view this post on Zulip Jason Rute (Nov 17 2023 at 02:21):

If you do, I’d love a summary.

view this post on Zulip Jason Rute (Nov 17 2023 at 02:43):

Also, what in particular are you looking for? Program generation + verification? There are lots of benchmarks for program generation, and lots of benchmarks for theorem proving, but few if any that combine them.

view this post on Zulip Jason Gross (Nov 17 2023 at 04:41):

I was wondering just about the program verification part. I've seen a bunch of benchmarks for theorem proving, but they seem to be closer to math autoformalization and further from program verification.

view this post on Zulip Jason Gross (Nov 17 2023 at 04:43):

This was inspired by seeing a presentation on https://www.openphilanthropy.org/rfp-llm-benchmarks/ and hearing that they'd be open to / interested in proposals for benchmarking LLMs on program verification.

view this post on Zulip Jason Rute (Nov 17 2023 at 04:48):

But what kind of program verification? Do you have a program and a spec? Do you just have a program and need to generate the spec? Or is it just theorem proving, where you are given a theorem in, say, Coq (maybe a CompCert theorem) and have to complete the proof? There is tons of work doing that last thing.

view this post on Zulip Jason Rute (Nov 17 2023 at 04:56):

There is a benchmark for ITP called MiniF2F. It is unfortunately not in Coq, just Metamath, Isabelle, and Lean. It contains formal statements of high school math competition theorems to prove. I could see value in creating a program-verification version of that, where the theorems are more similar to those you find in program verification. I have also thought about building a dataset for Coq/Lean/Agda where you must both synthesize a function and prove stuff about it. Maybe this would be more practical in C++ or Java, but I don't know how proving stuff about C++ or Java programs works.
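As a rough illustration of the kind of combined task I mean (a made-up toy item, not from any existing benchmark): the task would give only the statement of double_plus below, and the system would have to produce both the definition of double and the proof.

```coq
(* Hypothetical benchmark item: synthesize the function and then prove
   the given specification about it. *)
Fixpoint double (n : nat) : nat :=
  match n with
  | 0 => 0
  | S k => S (S (double k))
  end.

Theorem double_plus : forall n : nat, double n = n + n.
Proof.
  induction n as [| k IH]; simpl.
  - reflexivity.
  - rewrite IH. rewrite <- plus_n_Sm. reflexivity.
Qed.
```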

view this post on Zulip Jason Rute (Nov 17 2023 at 04:58):

I also thought someone told me about a competition where one must generate programs according to a specification and those are automatically checked with an ATP. I could be misunderstanding. I can’t find it by Googling.

view this post on Zulip Jason Rute (Nov 17 2023 at 04:58):

(But this is the other direction from what you are proposing.)

view this post on Zulip Jason Rute (Nov 17 2023 at 05:05):

And back to MiniF2F: there are tons of benchmarks where the goal is to reprove theorems in Coq, Isabelle, Lean, etc. just from the theorem statement. One in Coq is the CoqGym benchmark. (CoqHammer, Tactician, and Proverbot9001 have their own similar benchmarks as well.)

view this post on Zulip Jason Rute (Nov 17 2023 at 14:59):

@Jason Gross I reread your response about math autoformalization. I'm a bit confused: is the emphasis on math or on autoformalization? If the latter, I think I made the point above that a number of benchmarks, like MiniF2F, aren't really about autoformalization (converting natural language into formal language), but instead about automated theorem proving in some ITP language. If the emphasis is on math, I'm not really convinced that the techniques needed to solve a theorem in CompCert are that different from those for a theorem in MathComp. But the Coq theorem provers do have some program verification material in their test suites. CoqGym is a mix of packages. The original Proverbot9001 paper was trained and tested on CompCert, and the recent COPRA paper using GPT-4 tested on the same CompCert theorems to compare against Proverbot9001.

view this post on Zulip Jason Rute (Nov 17 2023 at 15:09):

Besides automated theorem proving in Coq, there is also work on AI for proof repair in Coq and on AI for lemma naming in Coq.

view this post on Zulip Jason Rute (Nov 17 2023 at 15:09):

But overall I’d be curious what you would like a program verification benchmark to look like.

view this post on Zulip Jason Rute (Nov 17 2023 at 15:19):

(Sorry, I'm talking too much, but) the Lean folks often complain that the current benchmarks and projects don't accurately represent their work. Doing well on MiniF2F doesn't mean an ATP will be useful for formalizing math in Lean. I've long told them that they should build their own benchmarks which better capture their goals and desires in an AI assistant. I think the same could go for program verification (which I know less about).

view this post on Zulip Karl Palmskog (Nov 17 2023 at 15:32):

one low-hanging fruit for a program verification benchmark is probably the POPLMark challenge. There are tons of different solutions floating around for each problem.

view this post on Zulip Karl Palmskog (Nov 17 2023 at 15:34):

for Coq/Agda/Lean, it would probably make sense to have a "sigma type" benchmark, where both a program and its correctness proof are supposed to be produced at the same time (as an inhabitant of a given sigma type)
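As a tiny Coq sketch of what one such benchmark item could look like (the statement is only an illustrative placeholder): the benchmark would give just the sigma type, and the solver would have to inhabit it, producing the program and its correctness proof together.

```coq
(* Hypothetical task: inhabit the sigma type below, i.e. produce a
   predecessor-like function together with a proof of its specification. *)
Definition pred_spec : forall n : nat, {m : nat | n = 0 \/ n = S m}.
Proof.
  intros [| k].
  - exists 0. left. reflexivity.
  - exists k. right. reflexivity.
Defined.
```

Extraction then recovers the underlying program, since the proof component of the sig type is erased.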

view this post on Zulip Jason Rute (Nov 17 2023 at 18:34):

@Karl Palmskog can you say more about what this POPLMark challenge is? I tried to read a bit on that site and it was all very vague-sounding.

view this post on Zulip Karl Palmskog (Nov 17 2023 at 18:43):

POPLMark was about formalizing programming language metatheory, that is, essentially proving properties about all programs in a language. In a typed language, two classical properties are preservation (reduction does not change a well-typed term's type) and progress (a well-typed term is either a value or can take a reduction step).

When the language uses complex binding of variables, these properties can be tricky to prove, and POPLMark specifically wanted people to prove properties about languages with binders.
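Schematically (the names tm, ty, has_type, step, and value below are placeholders, not POPLMark's actual definitions), the two properties have roughly this shape in Coq:

```coq
(* Abstract shape of the two classical properties for a hypothetical
   typed language; all names here are placeholders. *)
Section TypeSafetyShape.
  Variables tm ty : Type.                 (* terms and types *)
  Variable has_type : tm -> ty -> Prop.   (* (empty-context) typing *)
  Variable step : tm -> tm -> Prop.       (* small-step reduction *)
  Variable value : tm -> Prop.            (* fully evaluated results *)

  Definition preservation : Prop :=
    forall e e' T, has_type e T -> step e e' -> has_type e' T.

  Definition progress : Prop :=
    forall e T, has_type e T -> value e \/ exists e', step e e'.
End TypeSafetyShape.
```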

view this post on Zulip Karl Palmskog (Nov 17 2023 at 18:46):

here is one solution to a subset of the POPLMark problems in MathComp; maybe that can make it more concrete: https://github.com/math-comp/POPLmark

view this post on Zulip Karl Palmskog (Nov 17 2023 at 18:52):

I believe each POPLMark problem can at a high level be divided into the following tasks: formulating the syntax of the language (in particular its binders), formulating the inductive relations such as typing and reduction, and proving the metatheoretic properties themselves.

So if turned into an AI benchmark, one might want to consider those separately.

view this post on Zulip Jason Rute (Nov 17 2023 at 18:57):

That sounds like a good, although ambitious, set of benchmarks.

view this post on Zulip Karl Palmskog (Nov 17 2023 at 19:02):

for training/evaluation, I think a lot of the manually-written POPLMark solutions could be split up according to the three tasks. But the interesting aspect is that the formulation of both the syntax and the inductive relations is critical for getting the proofs through Coq/Lean/etc. Even small differences in binder representation can make proofs much easier or much harder.
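For example, here are two standard ways the term syntax might be written in Coq (just a sketch with arbitrary constructor names): the named version forces you into alpha-equivalence and capture-avoiding substitution, while the de Bruijn version trades that for index-lifting lemmas, and the resulting proofs look quite different.

```coq
Require Import String.

(* Lambda-term syntax with named binders. *)
Inductive term_named : Type :=
| NVar : string -> term_named
| NLam : string -> term_named -> term_named
| NApp : term_named -> term_named -> term_named.

(* The same syntax with de Bruijn indices: a variable is the number of
   binders between its occurrence and the binder it refers to. *)
Inductive term_db : Type :=
| DVar : nat -> term_db
| DLam : term_db -> term_db
| DApp : term_db -> term_db -> term_db.
```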

view this post on Zulip Abhishek Anand (Nov 17 2023 at 19:44):

I'd like to see a benchmark containing pairs of C/C++/Rust functions and their specs, e.g. in separation logic (e.g. Iris). Andrew Appel's VST work could be a good source, as most of it is already open source. At Bedrock Systems, we have a large set of functions with specs and proofs, and AFAIK we have plans to open source them eventually...
(Along with benchmarks, it would be great if people proving such specs could use some (to-be-developed) IDE plugin to record more info for training, e.g. how users edit specs while developing proofs (say, when they realize the spec is unprovable), how users edit specs/definitions in response to errors, and how users query Coq while writing specs/proofs.)

view this post on Zulip Alex Sanchez-Stern (Nov 17 2023 at 20:02):

I think the IDE you want is already developed; that's where the analytics data repo came from. Someone would have to take over collecting the data now that that study has ended, but the collection code is all there. https://github.com/uwplse/coq-change-analytics https://www.alexsanchezstern.com/papers/replica.pdf

view this post on Zulip Abhishek Anand (Nov 17 2023 at 23:11):

I have seen that work and it would be great to build upon it: I would even add the option of recording word-level edits that happen in between calls to coqtop, though maybe it won't provide much additional information.

view this post on Zulip Jason Gross (Nov 18 2023 at 22:40):

I'm thinking of something like what you're describing, @Abhishek Anand. I think the most ambitious form of this benchmark would be something that scales from the relatively straightforward (a large suite of C/C++/Rust functions + specs, the goal being to synthesize the spec from the function source code + documentation, and to synthesize a proof from the spec + function source code + documentation) up to the super-ambitious (given scaffolding to explore, compile, and run a large but in-theory well-delimited project, e.g. the source of sha512sum, gcc, clang, or coqchk, plus access to the proof assistant, make significant progress towards formally verifying the code base, either autonomously or by assisting a human).

I'm especially interested in benchmarks that are geared (at the medium-to-difficult end, at least) towards realistic tasks that could actually speed up the work of formal verification engineers.

I think things like POPLMark are interesting, but they feel much more like research than engineering, and I'm more interested in benchmarks that test the ability to deploy, scale, and automate existing, accepted solutions in new-but-not-fundamentally-novel domains. E.g., I'm less interested in benchmarking the ability to automatically come up with especially tricky loop invariants than the ability to automate all the work around the loop invariant, so that a human could be shown just the functions or loops where automatic spec/invariant discovery failed, could supply the insight or understanding, and could then have the process of testing out that loop invariant or spec (or whatever) automated. (This seems compatible with the Lemur paper you linked, @Jason Rute, btw.)

Maybe this is too ambitious though?
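Concretely, at the straightforward end, I could imagine each benchmark item bundling roughly the following data (only a sketch of a possible shape; the field names are made up, not a proposed format):

```coq
Require Import String.

(* Hypothetical shape of one "straightforward" benchmark item;
   the field names are illustrative only. *)
Record benchmark_item : Type := {
  source          : string;        (* the C/C++/Rust function to verify *)
  documentation   : string;        (* accompanying informal documentation *)
  spec            : string;        (* target specification, when given *)
  reference_proof : option string  (* a known proof, for reference/scoring *)
}.
```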

view this post on Zulip Bas Spitters (Nov 19 2023 at 20:04):

Jason Rute said:

If you do, I’d love a summary.

Aren't summaries what LLMs are supposed to be good at? :-)

view this post on Zulip Patrick Nicodemus (Nov 19 2023 at 21:32):

I think ChatGPT could probably summarize a paper at least as well as, say, someone who isn't really reading the paper carefully and has strong preconceived ideas about what the paper actually says.

view this post on Zulip Jason Rute (Nov 19 2023 at 22:14):

Feel free to give it a try and post it here.

view this post on Zulip Alex Sanchez-Stern (Nov 20 2023 at 19:16):

Ah, doing keystroke-level recording would require much deeper instrumentation, and would likely have a harder time finding users due to privacy concerns. To me it seems like you wouldn't need it, because between calls to coqtop the person writing the code has no extra feedback or implicit state, making it much like the text generation tasks that LLMs already work really well at. But I could see wanting to explore that if you're willing to put in the engineering time.

view this post on Zulip Abhishek Anand (Dec 01 2023 at 14:10):

Alex Sanchez-Stern said:

To me it seems like you wouldn't need it, because between calls to coqtop the person writing the code has no extra feedback or implicit state, making it much like the text generation tasks that LLMs already work really well at.

I think there is some useful state in what happens even between communications with coqtop. For example, IMO the biggest reason ChatGPT is so bad at dependently typed programming is that it keeps hallucinating library items instead of doing queries (e.g. Search) like an actual human. When I write a nontrivial Gallina function, I often have to do many queries even before I try to coqtop-check the function I am writing. Typically, those queries go just above the function I am writing. I think it is important to teach LLMs, e.g. via fine-tuning, when to ask which queries while writing Gallina functions, and valuable information would be lost if we just looked at the coqtop communication.
That said, I don't see much value in going all the way to keystroke-level recording. Because LLMs produce output a word at a time, I don't think keystroke-level tracking would provide more useful info for training than tracking user actions at the word level.
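For example, the query-then-write loop I have in mind looks roughly like this (a minimal sketch; the exact lemma name is just whatever Search happens to turn up):

```coq
Require Import List.

(* Query the environment before writing the function, instead of
   guessing lemma names. *)
Search (length (_ ++ _)).   (* finds app_length (length_app in newer Coq) *)
Check app_length.           (* length (l ++ l') = length l + length l' *)

(* Then write the Gallina code using only names the queries confirmed exist. *)
Definition count_both {A : Type} (l1 l2 : list A) : nat := length (l1 ++ l2).
```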

view this post on Zulip Alex Sanchez-Stern (Dec 01 2023 at 17:32):

Oh, hmm, I think we might have a difference in understanding of how coqtop works. In my understanding, queries like Search, Locate, Check, etc. that the user might write to get more information before writing a function are also sent through coqtop. For instance, if a REPLICA user started writing a function, then ran some search commands before it, cancelled them, and used that information to finish writing the function, REPLICA would see them run each query command before sending the function, and record exactly which queries they made.

view this post on Zulip Paolo Giarrusso (Dec 01 2023 at 17:53):

yes, those queries go through coqtop in Proof General, since there's nothing else to do. Fancier IDE protocols might add side channels.

view this post on Zulip Paolo Giarrusso (Dec 01 2023 at 17:54):

(fancier, like the three protocols used respectively by CoqIDE, VsCoq 2, and coq-lsp)

view this post on Zulip Alex Sanchez-Stern (Dec 01 2023 at 17:55):

Right, but in those cases you would just instrument the whole protocol handler, which would include those side channels, similar to how REPLICA instruments the whole protocol handler of coqtop.

view this post on Zulip Abhishek Anand (Dec 01 2023 at 18:02):

Alex Sanchez-Stern said:

Oh, hmm, I think we might have a difference in understanding of how coqtop works. In my understanding, queries like Search, Locate, Check, etc. that the user might write to get more information before writing a function are also sent through coqtop. For instance, if a REPLICA user started writing a function, then ran some search commands before it, cancelled them, and used that information to finish writing the function, REPLICA would see them run each query command before sending the function, and record exactly which queries they made.

No, I think we are on the same page about how coqtop works; I should have explained it better. Yes, REPLICA would see the queries, but not the context (the partially written function) in which those queries were made. REPLICA would record a bunch of queries and then a full function definition. For small functions, this may suffice. For larger functions, I think the context of the partially written function would be useful. I want to train LLMs to make queries while writing parts of functions, just like humans do.

view this post on Zulip Emilio Jesús Gallego Arias (Dec 04 2023 at 17:57):

Indeed, REPLICA used an STM hook that has quite a few limitations. I took this into consideration when designing coq-lsp/Flèche, so you can instrument at a much lower level.

Note that a quite useful thing w.r.t. instrumenting coq-lsp is that it can be deployed in a web version quite easily, which is very convenient for running 0-install user experiments.

