Stream: Machine learning and automation

Topic: Proof Automation with Large Language Models


view this post on Zulip Jason Rute (Sep 24 2024 at 20:49):

I saw this paper today: Proof Automation with Large Language Models by Minghai Lu, Benjamin Delaware, and Tianyi Zhang (none appear to be on Zulip). I haven't read it all, but a few notes:

view this post on Zulip Karl Palmskog (Sep 24 2024 at 21:16):

how do you even solve the LLM problem of training on the test set? We have historically had a problem with there not being benchmarks at all... and now if we release them publicly, they will be off limits to LLM research

view this post on Zulip Karl Palmskog (Sep 24 2024 at 21:17):

as I've written elsewhere, the CoqGym dataset even contains lots of quite terrible Coq code (20+ years old)

view this post on Zulip Jason Rute (Sep 25 2024 at 11:39):

Karl Palmskog said:

how do you even solve the LLM problem of training on the test set? We have historically had a problem with there not being benchmarks at all... and now if we release them publicly, they will be off limits to LLM research

I agree it is challenging. Here are some steps in that direction:

view this post on Zulip Karl Palmskog (Sep 25 2024 at 11:41):

secret evaluation set is (to me at least) quite bad for open science, especially if used as some sort of advantage in publishing (I got this score, you can't even get a score because I won't share)

view this post on Zulip Jason Rute (Sep 25 2024 at 11:43):

There is a bit of that. Google has a HiddenMath data set. But the other ones I've mentioned are administered through Kaggle and I think lead to quite open science. For the first AIMO prize, you can only win if you agree to publish your model openly after the competition is over.

view this post on Zulip Karl Palmskog (Sep 25 2024 at 11:50):

it may be the next business model then to "own" a bunch of known-to-be-secret benchmarks (e.g., based on proprietary code) that researchers pay for to use in LLM research. A bit dystopian to me.

view this post on Zulip Jason Rute (Sep 25 2024 at 11:52):

For the first AIMO prize (which wasn't using formal theorem proving, although I heard the second one might be), there were 110 problems. 10 were public. 50 were semiprivate in that researchers couldn't see them, but they could submit their code a few times a day and the results would appear on a public leaderboard. Then after the competition was over, the best code submission of each team were run on 50 completely private problems.

view this post on Zulip Jason Rute (Sep 25 2024 at 11:54):

I personally like the idea of regular competitions where the problems are secret up until the competition and then released at the end of the competition, but it is hard for me to know how practical this is.

view this post on Zulip Jason Rute (Sep 25 2024 at 11:55):

If the ITP community hosted this, they could set the standard for what sorts of problems they want AI to solve.

view this post on Zulip Karl Palmskog (Sep 25 2024 at 11:57):

I think the moral hazards and potential conflicts of interest are pretty terrible with some people in a community maintaining secret benchmarks that their colleagues use and report scores on. Ideally the maintainer has no ties to AI

view this post on Zulip Jason Rute (Sep 25 2024 at 12:03):

I don't know. I think the ARC Prize has been pretty helpful in AI research even though the creator is an AI researcher. Sure he could slip the super secret problems to a friend, but what does he gain from doing that? His whole point here is to show that modern models aren't good enough.

view this post on Zulip Karl Palmskog (Sep 25 2024 at 12:06):

also, I think "ITP community" is very fractured in applications. I would personally like to see PL metatheory benchmarks like POPLMark but for AI

view this post on Zulip Michael Soegtrop (Sep 25 2024 at 12:06):

The only way I see is to split each individual benchmark set in test and training and mark this by some commonly agreed on standard and make sure that LLMs are only trained on the training sets. One needs the buy in of the LLM creators for this, but they should have an interest that their creations remain testable. I think one can find out if an LLM has seen the test set by testing how much it reproduces verbatim from the test set. Of course this also requires some thoughts and testing on the training/test set to avoid to much overlap.

view this post on Zulip Karl Palmskog (Sep 25 2024 at 12:07):

POPLMark is to me still the most successful ITP benchmark ever

view this post on Zulip Jason Rute (Sep 25 2024 at 12:10):

@Michael Soegtrop The commonly agreed upon standard is I believe a canary string. I also don't know the details but at least for models you can download, I think there are ways to do that sort of verbatim testing. (Note it is more complicated than getting an answer verbatim since LLMs are good at returning similar but not exactly the same things as their training data.)

view this post on Zulip Jason Rute (Sep 25 2024 at 12:12):

@Karl Palmskog what made POPLMark successful? Could those principles be useful today with AI testing or was that just a different era?

view this post on Zulip Jason Rute (Sep 25 2024 at 12:12):

The thing that frustrates me so much right now is that we have absolutely no idea what works in AI for ITP because our benchmarks are so broken.

view this post on Zulip Karl Palmskog (Sep 25 2024 at 12:16):

Jason Rute said:

Karl Palmskog what made POPLMark successful? Could those principles be useful today with AI testing or was that just a different era?

To be clear, I meant success in how many people actually tried to beat it in different systems and published different clever solutions. Unfortunately, I don't think it lead to all that many improvements in tools.

I think they managed to reach an audience with a competitive streak with the right problems, the scientific analogue of people who sit up late and hack together solutions to Advent of Code or similar.

view this post on Zulip Michael Soegtrop (Sep 25 2024 at 12:24):

@Jason Rute : do the common ITP benchmarks include such "canary strings" and does one have the impression that these are honoured by the LLMs used in proof automation papers?

view this post on Zulip Michael Soegtrop (Sep 25 2024 at 12:26):

Maybe one could test this by putting some stuff one usually wouldn't do into the "canary string" marked sections.

view this post on Zulip Jason Rute (Sep 25 2024 at 12:27):

No, I’m not aware of any ITP benchmark using canary strings. But also the problem is that most of them are based around real world data like ITP code which is on GitHub or publicly available math competition problems.

view this post on Zulip Michael Soegtrop (Sep 25 2024 at 12:32):

Then I would conclude that using canary strings is not a feasible method and the ITP community needs to come up with something else, say exclusion/classification lists which are well communicated. One could make these part of the benchmark collections but state that LLM trainers should ensure that if they get a source via different paths, the exclusions should be ored.

view this post on Zulip Jason Rute (Sep 25 2024 at 12:51):

Remember the LLM trainers are OpenAI, Google, Facebook, etc. And moreover, whatever has already been trained on is not going to be un-trained.

view this post on Zulip Michael Soegtrop (Sep 25 2024 at 16:10):

Remember the LLM trainers are OpenAI, Google, Facebook, etc.

I think they have an interest to make their LLMs better testable - this is still in its infancy and testability would be the first step out of there

And moreover, whatever has already been trained on is not going to be un-trained

This is indeed an issue I didn't think of


Last updated: Oct 13 2024 at 01:02 UTC