I saw this paper today: Proof Automation with Large Language Models by Minghai Lu, Benjamin Delaware, and Tianyi Zhang (none appear to be on Zulip). I haven't read it all, but a few notes:
how do you even solve the LLM problem of training on the test set? We have historically had a problem with there not being benchmarks at all... and now if we release them publicly, they will be off limits to LLM research
as I've written elsewhere, the CoqGym dataset even contains lots of quite terrible Coq code (20+ years old)
Karl Palmskog said:
how do you even solve the LLM problem of training on the test set? We have historically had a problem with there not being benchmarks at all... and now if we release them publicly, they will be off limits to LLM research
I agree it is challenging. Here are some steps in that direction:
A secret evaluation set is (to me at least) quite bad for open science, especially if used as some sort of advantage in publishing (I got this score; you can't even get a score because I won't share)
There is a bit of that. Google has a HiddenMath dataset. But the other ones I've mentioned are administered through Kaggle and, I think, lead to quite open science. For the first AIMO prize, you can only win if you agree to publish your model openly after the competition is over.
it may be the next business model then to "own" a bunch of known-to-be-secret benchmarks (e.g., based on proprietary code) that researchers pay to use in LLM research. A bit dystopian to me.
For the first AIMO prize (which wasn't using formal theorem proving, although I heard the second one might be), there were 110 problems. 10 were public. 50 were semiprivate, in that researchers couldn't see them but could submit their code a few times a day and have the results appear on a public leaderboard. Then, after the competition was over, each team's best code submission was run on 50 completely private problems.
I personally like the idea of regular competitions where the problems are secret up until the competition and then released at the end of the competition, but it is hard for me to know how practical this is.
If the ITP community hosted this, they could set the standard for what sorts of problems they want AI to solve.
I think the moral hazards and potential conflicts of interest are pretty terrible with some people in a community maintaining secret benchmarks that their colleagues use and report scores on. Ideally, the maintainer would have no ties to AI
I don't know. I think the ARC Prize has been pretty helpful in AI research even though the creator is an AI researcher. Sure he could slip the super secret problems to a friend, but what does he gain from doing that? His whole point here is to show that modern models aren't good enough.
also, I think "ITP community" is very fractured in applications. I would personally like to see PL metatheory benchmarks like POPLMark but for AI
The only way I see is to split each individual benchmark set into test and training parts, mark this by some commonly agreed-upon standard, and make sure that LLMs are only trained on the training sets. One needs the buy-in of the LLM creators for this, but they should have an interest in their creations remaining testable. I think one can find out whether an LLM has seen the test set by testing how much it reproduces verbatim from the test set. Of course this also requires some thought and testing on the training/test split to avoid too much overlap.
POPLMark is to me still the most successful ITP benchmark ever
@Michael Soegtrop The commonly agreed-upon standard is, I believe, a canary string. I don't know the details either, but at least for models you can download, I think there are ways to do that sort of verbatim testing. (Note it is more complicated than checking for a verbatim answer, since LLMs are good at returning things that are similar to, but not exactly the same as, their training data.)
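As a rough illustration of both ideas (canary probing and verbatim-overlap testing) for a model you can run locally, here is a minimal sketch assuming the Hugging Face transformers API; the model name, canary string, and prompt-splitting heuristic are all made up for illustration, and the canary convention follows the BIG-bench idea of a unique string embedded in every benchmark file:

```python
# Minimal sketch; model name, canary string, and file contents are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "some-org/some-open-model"            # hypothetical downloadable model
CANARY_PREFIX = "ITP-BENCHMARK CANARY GUID "  # embedded in every benchmark file
CANARY_SECRET = "3f7c1a2e-0000-0000-0000-000000000000"  # hypothetical, random

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# 1. Canary probe: prompt with the prefix and see whether the model completes
#    the secret part. A correct completion is strong evidence the benchmark
#    files were in the training data.
inputs = tok(CANARY_PREFIX, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
completion = tok.decode(out[0][inputs["input_ids"].shape[1]:])
print("canary leaked?", CANARY_SECRET[:8] in completion)

# 2. Verbatim-overlap check: feed the first half of a held-out proof and
#    measure how much of the second half the model reproduces token-for-token.
def verbatim_fraction(text: str, split: float = 0.5) -> float:
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    cut = int(len(ids) * split)
    prompt, reference = ids[:cut], ids[cut:]
    gen = model.generate(prompt.unsqueeze(0), max_new_tokens=len(reference),
                         do_sample=False)[0][cut:]
    matches = sum(int(a == b) for a, b in zip(gen.tolist(), reference.tolist()))
    return matches / max(len(reference), 1)
```

The canary probe only tells you whether the benchmark files were seen at all; the overlap check gives a rough, noisy measure of how much was memorized, which is why the caveat above about near-verbatim reproductions matters.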
@Karl Palmskog what made POPLMark successful? Could those principles be useful today with AI testing or was that just a different era?
The thing that frustrates me so much right now is that we have absolutely no idea what works in AI for ITP because our benchmarks are so broken.
Jason Rute said:
Karl Palmskog what made POPLMark successful? Could those principles be useful today with AI testing or was that just a different era?
To be clear, I meant successful in the sense of how many people actually tried to beat it in different systems and published different clever solutions. Unfortunately, I don't think it led to all that many improvements in tools.
I think they managed to reach an audience with a competitive streak with the right problems, the scientific analogue of people who sit up late and hack together solutions to Advent of Code or similar.
@Jason Rute: do the common ITP benchmarks include such "canary strings", and does one have the impression that these are honoured by the LLMs used in proof automation papers?
Maybe one could test this by putting some things one usually wouldn't write into the sections marked by the "canary string".
No, I’m not aware of any ITP benchmark using canary strings. But another problem is that most of them are based on real-world data, such as ITP code that is already on GitHub or publicly available math competition problems.
Then I would conclude that canary strings are not a feasible method and the ITP community needs to come up with something else, say well-communicated exclusion/classification lists. One could make these part of the benchmark collections, but state that LLM trainers should ensure that if they obtain a source via different paths, the exclusions are OR'ed together.
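As a rough sketch of how such OR'ing could work in practice (the file names, JSON format, and hash choice are all hypothetical): each benchmark collection ships a machine-readable exclusion list, and a trainer unions the lists before filtering its corpus, identifying files by content hash so that the same file obtained via a mirror or fork is still caught.

```python
# Minimal sketch of the "exclusion list" idea; format and field names are hypothetical.
import hashlib
import json
from pathlib import Path

def load_exclusions(paths):
    """Union (OR) the exclusion lists shipped with each benchmark collection."""
    excluded = set()
    for p in paths:
        entries = json.loads(Path(p).read_text())
        excluded |= {e["sha256"] for e in entries}  # identify files by content hash
    return excluded

def keep_for_training(file_path, excluded):
    """A file is eligible for training only if its hash appears in no exclusion list."""
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    return digest not in excluded

# Usage (hypothetical paths):
# excluded = load_exclusions(["coqgym/exclusions.json", "minif2f/exclusions.json"])
# corpus = [f for f in all_source_files if keep_for_training(f, excluded)]
```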
Remember the LLM trainers are OpenAI, Google, Facebook, etc. And moreover, whatever has already been trained on is not going to be un-trained.
Remember the LLM trainers are OpenAI, Google, Facebook, etc.
I think they have an interest in making their LLMs more testable - this is still in its infancy, and testability would be a first step out of it
And moreover, whatever has already been trained on is not going to be un-trained
This is indeed an issue I didn't think of