Stream: Machine learning and automation

Topic: PutnamBench


view this post on Zulip Bas Spitters (Jul 23 2024 at 09:24):

A benchmark of formalized mathematical statements in Coq, Lean, and Isabelle. The paper has some interesting observations about the fragmentation of the Coq ecosystem.
https://arxiv.org/pdf/2407.11214
https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/PutnamBench

view this post on Zulip Jason Rute (Jul 24 2024 at 07:32):

Also, I think the Lean folks found some less-than-ideal translations in places, which the authors were happy to address. The Coq folks may be interested in similarly looking over the translations.

view this post on Zulip Jason Rute (Jul 24 2024 at 07:32):

Cc @Amitayush Thakur

view this post on Zulip Bas Spitters (Jul 24 2024 at 07:53):

Yes, the Coq formalizations displayed in the paper seemed more verbose than the Lean ones, probably because they were written by somewhat inexperienced users.
Still, I'm happy they made the effort to formalize this work in three proof assistants.

view this post on Zulip Karl Palmskog (Aug 01 2024 at 14:53):

there is some vindication here for the Coq ecosystem, since they were able to use packages we kept alive in Coq-community and elsewhere (e.g., GeoCoq, Coqtail-math), but there are also serious errors in the paper as described here.

view this post on Zulip Jason Rute (Aug 01 2024 at 15:01):

Like I said above, there is still opportunity to fix the formalizations in the benchmark. (Cc @Amitayush Thakur)

view this post on Zulip Karl Palmskog (Aug 01 2024 at 15:06):

as per the other thread, the "right approach" is probably to (re)do everything idiomatically using only MathComp+Analysis, and some GeoCoq. It would likely take quite some time and effort, even discounting the effort to learn to write idiomatic MathComp+Analysis.

view this post on Zulip Karl Palmskog (Aug 01 2024 at 15:07):

but the paper errors are basically orthogonal to this

view this post on Zulip Amitayush Thakur (Aug 01 2024 at 15:27):

Hi All,
Thank you so much for the feedback. We will fix the formalizations that you think have issues. It would be great if you could open GitHub issues with the details. I would also like to emphasize that we are not experts in Coq, so there may be conventions that we have missed.

view this post on Zulip Karl Palmskog (Aug 01 2024 at 15:29):

I can open a general feedback issue on GitHub, but it's really not as simple as pointing to a line of code that should be changed. To give one example, idiomatic MathComp does not really use anything outside mathcomp, e.g., one does not import ZArith, Arith, etc.
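
To make that concrete with a toy sketch (the lemma names and the statement below are made up, not taken from the benchmark), here is the same trivial fact stated first in a stdlib style and then staying entirely inside mathcomp:

(* stdlib style: imports Arith and uses the Prop-valued <= *)
Require Import Arith.

Lemma example_std (n : nat) : 0 <= n * n.
Proof. apply Nat.le_0_l. Qed.

(* idiomatic MathComp: only mathcomp imports, <= is the boolean leq *)
From mathcomp Require Import all_ssreflect.

Lemma example_mc (n : nat) : 0 <= n * n.
Proof. exact: leq0n. Qed.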

view this post on Zulip Karl Palmskog (Aug 01 2024 at 15:30):

and using idiomatic MathComp makes it less likely that CoqHammer will succeed, since the encoding approach in mathcomp is not compatible with CoqHammer's default automation

view this post on Zulip Amitayush Thakur (Aug 01 2024 at 15:31):

It would be great if you could create a general issue and show at least a few examples of how we can migrate to idiomatic MathComp versions of the formalizations.

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 19:02):

Last week I opened several issues regarding incorrect or non-idiomatic formalizations, such as this one. It looks to me like there is quite a lot of work to do in order to have reasonable confidence in these formalizations, not only the Coq ones but also the Lean and Isabelle ones. I would strongly recommend that the authors privately attempt to formalize one or two problems from each math category (without publishing). This should give a lot of insight into what the best formal statements should be.

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 19:07):

As a side note: the Tactician benchmark from the paper is not correct, because no instrumentation of the background math was done. It would be great if this could be corrected in the arXiv version.

That being said, I don't expect any existing tools such as CoqHammer, Tactician, SMTCoq, itauto, ... to do well on this benchmark. These tools are simply not designed for this, as almost any problem will require at least a basic amount of "conjecturing". As such, I wouldn't worry too much about this:

and using idiomatic MathComp makes it less likely that CoqHammer will succeed, since the encoding approach in mathcomp is not compatible with CoqHammer's default automation

view this post on Zulip Karl Palmskog (Aug 02 2024 at 19:09):

maybe you want to make some comments on GitHub as a memento for this stuff [related to the paper]; I was asked to open this issue w.r.t. Coq in the paper: https://github.com/trishullab/PutnamBench/issues/190

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 19:14):

Another thing I think would be really important for the authors to address: there should be some stewardship document that states how benchmarks should be run and how the benchmark should evolve. This should include how bugs are fixed and how this affects previous benchmarks that were run on faulty formalizations. It should also cover how the versions of the background knowledge should evolve. Should libraries like mathcomp be fixed to a single version or not? One of the easiest ways to make progress in this benchmark is to "manually" add all the missing lemmas to mathcomp that might be needed to solve a problem. (Of course, the maturity of background libraries is also a massive issue when comparing automation between proof assistants.)

view this post on Zulip Karl Palmskog (Aug 02 2024 at 19:15):

it would probably make sense to just target the Coq Platform. My conjecture is that it has "everything" needed except for GeoCoq.

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 19:15):

Yes, but even then it matters which version of the platform, right?

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 19:16):

Karl Palmskog said:

maybe you want to make some comments on GitHub as a memento for this stuff [related to the paper]; I was asked to open this issue w.r.t. Coq in the paper: https://github.com/trishullab/PutnamBench/issues/190

Yeah, maybe I'll do that. But I guess I also wanted to write here so that more people would see my comments :-)

view this post on Zulip Karl Palmskog (Aug 02 2024 at 19:18):

sure, I was just saying it may make sense to double-post things to GitHub that should be fixed for the arXiv paper since most people are not aware of this Zulip or this channel/thread

view this post on Zulip Karl Palmskog (Aug 02 2024 at 19:19):

Lasse Blaauwbroek said:

Yes, but even then it matters which version of the platform, right?

yes, but at least it's a fully defined thing when you add a release number. Right now, they don't define any versions of anything

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 19:21):

Yes, the minimum thing that should be done is to make an opam package out of it, with fixed or semi-fixed version constraints. And indeed, those constraints should probably be compatible with the Coq Platform.
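
A rough sketch of what such a package manifest could look like (the dependency names are real opam packages, but the file itself and the exact version bounds are hypothetical and would have to be chosen by the maintainers):

opam-version: "2.0"
synopsis: "Formal statements of Putnam problems for Coq"
depends: [
  "coq" {>= "8.18" & < "8.21"}
  "coq-mathcomp-ssreflect" {>= "2.2.0"}
  "coq-mathcomp-analysis" {>= "1.0.0"}
]
build: [make "-j%{jobs}%"]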

view this post on Zulip Karl Palmskog (Aug 02 2024 at 19:29):

I'm also not sure that this is the best way to encode the problems:

Theorem putnam_XYZZ_ab : stmt.
Proof.
Admitted.

Wouldn't it make more sense to just make the statement a Definition, so people can prove it directly?
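
As a sketch (putnam_XYZZ_ab and stmt are placeholders, as above), the benchmark file would then only carry the statement:

Definition putnam_XYZZ_ab : Prop := stmt.

and whoever attempts the problem proves it in their own file, so no Admitted ends up in the benchmark itself:

Theorem putnam_XYZZ_ab_proof : putnam_XYZZ_ab.
Proof.
  (* unfold the definition and prove the actual statement *)
Qed.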

view this post on Zulip Karl Palmskog (Aug 02 2024 at 19:30):

for anyone who imports this stuff, there's a real risk that they pull in admitted statements which imply False (and which automation machinery could then inadvertently exploit)

view this post on Zulip Karl Palmskog (Aug 02 2024 at 19:36):

I tried one quick "simple" idiomatic translation here

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 19:58):

I wouldn't be too worried about this. Clearly the files in this benchmark are not meant to be imported. Any AI should just have a safeguard not to import other files. If several problems were in the same file, I agree this might be a problem. (This was the case several weeks ago, but they changed it.)

view this post on Zulip Karl Palmskog (Aug 02 2024 at 20:00):

well, I thought this was not "just" a benchmark for particular AI systems, but also for regular ATP and humans

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 20:23):

Do you have an example of how this might go wrong for any human, AI, ATP, or otherwise?

view this post on Zulip Karl Palmskog (Aug 02 2024 at 20:28):

let's say you import several of the problem statements, for whatever reason. But the automation you are using looks at what you have admitted, and derives False from it, which it then uses to prove new stuff
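
A toy sketch of the failure mode (the statement here is deliberately false and purely hypothetical, not from the benchmark):

(* an admitted "benchmark" statement that happens to be false *)
Theorem bogus : forall n : nat, n = 0.
Admitted.

(* anything that imports it can now derive False, and hence any goal *)
Lemma anything : False.
Proof. discriminate (bogus 1). Qed.

Automation that harvests facts from the environment can end up doing essentially this without the user noticing.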

view this post on Zulip Karl Palmskog (Aug 02 2024 at 20:29):

more concrete example: https://gitlab.inria.fr/fbesson/itauto/-/issues/31

view this post on Zulip Lasse Blaauwbroek (Aug 02 2024 at 20:32):

Yeah, I guess this is hypothetically possible. But I don't really see why any reasonable system would import several problem statements. That being said, I also don't have any real objections to using Definition instead of Theorem.

view this post on Zulip Karl Palmskog (Aug 02 2024 at 20:33):

for example, a human might want to prove several statements in the same file. Unfortunately, a significant fraction of the Coq user base has no idea how to set up a Coq project properly, so they work in single files

view this post on Zulip Karl Palmskog (Aug 03 2024 at 10:06):

did some quick examples of idiomatic conversion to MathComp Analysis, hopefully someone looks at this more seriously: https://github.com/trishullab/PutnamBench/issues/192

view this post on Zulip Jason Rute (Aug 03 2024 at 10:44):

Lasse Blaauwbroek said:

It should also cover how the versions of the background knowledge should evolve. Should libraries like mathcomp be fixed to a single version or not? One of the easiest ways to make progress in this benchmark is to "manually" add all the missing lemmas to mathcomp that might be needed to solve a problem. (Of course, the maturity of background libraries is also a massive issue when comparing automation between proof assistants.)

I would advocate for not fixing versions. This benchmark is hard enough, and such tools are hard enough to develop, that overly specific versions would mostly make it difficult to run this benchmark on new solvers. Imagine a new hammer that has some logical background theorems built into it for reconstruction. Or imagine a new Coq Pilot which happens to be trained on a certain version of Coq and MathComp. I think it would be a pity if we couldn't benchmark those things. (It is already a pity we can't benchmark Graph2Tac on this benchmark because of version differences.) Sure, cheating is possible, but this is academia. We as a community put a spotlight on suspicious results until those suspicions are resolved. There is no prize money here that we have to worry about safeguarding, and, as you say, we are already comparing across different ITPs, so it isn't like there is a great deal of consistency. I would instead hope for large, significant progress rather than a 1% improvement on SoTA. (Or for good ablations that show that some feature helps against a baseline using the same setup.)

view this post on Zulip Karl Palmskog (Aug 03 2024 at 10:52):

the standard approach in the Coq ecosystem is to test a development with a range of Coq versions, and adjust opam bounds so that it's at least known to check properly. Then one can leave the upper version bounds open. But it's generally a waste of everyone's time to allow versions of packages that don't work.

view this post on Zulip Karl Palmskog (Aug 03 2024 at 18:43):

it's not obvious to me how these kinds of benchmarks that formalize natural language should be validated. Even if the statements had full proofs, a statement could capture a different property than the intended one, or it could be vacuously true, with the proof just proving something trivial with lots of work.

One could say that the validation is completely manual by humans, but then I think it should be done systematically and documented.

view this post on Zulip Lasse Blaauwbroek (Aug 03 2024 at 23:02):

I would advocate for not fixing versions. This benchmark is hard enough [..] Sure, cheating is possible, but this is academia. We as a community put a spotlight on suspicious results until those suspicions are resolved.

I'm really not sure what the correct policy on versions should be. This benchmark is hard enough that reaching 80% could easily take longer than 10 years. It does indeed seem unrealistic to keep versions fixed over such a long timeframe. On the other hand, I'm not really worried about cheating. Rather I'm worried that libraries will mature significantly over a period of 10 years. This could potentially make the proofs significantly easier. How do we know that we are actually measuring improvements in AI?

view this post on Zulip Karl Palmskog (Aug 05 2024 at 21:09):

@George Tsoukalas do you have plans to set up any CI for Coq? We have a GitHub Action (using Coq's official Docker images) that could be helpful for that: https://github.com/coq-community/docker-coq-action

But first you probably want to set up a basic Coq build based on coq_makefile/make or Dune.
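
For the coq_makefile route, the usual shape is roughly this (the directory layout, the logical name Putnam, and the file names are only an illustration, not the repo's actual structure):

# _CoqProject
-R coq Putnam
coq/putnam_XYZZ_ab.v
coq/putnam_XYZZ_cd.v

# generate the Makefile and build
coq_makefile -f _CoqProject -o Makefile
make -j4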

view this post on Zulip George Tsoukalas (Aug 06 2024 at 15:26):

There is a setup script at the moment; we are planning on adding a Makefile for the build soon. If you can point us to an example repository using the GitHub Action, that would be great, but in any case we should be able to set it up.

view this post on Zulip Karl Palmskog (Aug 06 2024 at 16:03):

this is my "tutorial" repo for setting up Coq builds and CI; the files are explained in the README: https://github.com/coq-community/coq-program-verification-template

It has both Dune and coq_makefile/make builds, but you probably want to have only one kind of build

view this post on Zulip Karl Palmskog (Aug 12 2024 at 12:29):

spotted as an AITP talk:

Brando Miranda, Aryan Gulati, Eric Chen, Kai Fronsdal, Emily Xia and Bruno Dumont. Putnam-MATH: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning.

which sounds related to PutnamBench. It would of course be unfortunate if we ended up with two ITP benchmarks for the same natural-language problems...

view this post on Zulip Jason Rute (Aug 19 2024 at 16:04):

Given that “MATH” is capitalized, it might be a Putnam-problem variant of the “MATH” benchmark, which is a natural-language benchmark rather than a formal one. Of course @Brando Miranda would know for sure (or we could wait two weeks for the AITP talk).

view this post on Zulip George Tsoukalas (Sep 02 2024 at 14:21):

I've updated some formalizations to depend on mathcomp, correcting them where necessary, in the following PR: https://github.com/trishullab/PutnamBench/pull/201/files. If anyone has suggestions for making them more idiomatic based on mathcomp definitions and notation, we would greatly appreciate it as we continue to make these modifications!

