A benchmark of formalized mathematical statements in Coq, Lean, and Isabelle. The paper has some interesting observations about fragmentation of the Coq ecosystem.
https://arxiv.org/pdf/2407.11214
https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/PutnamBench
Also, I think the Lean folks found some less-than-ideal translations in places, which the authors were happy to address. The Coq folks may be interested in similarly looking over the translations.
Cc @Amitayush Thakur
Yes, the Coq formalizations displayed in the paper seemed more verbose than the Lean ones, probably because they were written by somewhat inexperienced users.
Still I'm happy they made an effort to formalize this work in three proof assistants.
there is some vindication here for the Coq ecosystem, since they were able to use packages we kept alive in Coq-community and elsewhere (e.g., GeoCoq, Coqtail-math), but there are also serious errors in the paper as described here.
Like I said above, there is still opportunity to fix the formalizations in the benchmark. (Cc @Amitayush Thakur)
as per the other thread, the "right approach" is probably to (re)do everything idiomatically using only MathComp+Analysis, and some GeoCoq. It would likely take quite some time and effort, even discounting the effort to learn to write idiomatic MathComp+Analysis.
but the paper errors are basically orthogonal to this
Hi All,
Thank you so much for the feedback. We will fix the formalizations that you think have issues. It would be great if you could open GitHub issues with the details. I would also like to emphasize that we are not experts in Coq, so there may be conventions that we have missed.
I can open a general feedback issue on GitHub, but it's really not as simple as pointing to a line of code that should be changed. To give one example, idiomatic MathComp does not really use anything outside mathcomp, e.g., one does not import ZArith, Arith, etc.
and using idiomatic MathComp makes it less likely that CoqHammer will succeed, since the encoding approach in mathcomp is not compatible with CoqHammer's default automation
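For concreteness, a minimal toy sketch of my own (not taken from the benchmark) of what "staying inside mathcomp" looks like: integers come from ssrint/ssralg via all_algebra, and there is no Require Import ZArith or Arith anywhere.
(* toy example, everything comes from mathcomp *)
From mathcomp Require Import all_ssreflect all_algebra.
Import GRing.Theory.
Local Open Scope ring_scope.

(* a statement over mathcomp's int; no ZArith/Arith imports needed *)
Lemma toy_int_fact (x : int) : x + 0 = x.
Proof. by rewrite addr0. Qed.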
It would be great if you could create a general issue and show at least a few examples of how we can migrate to an idiomatic MathComp version of the formalization.
Last week I opened several issues regarding incorrect or non-idiomatic formalizations, such as this one. It looks to me like there is quite a lot of work to do in order to have reasonable confidence in these formalizations. Not only the Coq ones but also the Lean and Isabelle ones. I would strongly recommend that the authors privately attempt to formalize one or two problems from each math category (without publishing). This should give a lot of insight into what the best formal statements should be.
As a side note: the Tactician benchmark from the paper is not correct, because no instrumentation of the background math was done. It would be great if this could be corrected in the arXiv version.
That being said, I don't expect any existing tools such as CoqHammer, Tactician, SMTCoq, itauto, ... to do well on this benchmark. These tools are simply not designed for this, as almost any problem will require at least a basic amount of "conjecturing". As such, I wouldn't worry too much about this:
and using idiomatic MathComp makes it less likely that CoqHammer will succeed, since the encoding approach in mathcomp is not compatible with CoqHammer's default automation
maybe you want to make some comments on GitHub as a memento for this stuff [related to the paper], I was asked to open this issue w.r.t. Coq in the paper: https://github.com/trishullab/PutnamBench/issues/190
Another thing I think would be really important for the authors to address: There should be some stewarding document that states how benchmarks should be run, and how the benchmark should evolve. This should include how bugs are being fixed and how this affects previous benchmarks that were run on faulty formalizations. Also, considerations on how the versions of the background knowledge should be evolved. Should libraries like mathcomp be fixed to a single version or not? One of the easiest ways to make progress in this benchmark is to "manually" add all the missing lemmas to mathcomp that might be needed to solve a problem. (Of course, the maturity of background libraries is also a massive issue when comparing automation between proof assistants.)
it would probably make sense to just target the Coq Platform. My conjecture is that it will have "everything" needed except for GeoCoq
Yes, but even then it matters which version of the platform, right?
Karl Palmskog said:
maybe you want to make some comments on GitHub as a memento for this stuff [related to the paper], I was asked to open this issue w.r.t. Coq in the paper: https://github.com/trishullab/PutnamBench/issues/190
Yeah, maybe I'll do that. But I guess I also wanted to write here so that more people would see my comments :-)
sure, I was just saying it may make sense to double-post things to GitHub that should be fixed for the arXiv paper since most people are not aware of this Zulip or this channel/thread
Lasse Blaauwbroek said:
Yes, but even then it matters which version of the platform, right?
yes, but at least it's a fully defined thing when you add a release number. Right now, they don't define any versions of anything
Yes, the minimum thing that should be done is to make an Opam package out of it, with fixed or semi-fixed version constraints. And indeed, those constraints should probably be compatible with Coq Platform.
I'm also not sure that this is the best way to encode the problems:
Theorem putnam_XYZZ_ab : stmt.
Proof.
Admitted.
Wouldn't it make more sense to just make the statement a Definition, so people can prove it directly?
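Schematically (reusing the placeholder stmt above; the names here are just for illustration), the suggested encoding would leave nothing Admitted in the problem file:
(* the problem file only defines the statement *)
Definition putnam_XYZZ_ab : Prop := stmt.

(* a solver or a human then proves it elsewhere, e.g.:
   Theorem putnam_XYZZ_ab_solution : putnam_XYZZ_ab.
   Proof. (* actual proof *) Qed. *)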
for anyone who imports this stuff, there's a real risk of ending up with admitted statements that are effectively False
(which could inadvertently be used in automation machinery)
I tried one quick "simple" idiomatic translation here
I wouldn't be too worried about this. Clearly the files in this benchmark are not meant to be imported. Any AI should just have a safeguard not to import other files. If several problems were in the same file, I agree this might be a problem. (This was the case several weeks ago, but they changed it.)
well, I thought this was not "just" a benchmark for particular AI systems, but also for regular ATP and humans
Do you have an example of how this might go wrong for any human, AI, ATP, or otherwise?
let's say you import several of the problem statements, for whatever reason. But the automation you are using looks at what you have admitted, and derives False from it, which it then uses to prove new stuff
more concrete example: https://gitlab.inria.fr/fbesson/itauto/-/issues/31
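As a toy sketch of my own (not something from the benchmark): suppose a mistranslation produced a false statement that was left Admitted; anything that imports it can then derive False.
(* hypothetical mistranslated problem statement, left Admitted *)
Theorem putnam_bogus_example : forall n : nat, n = n + 1.
Proof. Admitted.

(* an importer (or its automation) can now prove False from it *)
Lemma oops : False.
Proof.
  pose proof (putnam_bogus_example 0) as H.  (* H : 0 = 0 + 1 *)
  simpl in H.                                (* H : 0 = 1 *)
  discriminate H.
Qed.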
Yeah, I guess this is hypothetically possible. But I don't really see why any reasonable system would import several problem statements. That being said, I also don't have any real objections to using Definition instead of Theorem.
for example, a human might want to prove several statements in the same file. Unfortunately, there is a significant fraction of the Coq userbase who has no idea how to set up a Coq project properly, so they work on single files
did some quick examples of idiomatic conversion to MathComp Analysis, hopefully someone looks at this more seriously: https://github.com/trishullab/PutnamBench/issues/192
Lasse Blaauwbroek said:
Also, considerations on how the versions of the background knowledge should be evolved. Should libraries like mathcomp be fixed to a single version or not? One of the easiest ways to make progress in this benchmark is to "manually" add all the missing lemmas to mathcomp that might be needed to solve a problem. (Of course, the maturity of background libraries is also a massive issue when comparing automation between proof assistants.)
I would advocate for not fixing versions. This benchmark is hard enough, and such tools are hard enough to develop, that overly specific version requirements would mostly make it difficult to run this benchmark on new solvers. Imagine a new hammer which has some logical background theorems used in it for reconstruction. Or imagine a new Coq Pilot which happens to be trained on a certain version of Coq and MathComp. I think it would be a pity if we couldn't benchmark those things. (It is already a pity we can't benchmark Graph2Tac on this benchmark because of version differences.) Sure, cheating is possible, but this is academia. We as a community put a spotlight on suspicious results until those suspicions are resolved. There is no prize money here that we have to worry about safeguarding, and as you say, we are already comparing across different ITPs, so it isn't like there is a great deal of consistency. I would instead hope for large, significant progress instead of a 1% improvement on SoTA. (Or for good ablations that show that some feature helps against a baseline using the same setup.)
the standard approach in the Coq ecosystem is to test a development with a range of Coq versions, and adjust opam bounds so that it's at least known to check properly. Then one can leave the upper version bounds open. But it's generally a waste of everyone's time to allow versions of packages that don't work.
it's not obvious to me how these kinds of benchmarks that formalize natural language should be validated. Even if the statements had full proofs, a proof could be of a different property than the intended one, or the statement could be vacuously true and the proof is just proving something trivial with lots of work.
One could say that the validation is completely manual by humans, but then I think it should be done systematically and documented.
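As a toy illustration of my own of the vacuity problem (again, not a real benchmark statement): if a hypothesis gets mistranslated into something unsatisfiable, the "theorem" becomes provable while saying nothing about the intended problem.
(* the hypothesis n < 0 can never hold for a nat, so any conclusion follows *)
Theorem putnam_vacuous_example (n : nat) (H : n < 0) : n = 42.
Proof. inversion H. Qed.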
I would advocate for not fixing versions. This benchmark is hard enough [..] Sure, cheating is possible, but this is academia. We as a community put a spotlight on suspicious results until those suspicions are resolved.
I'm really not sure what the correct policy on versions should be. This benchmark is hard enough that reaching 80% could easily take longer than 10 years. It does indeed seem unrealistic to keep versions fixed over such a long timeframe. On the other hand, I'm not really worried about cheating. Rather I'm worried that libraries will mature significantly over a period of 10 years. This could potentially make the proofs significantly easier. How do we know that we are actually measuring improvements in AI?
@George Tsoukalas do you have plans to set up any CI for Coq? We have a GitHub Action (using Coq's official Docker images) that could be helpful for that: https://github.com/coq-community/docker-coq-action
But first you probably want to set up a basic Coq build based on coq_makefile/make or Dune.
There is a setup script at the moment; we are planning on adding a makefile for the build soon. If you can point us to an example repository using the GitHub Action that would be great, but in any case we should be able to set it up.
this is my "tutorial" repo for setting up Coq builds and CI, files are explained in the README: https://github.com/coq-community/coq-program-verification-template
It has both Dune and coq_makefile/make builds, but you probably want to only have one kind of build
spotted as an AITP talk:
Brando Miranda, Aryan Gulati, Eric Chen, Kai Fronsdal, Emily Xia and Bruno Dumont. Putnam-MATH: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning.
which sounds related to PutnamBench. Would of course be unfortunate if we get two ITP benchmarks for the same natural language problems...
Given that “MATH” is capitalized, it might be a Putnam-problem variant of the “MATH” benchmark which is a natural language benchmark instead of a formal one. Of course @Brando Miranda would know for sure (or we could wait two weeks for the AITP talk).
I've produced some formalizations changed to depend on mathcomp, and corrected where necessary, in the following PR: https://github.com/trishullab/PutnamBench/pull/201/files. If anyone has any suggestions for making them more idiomatic based on mathcomp definitions and notation, we would greatly appreciate it as we continue to make these modifications!