@Lasse any plans to test Tactician against/with the latest CoqHammer and/or the sauto tactic? https://github.com/lukaszcz/coqhammer-eval
There is a comparison with CoqHammer in this paper https://doi.org/10.29007/wg1q . Note that this uses the data of the original CoqHammer paper, so the numbers might have changed a little bit
Lasse said:
There is a comparison with CoqHammer in this paper https://doi.org/10.29007/wg1q . Note that this uses the data of the original CoqHammer paper, so the numbers might have changed a little bit
Yes, that's why I'm asking. The CoqHammer from the original CoqHammer paper is very different from the current CoqHammer.
Conclusion: It is more or less on equal footing with the hammer, depending on how you count
Ah, okay. Yes, that's still a todo. One of the issues is that CoqHammer's evaluation paradigm is slightly different from mine, and it is not immediately clear how fair it would be to compare them.
These days, a lot of goals can also be solved by CoqHammer's non-ATP sauto tactic (https://link.springer.com/chapter/10.1007/978-3-030-51054-1_3). Ideally, we would want a reproducibility study that tests all recent proof automation on a single large benchmark. Personally, I think the Stdlib perhaps should not even be included in such a benchmark, since it arguably doesn't resemble a real-world Coq project anymore.
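(For reference, a minimal sketch of how sauto is invoked on its own, assuming the coq-hammer-tactics package is installed; the lemma is just an illustrative toy goal, not something from a benchmark.)

From Hammer Require Import Tactics.  (* provides sauto and the other reconstruction tactics *)

Lemma chain (P Q R : Prop) : (P -> Q) -> (Q -> R) -> P -> R.
Proof.
  (* sauto runs entirely inside Coq, without external ATPs or premise selection *)
  sauto.
Qed.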
Yes, one long-term goal for me is to set up a testing farm that will automatically run all compilable opam projects on my tool, and preferably on others too. The problem with this is that it would require an enormous amount of resources.
@Lasse Guillaume Claret has done quite a lot of work in this direction without a large infrastructure: https://coq-bench.github.io/clean/Linux-x86_64-4.07.1-2.0.6/released/
But generally, I think it would be better if a benchmark were designed with clear principles for inclusion and exclusion. We designed a dataset of MathComp projects, but this was not for proof automation purposes: https://github.com/EngineeringSoftware/math-comp-corpus
But for Tactician one problem I first have to solve is ssreflect. There is no dedicated support for it yet, and without it the predictions are terrible.
@Lasse I believe coq-gym was developed as such a general testing suite. There were some strange issues with the way it harvested libraries, though.
As I understand it, coq-gym just harvested all compilable libraries from opam. I see no reason to use their dataset if we can also directly use opam
Some discussion here. BTW is anyone using this in practice?
https://coq.discourse.group/t/machine-learning-and-hammers-for-coq/303/15
Well... you might want to avoid their issues in harvesting from opam...
@Karl Palmskog: Yes, I was thinking of something similar to coq-bench. But an evaluation of machine learning techniques would be much more expensive, because you have to give the tools a lot of time to prove lemmas.
I have outlined what needs to be done to make opam harvesting fully automatable: https://github.com/coq/opam-coq-archive/issues/1158
I think the real burden is in the automation scripts, not the infrastructure issues in themselves, since even the Flyspeck project used the cloud to check their proofs
I don't immediately see a problem with the automation. opam should be able to automatically figure out what is installable on what Coq version, and then you just install everything, while having the appropriate environment variables set to make Coq do the evaluation. Am I missing something?
There is even an option in opam to ignore any errors during installation and just skip that package.
many machine learning solutions will need to instrument builds in various ways. They will need to consume the metadata and provide additional automation. This is why we provide https://coq.inria.fr/opam/coq-packages.json -- but it's not yet complete
we don't want people to have to modify opam itself or do their scripting on top of opam just to be able to fetch and build some Coq files
Ah, I see what you mean. My personal opinion is that there is a clear difference between end-user systems and AI systems; a direct comparison between them would probably be rather unfair. And it seems to me that automatically running all opam packages through systems that lack an end-user interface would take a lot of engineering (as you already said).
Tactician can be configured to learn from opam projects while they are being compiled through standard opam commands
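(For context, a rough sketch of what the in-file side looks like once a model has been learned; this assumes the coq-tactician package is installed and enabled, the lemma is only an illustration, and whether search actually finds a proof depends on what was learned.)

From Tactician Require Import Ltac1.  (* load Tactician's tactics in this file *)

Lemma orb_comm' (b1 b2 : bool) : orb b1 b2 = orb b2 b1.
Proof.
  (* suggest prints ranked tactic suggestions from the learned model;
     search tries to assemble a complete proof from them *)
  search.
Qed.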
For a fair comparison you would also need to think about how to reconcile systems that rely on training/testing/validation sets vs. systems that use a 'rolling' evaluation.
@Karl Palmskog With the MathComp corpus, do you provide it for all Coq versions from the past years, in such a way that there is no difference between the versions? If newer versions of a package differ from older ones, the utility of the corpus would quickly go down, because you cannot compare evaluations over time.
the MathComp corpus is a snapshot based on MC version 1.9.0 and Coq 8.10.2. It's unrelated to any packages and so on, and frozen in time. If we need a "refreshed" corpus at some future point, we're going to build it from scratch. In addition, the corpus was not designed to be a benchmark for automation.
nevertheless, the projects that are part of the corpus still exist. My proposal for a re-evaluation of proof automation techniques would mean:
@Lasse did you try Tactician on CoRN/math-classes? SF?
@Karl Palmskog Okay, fair enough. It seems to me that a static corpus available for many Coq versions would have the most utility. But it would be a lot of work to maintain...
@Bas Spitters No not yet. I'll try to do a more comprehensive eval when I make a 1.0 release
This topic was moved here from #Miscellaneous > Zulip stream on AI and hammers? by Théo Zimmermann
Karl Palmskog said:
Lasse said:
There is a comparison with CoqHammer in this paper https://doi.org/10.29007/wg1q . Note that this uses the data of the original CoqHammer paper, so the numbers might have changed a little bit
Yes, that's why I'm asking. The CoqHammer from the original CoqHammer paper is very different from the current CoqHammer.
The reconstruction backend changed substantially, and it can now also be used independently as the "sauto" tactic (which I would compare to "auto"/"blast" in Isabelle/HOL, in terms of how it's supposed to be used and how strong the tactic is).
However, the other components (premise selection, translation to ATPs) changed very little, and they now seem to be the main bottleneck. Perhaps the largest change to "hammer", aside from the new reconstruction backend, is support for CVC4 and a different "greedy sequence". But I wouldn't actually expect that much of a difference from the first version of CoqHammer as far as the "hammer" tactic alone is concerned.
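(For comparison, a rough sketch of how the full hammer tactic is invoked, assuming coq-hammer and at least one external ATP are installed; whether this particular goal is actually found depends on the setup.)

Require Import List.                 (* so that app_nil_r and friends are in scope for premise selection *)
From Hammer Require Import Hammer.   (* the full hammer: premise selection + external ATPs *)

Lemma app_nil_r' (A : Type) (l : list A) : l ++ nil = l.
Proof.
  (* hammer queries the ATPs and, on success, reconstructs the proof with
     tactics such as sauto/hauto, printing the tactic to paste back in *)
  hammer.
Qed.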
when I used the initial CoqHammer, reconstruction failures were quite common, so I still believe the bottom line (number of successful proofs) will be quite different between CoqHammer for 8.5/8.6 and the latest release for 8.12. Moreover, I believe we now also have SMTCoq to compare with.