@Lasse any plans to test Tactician against/with the latest CoqHammer and/or the sauto tactic? https://github.com/lukaszcz/coqhammer-eval
There is a comparison with CoqHammer in this paper https://doi.org/10.29007/wg1q . Note that this uses the data of the original CoqHammer paper, so the numbers might have changed a little bit
Lasse said:
There is a comparison with CoqHammer in this paper https://doi.org/10.29007/wg1q . Note that this uses the data of the original CoqHammer paper, so the numbers might have changed a little bit
Yes, that's why I'm asking. The CoqHammer from the original CoqHammer paper is very different from the current CoqHammer.
Conclusion: It is more or less on equal footing with the hammer, depending on how you count
Ah, okay. Yes, that's still a todo. One of the issues is that CoqHammer's evaluation paradigm is slightly different from mine, and it is not immediately clear how fair it would be to compare them.
These days, a lot of goals can also be solved by CoqHammer's non-ATP sauto tactic (https://link.springer.com/chapter/10.1007/978-3-030-51054-1_3). Ideally, we would want a reproducibility study that tests all recent proof automation on a single large benchmark. Personally, I think the Stdlib perhaps should not even be included in such a benchmark, since it arguably doesn't resemble a real-world Coq project anymore.
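(For reference, a minimal sketch of how sauto is invoked on its own, assuming the coq-hammer-tactics package is installed; the lemma is just an illustrative toy goal, not something from a benchmark.)

From Hammer Require Import Tactics.  (* provides sauto and the other reconstruction tactics *)

Lemma chain (P Q R : Prop) : (P -> Q) -> (Q -> R) -> P -> R.
Proof.
  (* sauto runs entirely inside Coq, without external ATPs or premise selection *)
  sauto.
Qed.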
Yes, one long-term goal for me is to set up a testing farm that will automatically run all compilable opam projects on my tool, and preferably on others too. The problem with this is that it would require an enormous amount of resources.
@Lasse Guillaume Claret has done quite a lot of work in this direction without a large infrastructure: https://coq-bench.github.io/clean/Linux-x86_64-4.07.1-2.0.6/released/
But generally, I think it would be better if a benchmark were designed with clear principles for inclusion and exclusion. We designed a dataset of MathComp projects, but this was not for proof automation purposes: https://github.com/EngineeringSoftware/math-comp-corpus
But for Tactician one problem I first have to solve is ssreflect. There is no dedicated support for it yet, and without it the predictions are terrible.
@Lasse I believe coq-gym was developed as such a general testing suite. There were some strange issues with the way it harvested libraries, though.
As I understand it, coq-gym just harvested all compilable libraries from opam. I see no reason to use their dataset if we can also directly use opam
Some discussion here. BTW is anyone using this in practice?
https://coq.discourse.group/t/machine-learning-and-hammers-for-coq/303/15
Well... you might want to avoid their issues in harvesting from opam...
@Karl Palmskog: Yes, I was thinking of something similar to coq-bench. But an evaluation of machine learning techniques would be much more expensive, because you have to give the tools a lot of time to prove lemmas.
I have outlined what needs to be done to make opam harvesting fully automatable: https://github.com/coq/opam-coq-archive/issues/1158
I think the real burden is in the automation scripts, not the infrastructure issues in themselves, since even the Flyspeck project used the cloud to check their proofs
I don't immediately see a problem with the automation. opam should be able to automatically figure out what is installable on what Coq version, and then you just install everything, while having the appropriate environment variables set to make Coq do the evaluation. Am I missing something?
There is even an option in opam to ignore any errors during installation and just skip that package.
many machine learning solutions will need to instrument builds in various ways. They will need to consume the metadata and provide additional automation. This is why we provide https://coq.inria.fr/opam/coq-packages.json -- but it's not yet complete
we don't want people to have to modify opam itself or do their scripting on top of opam just to be able to fetch and build some Coq files
Ah, I see what you mean. My personal opinion is that there is a clear difference between end-user systems and AI systems; a direct comparison between them would probably be rather unfair. And it seems to me that automatically running all opam packages through systems that lack an end-user interface would take a lot of engineering (as you already said).
Tactician can be configured to learn from opam projects while they are being compiled through standard opam commands
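(For context, a rough sketch of what the in-file side looks like once a model has been learned; this assumes the coq-tactician package is installed and enabled, the lemma is only an illustration, and whether search actually finds a proof depends on what was learned.)

From Tactician Require Import Ltac1.  (* load Tactician's tactics in this file *)

Lemma orb_comm' (b1 b2 : bool) : orb b1 b2 = orb b2 b1.
Proof.
  (* suggest prints ranked tactic suggestions from the learned model;
     search tries to assemble a complete proof from them *)
  search.
Qed.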
For a fair comparison you would also need to think about how to reconcile systems that rely on training/testing/validation sets vs. systems that use a 'rolling' evaluation.
@Karl Palmskog With the MathComp corpus, do you provide it for all Coq versions from the past years, in such a way that there is no difference between the versions? If newer versions of a package differ from older ones, the utility of the corpus would quickly go down, because you cannot compare evaluations over time.
the MathComp corpus is a snapshot based on MC version 1.9.0 and Coq 8.10.2. It's unrelated to any packages and so on, and frozen in time. If we need a "refreshed" corpus at some future point, we're going to build it from scratch. In addition, the corpus was not designed to be a benchmark for automation.
nevertheless, the projects that are part of the corpus still exist. My proposal for a re-evaluation of proof automation techniques would mean:
@Lasse did you try Tactician on CoRN/math-classes? SF?
@Karl Palmskog Okay, fair enough. It seems to me that a static corpus available for many Coq versions would have the most utility. But it would be a lot of work to maintain...
@Bas Spitters No not yet. I'll try to do a more comprehensive eval when I make a 1.0 release
This topic was moved here from #Miscellaneous > Zulip stream on AI and hammers? by Théo Zimmermann
Karl Palmskog said:
Lasse said:
There is a comparison with CoqHammer in this paper https://doi.org/10.29007/wg1q . Note that this uses the data of the original CoqHammer paper, so the numbers might have changed a little bit
Yes, that's why I'm asking. The CoqHammer from the original CoqHammer paper is very different from the current CoqHammer.
The reconstruction backend changed substantially, and it can now also be used independently as the "sauto" tactic (which I would compare to "auto"/"blast" in Isabelle/HOL, in terms of how it's supposed to be used and how strong the tactic is).
However, the other components (premise selection, translation to ATPs) changed very little, and they now seem to be the main bottleneck. Perhaps the largest change to "hammer", aside from the new reconstruction backend, is support for CVC4 and a different "greedy sequence". But I wouldn't actually expect that much of a difference from the first version of CoqHammer as far as the "hammer" tactic alone is concerned.
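(For comparison, a rough sketch of how the full hammer tactic is invoked, assuming coq-hammer and at least one external ATP are installed; whether this particular goal is actually found depends on the setup.)

Require Import List.                 (* so that app_nil_r and friends are in scope for premise selection *)
From Hammer Require Import Hammer.   (* the full hammer: premise selection + external ATPs *)

Lemma app_nil_r' (A : Type) (l : list A) : l ++ nil = l.
Proof.
  (* hammer queries the ATPs and, on success, reconstructs the proof with
     tactics such as sauto/hauto, printing the tactic to paste back in *)
  hammer.
Qed.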
when I used the initial CoqHammer, reconstruction failures were quite common, so I still believe the bottom line (number of successful proofs) will be quite different between CoqHammer for 8.5/8.6 and the latest release for 8.12. Moreover, I believe we now also have SMTCoq to compare with.