The Isabelle community is organizing a competition. Any experiences from last year? Is it fair? Why did Isabelle win?

https://www21.in.tum.de/~wimmers/proofground/

In my experience, it was reasonably fair, though it would be nice to see some challenge problems offered by e.g. the Coq community. If I recall correctly, the big edge was formalization experience, and standard library maturity, albeit for relatively unsophisticated mathematics, with a single exception: a certain theorem about arbitrary sets and functions which was significantly easier to prove classically, and so advantaged provers based on classical logic. Even adding EM as an axiom only marginally helps, since then one needs to prove lemmas based on it, losing precious time.
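To make the last point concrete, here is a minimal Coq sketch (using the standard library's `Classical_Prop`; the lemma name `dne` is illustrative, and this particular fact already ships as `NNPP`):

```coq
(* Coq's standard library lets you assume excluded middle as an axiom:
   classic : forall P : Prop, P \/ ~ P *)
Require Import Coq.Logic.Classical_Prop.

(* But each classical fact still has to be derived from it by hand,
   which is where the contest time goes. *)
Lemma dne (P : Prop) : ~ ~ P -> P.
Proof.
  intros nnp.
  destruct (classic P) as [p | np].
  - exact p.
  - contradiction.
Qed.
```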

Time was the obvious bottleneck, and Hammer-style proof dispatch was pretty intimidating from the Isabelle side. I do wish there had been more ssreflect hackers, since I wonder if they have tricks to make proofs faster.

seems obvious that a competition organized by the Isabelle community will be won by that same community. Do they have any rules about challenge problems? One could easily construct problems that are orders of magnitude easier to solve in CIC/cubical, but how would HOL users view those problems (probably as unfair)?

was (the whole of) MathComp even available to Coq users? (Since Lean users apparently had the whole of Mathlib)

Hi there, I'm one of the organizers of Proof Ground. We had a Call for Problems which we posted on the Coq mailing list, and two problem submissions in Coq made it into the competition. We tried hard to make it fair between the provers, and people from the Coq and Lean communities helped us translate the tasks.

We are always happy to hear ideas for new tasks, and there will be an edition with next year's ITP, hopefully with an on-site contest in Rome.

The perennial problem when devising fair tasks is that we have to take the intersection of the material that is conveniently available in all (three) provers. That makes it even harder to come up with tasks that are interesting, fair, and solvable in under two hours.

Karl Palmskog said:

was (the whole of) MathComp even available to Coq users? (Since Lean users apparently had the whole of Mathlib)

what libraries would you like to have available in such a competition? We definitely have to add more there to be fair against Lean's mathlib and the material in Isabelle's distribution.

@maximilian p.l. haslbeck the "standard sets of libraries" for Coq is huge, probably running about a million lines of code.

we are trying to define the "Coq Platform" which includes nearly all most-used libraries/projects, but this will not be generally available for a while

so, include everything?

ah, yeah, at least people who want to participate could tell us which libraries they want, and then we can add them

right, but this is going to be a big problem as Coq evolves

so if you want Proof Ground-like events to be recurring, probably the only viable solution for Coq is to tie it to the Coq Platform (which will have regular releases after every major Coq version)

@maximilian p.l. haslbeck but generally, how can one have an even playing field for such different logics? As I wrote earlier in the thread, depending on the problem, either Isabelle/HOL+Sledgehammer may have an incredible advantage or a definition may be impossible to express at all in HOL

yeah, of course Isabelle does not have dependent types, so lots of interesting problems can't be posed in all provers, that's for sure. I guess we have to use the common denominator.

I think I know too little about the Coq Platform, but if the community agrees we can make it the standard for Proof Ground.

but if we go lowest common denominator for problems, CIC-based systems (Lean, Coq) would be disadvantaged compared to Isabelle automation, so what would be the incentive to compete directly?

To me, the only obvious way to attract CIC-based users would be a different "weight class" with problems where dependent types actually pay off

One goal of Proof Ground is to find out about the strengths and weaknesses of the various provers (Isabelle has no dependent types :) quite a lesson), and also to learn from each other.

see here for more about Coq platform: https://github.com/MSoegtropIMC/coq-platform/blob/master/charter.md

An important clarification about the platform: it doesn't exist yet. But it likely will in time for the next Proof Ground.

I like the idea of weight classes, but Proof Ground rather tries to bring people together.

okay, I will come back to the Coq Platform in a while, then. We will definitely announce information about the next edition in advance.

But in the meantime, our platform is free for any community to use. If you want to host a contest (involving problems tailored to CIC), feel free to. We will be happy to host it.

I think the strengths and weaknesses are quite well known in general. For example, let's say you have some problems with DFAs/NFAs. It is well known that this will force you to carry around a finiteness condition in Isabelle, and that it can be done easily with finite types in Coq without much extra work.
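For the record, a hedged MathComp-style sketch of that contrast (field names are illustrative; the reglang library has a polished version of this definition):

```coq
From mathcomp Require Import all_ssreflect.

(* The state space is a finType, so finiteness is carried by the type
   itself: no separate "finite S" hypothesis has to be threaded
   through every lemma, as it would be in Isabelle/HOL. *)
Record dfa (char : finType) := DFA {
  dfa_state  : finType;
  dfa_start  : dfa_state;
  dfa_accept : {set dfa_state};
  dfa_step   : dfa_state -> char -> dfa_state
}.
```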

We can very much add a different domain than http://competition.isabelle.systems/ , if you prefer :)

It might also be interesting to have larger formalizations, at a scale where HOL and CIC will show significant differences in the definitions. You could also reward people who solve a problem in two different provers

I also find it interesting to think about how to measure the "value" of a proof: not only that it goes through, but also its beauty, measured maybe by length, or something else?

In the Isabelle zulip we started a stream about Proof Ground. I invite you to join the discussion there. It would be nice to share streams across zulip servers, is that possible?

I don't disagree with the competition concept in general, but having a competition means that winning is a large part of the motivation. If the problems are not suited to your logic, that will become a point of contention. As an analogy, intuitionistic and classical logic have completely different benchmark sets (ILTP vs. TPTP), because experts know you can't do well (win the competition) in both.

maximilian p.l. haslbeck said:

In the Isabelle zulip we started a stream about Proof Ground. I invite you to join the discussion there. It would be nice to share streams across zulip servers, is that possible?

There might be a way to create a bridge where a bot would cross-post messages to the other Zulip.

another analogy: some of my earlier colleagues regularly participated in programming competitions. In these competitions, one is not only judged for correctness (on a hidden test dataset) but also execution speed. Since compilation time doesn't count, everybody uses C++ and precomputes as much as possible with template metaprogramming. Anyone who doesn't use C++ will always lose, so nobody uses Java, Python, etc.

If it’s true that an Isabelle-organized competition is necessarily biased, even if unintentionally, some combo of <joining the organization, or organizing another one> might be a good fix. @maximilian p.l. haslbeck is even offering the platform they built for this, so kudos I guess :-)

in practice, I guess you want some form of peer-review for the competition criteria, probably by co-organizing (or maybe in an actual paper, but I am not sure that’s realistic)

Karl Palmskog said:

maximilian p.l. haslbeck but generally, how can one have an even playing field for such different logics? As I wrote earlier in the thread, depending on the problem, either Isabelle/HOL+Sledgehammer may have an incredible advantage or a definition may be impossible to express at all in HOL

Certainly if HOL is crushing Coq on these problems, it should be considered as an open challenge for Coq tooling to improve. Which it has, of course, dramatically, but it's good to have the side-by-side. I don't think the expressiveness gap is that huge.

Cody Roux said:

Certainly if HOL is crushing Coq on these problems, it should be considered as an open challenge for Coq tooling to improve. Which it has, of course, dramatically, but it's good to have the side-by-side. I don't think the expressiveness gap is that huge.

Are you familiar with how Sledgehammer/Metis works and with the HOL foundations? Functions in HOL don't have to be computable or fully defined, and classical logic and classical choice are *built into* Isabelle automation. CoqHammer, in contrast, is designed around intuitionistic logic from the start. Would you ask a classical and an intuitionistic mathematician to compete to prove the same theorems in real analysis?

with a single exception: a certain theorem about arbitrary sets and functions which was significantly easier to prove classically, and so advantaged provers based on classical logic. Even adding EM as an axiom only marginally helps, since then one needs to prove lemmas based on it, losing precious time.

It's entirely fair to have a separate category for intuitionistic theorems; those are stronger. I'm not sure whether Coq _needs_ to be worse for classical proofs...

the eval for the new `sauto` tactic for CoqHammer suggests it was around twice as good on intuitionistic problems as on classical ones

Shameless plug `ω+1`: in Lean's mathlib library, the axiom of choice and excluded middle are ubiquitous. It does make for a very different experience from using Coq. It's just a matter of having the relevant theorems and the automation being able to use them

Is that a matter of focus/design choices of `sauto`, or does it have to do with Coq itself? I expect both.

Hm, how did Lean do the first time? I don't see results under https://www21.in.tum.de/~wimmers/proofground2019/

Floris and I came in second

But still, Manuel and Peter left us in the dust

There's a section Results where we're listed:

Results

Proof Ground 2019 is over! The winning teams are:

- Manuel Eberl and Peter Lammich (Isabelle/HOL)
- Floris van Doorn and Simon Hudon (Lean)
- Fabian Kunze (Coq)

doh, I'm blind

Isabelle and Coq are extremely different, but there are fewer technical reasons why Coq couldn't compete with Lean on classical logic. It's also fine if Coq doesn't care for it.

(OTOH, I thought suggesting that Lean could be better than Coq at classical logic was very unwelcome).

I think, as interesting as the ranking is, it feels like it's more about people's mastery of their tool than about how good the tool is. Floris and I are mathlib maintainers, Peter and Manuel are big Isabelle contributors as well, and I feel like a lot of the people there might have suffered from the handicap of "not using the tool every waking moment of every day"

Paolo G. Giarrusso said:

(OTOH, I thought suggesting that Lean could be better than Coq at classical logic was very unwelcome).

Do you feel like that's what I was saying? My point was that the Lean culture and the Coq culture are very different. People in Coq are very reluctant to rely on axioms, whereas in Lean we barely notice when the axiom of choice or excluded middle is involved in a proof

No no, sorry, that wasn't about you!

Oh, sorry. I'd like to know what you meant

[I clarified my subtweet separately]

On "mastery of the tool", the full ranking seems indeed suggestive — there are Isabelle users both at the top and at the bottom, so it's not just the tool indeed

https://www21.in.tum.de/~wimmers/proofground/slides/results.html#/final-leaderboard

Paolo G. Giarrusso said:

It's entirely fair to have a separate category for intuitionistic theorems; those are stronger. I'm not sure whether Coq _needs_ to be worse for classical proofs...

Not to belabor the point, or extend this discussion beyond its natural ending, but I feel that if you want to be classical, you need to embrace it like Isabelle or Lean do: in the kernel, but mostly in the library and the "hammer" tactics. Otherwise you're starting with a significant handicap. E.g. TAUT is co-NP-complete, whereas intuitionistic propositional logic is PSPACE-complete. Also, in practice, we're just better at ATP in classical settings.

One caveat would be that often, things are decidable, so you can act as if you had the excluded middle all along. ssreflect seems to adopt this philosophy, and that might be enough.
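A tiny ssreflect illustration of that philosophy (plain MathComp, no axioms; the lemma name is made up for the example):

```coq
From mathcomp Require Import all_ssreflect.

(* For boolean (i.e. decidable) statements, "excluded middle" is just
   case analysis on the bool; no classical axioms are involved. *)
Lemma bool_em (b : bool) : b \/ ~~ b.
Proof. by case: b; [left | right]. Qed.
```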

Honestly I hope some ssreflect hackers are going to participate in the next Proof Ground.

What does Lean have for classical stuff in the kernel?

Last updated: Mar 03 2024 at 15:01 UTC