Stream: Miscellaneous

Topic: Measuring proof assistant ecosystems


view this post on Zulip Michael Soegtrop (Jan 04 2021 at 11:42):

I think it may be time to document and perform some definitive ecosystem measurements, so it becomes less subjective.

Yes, I also think that this is called for. I think for a small number of languages, like Coq, HOL/Isabelle and LEAN it would be feasible to also take language specific eco system information into account - something which is not possible when comparing 100s of languages.

So how do we split up the work?

view this post on Zulip Karl Palmskog (Jan 04 2021 at 11:50):

I would start by just trying to figure out what data sources to use. Quite a few big projects are not on GitHub

view this post on Zulip Michael Soegtrop (Jan 04 2021 at 14:05):

For Coq how about inspecting the Coq opam archive? Are you aware of subsnatial developments which are not in opam?

view this post on Zulip Karl Palmskog (Jan 04 2021 at 14:56):

union of extra-dev and released? Even then, tons and tons of Coq code will be missing

view this post on Zulip Karl Palmskog (Jan 04 2021 at 14:57):

as I think we have discussed before, there is no obvious incentive to add a project to released, and few know about extra-dev

view this post on Zulip Karl Palmskog (Jan 04 2021 at 15:07):

another angle is looking at formally published and grey literature that references Coq artifacts.

view this post on Zulip Michael Soegtrop (Jan 04 2021 at 17:24):

another angle is looking at formally published and grey literature that references Coq artifacts.

So you start with conference proceedings and crawl through the papers? I wonder how this could be done without getting too expensive.

view this post on Zulip Karl Palmskog (Jan 04 2021 at 17:36):

at least this is reproducible over time, given keywords and publication databases and crawling procedure

view this post on Zulip Karl Palmskog (Jan 04 2021 at 17:37):

GitHub crawling has similar issues, but is another option (also GitLab crawling?)

view this post on Zulip Karl Palmskog (Jan 05 2021 at 18:36):

to take one completely random example, here is a repo with ~14k LOC Coq code not in opam (despite all its dependencies having releases in released): https://github.com/Blaisorblade/dot-iris (ping @Paolo Giarrusso ) :wink:

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 19:08):

Indeed — it even has an opam package (https://github.com/Blaisorblade/dot-iris/blob/master/opam) and a Zenodo artifact (there’s a badge in the README!) so I could probably add it, except I wouldn’t think it _should_ be in an opam repository.

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 19:10):

if there’s a place appropriate for “Coq artifacts of random POPL/... papers”, I can submit it there.

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 19:14):

in fact, the amount of packaging work is likely overkill, compared to predecessor work (https://github.com/TiarkRompf/minidot and https://github.com/samuelgruetter/dot-calculus) — PL formalizations seem intrinsically next to impossible to reuse for typical PL work, since usually no two papers work on the same syntax (CompCert and VST being exceptions)

view this post on Zulip Karl Palmskog (Jan 05 2021 at 19:32):

the opam repo is not only for reuse, but also for studying ecosystem size, machine learning datasets, etc. There is no special section/category for POPL artifacts, all Coq code releases can be added (Computer Science/Semantics and Compilation/Semantics would probably be appropriate as category for dot-iris)

view this post on Zulip Karl Palmskog (Jan 05 2021 at 19:33):

but as I suspected it would, this illustrates my point about released being a small fraction of the full Coq ecosystem

view this post on Zulip Karl Palmskog (Jan 05 2021 at 19:37):

if we were indexed by dblp and had an ISSN and had refereeing of submissions and a citation style, I believe things would be different...

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 19:38):

_I_ would fear if opam scales to so many packages, but I don't mind. But even for Iris, https://github.com/coq/opam-coq-archive/tree/master/released/packages has a subset of https://gitlab.mpi-sws.org/iris/opam/-/tree/master/packages (only iris, iris-string-ident and stdpp, + autosubst), which is a subset of the Iris ecosystem.

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 19:39):

I'd suggest to fix the policies on https://github.com/coq/opam-coq-archive, except there's nothing on this question.

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 19:40):

crawling the literature seems easier, but you'd need to do it for Isabelle as well?

view this post on Zulip Karl Palmskog (Jan 05 2021 at 19:40):

fix what policies? The dblp + reviewing thing will not happen. If something in this direction happened, it would be an overlay of opam-coq-archive, like we want the Platform to be

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 19:41):

nah, coq-released could just say these packages are welcome. Unlikely to make much difference.

view this post on Zulip Karl Palmskog (Jan 05 2021 at 19:42):

I mean, everybody is working in their small domain, so if we say <domain X> is welcome to submit packages, that will be a long list

view this post on Zulip Karl Palmskog (Jan 05 2021 at 19:45):

the thing about Isabelle is that most code outside of AFP is effectively dead and not likely to be revived. To my knowledge, the documentation of how to port from say, Isabelle-2010 to Isabelle-2020 is marginal

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 21:23):

"packages are welcome even for paper artifacts" or "for formalizations that aren't meant to be depended upon".

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 21:24):

OTOH, if we're talking of paper artifacts, "dead" does not seem a real problem, as long as VMs exist.

view this post on Zulip Karl Palmskog (Jan 05 2021 at 21:25):

well, the project is dead in the sense it will (likely) not be updated by its maintainer/creator. There is a different dead, which might be: "impossible to use/inspect".

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 21:28):

Hence the VMs. If you can't just install Isabelle 2010 (?)

view this post on Zulip Karl Palmskog (Jan 05 2021 at 21:33):

for the released repo, probably something more like this?

We welcome pull requests for all Coq-related packages that are compatible with a released version of Coq. Besides libraries of general interest, this also includes paper artifacts and other specialized formalizations that are not expected to be immediately reusable by others.

view this post on Zulip Karl Palmskog (Jan 05 2021 at 21:39):

@Paolo Giarrusso so would that be enough to entice you to submit?

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 21:43):

sounds good; my track record is bad, but I'll try to submit this weekend. I'd maybe test that next on @Ralf Jung or @Robbert Krebbers to submit their other packages (they at least have opam packages).

view this post on Zulip Karl Palmskog (Jan 05 2021 at 21:45):

I guess I should also add some fluff about how the archive should represent and enable the diversity of the Coq ecosystem, etc., etc.

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 21:45):

to get significantly more submissions, you might need some call to the community — would getting more packages help survival/maintenance of Coq? That'd be a great reason

view this post on Zulip Karl Palmskog (Jan 05 2021 at 21:47):

in essence, this is about funding, competitiveness, enabling research and so on. Not exactly survival, but that would be a good story.

view this post on Zulip Karl Palmskog (Jan 05 2021 at 21:48):

your lack of packages is killing Coq development!

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 21:57):

Makes sense. But while the "we welcome" paragraph is fine for its repo, wouldn't you also want a more explicit "call to action for the community" elsewhere (discourse? coq-club? TYPES?)? Unless there's "sensitive" reasons not to :-).

view this post on Zulip Paolo Giarrusso (Jan 05 2021 at 21:58):

(I'd consider also involving other communities to get a count people would consider fair — also not my call, and I don't have skin in this game these days)

view this post on Zulip Karl Palmskog (Jan 05 2021 at 21:59):

if we do some "call for packages", this will be a separate thing from ecosystem measurements

view this post on Zulip Karl Palmskog (Jan 05 2021 at 22:06):

cynically, what is needed for successful call is (1) leveraging famous people's names and (2) leveraging social proof

view this post on Zulip Karl Palmskog (Jan 05 2021 at 22:08):

do what FAMOUS CS PROFESSOR, WINNER OF AWARD does, submit YOUR package to the archive for the 2013 WINNER OF ACM SOFTWARE SYSTEM AWARD, the NOBEL PRIZE OF SOFTWARE

view this post on Zulip Guillaume Claret (Apr 11 2021 at 12:01):

Coming late to the discussion, but indeed the opam repository is also there for paper artifacts I think

view this post on Zulip Guillaume Claret (Apr 11 2021 at 12:02):

Indeed, as we continuously test that packages still compile, and fix the Coq version dependency if needed, it can be viewed as a way to provide a VM for packages

view this post on Zulip Guillaume Claret (Apr 11 2021 at 12:03):

To make sure that something which built at some point will still build in a few years

view this post on Zulip Guillaume Claret (Apr 11 2021 at 12:05):

One can view the opam repository as a "library" or "musem" to preserve proofs, in addition of a way to share active developments I guess

view this post on Zulip Guillaume Claret (Apr 11 2021 at 12:05):

Probably this should be advertised more in this sense then

view this post on Zulip Michael Soegtrop (Apr 12 2021 at 07:41):

I plan to do such advertising in the Coq Platform soon. The idea is that the Coq Platform allows to install old versions of Coq and Platform components in parallel to a current one (in a separate opam switch). The main point of this is to make it easier to dig into older artifacts and/or port them to the latest version of Coq. I plan to go back to when I started with the Coq Windows installer (8.6).

view this post on Zulip Michael Soegtrop (Apr 12 2021 at 07:42):

(and Coq Platform is fully opam based - which is also advertised :-)

view this post on Zulip Michael Soegtrop (Apr 12 2021 at 07:43):

In the future the Coq Platform might become a common base for paper artifacts.

view this post on Zulip Clément Pit-Claudel (Apr 13 2021 at 04:34):

Michael Soegtrop said:

For Coq how about inspecting the Coq opam archive? Are you aware of subsnatial developments which are not in opam?

To add a data point: most of the projects I work with are not on OPAM; it's probably in the order of hundreds of thousands of lines of Coq code. But they're all on Github.

view this post on Zulip Michael Soegtrop (Apr 13 2021 at 07:13):

@Clément Pit-Claudel : can you post 10 or 20 links to projects which are in GitHub but not in opam? It would be interesting to see if GitHub language queries find them all.

Do you think we should work harder to bring these projects into opam?

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 07:45):

@Michael Soegtrop: A GitHub search query will never give you back more than 1000 results. So I doubt you can find them all with that approach.

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 07:46):

Do you think we should work harder to bring these projects into opam?

IMHO a serious issue with having opam as a central reference for all Coq code is that too many people are just not interested in making releases of their projects.

view this post on Zulip Matthieu Sozeau (Apr 13 2021 at 08:46):

Couldn't we automate their distribution as opam packages in a specific repo "coq-bazar" or similar ? For open-source projects on github I guess there is no IP issue.

view this post on Zulip Matthieu Sozeau (Apr 13 2021 at 08:46):

But finding out which Coq version to use might be a challenge

view this post on Zulip Michael Soegtrop (Apr 13 2021 at 09:09):

@Théo Zimmermann : there is a programming language usage statistics from RedMonk (https://redmonk.com/sogrady/2021/03/01/language-rankings-1-21/) - they do it somehow in this way. I think I have once seen the scripts online ... Coq is at about 20% quantille Github and 40% quantille Stackoverflow.

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 09:10):

Oh, I wasn't saying that you cannot retrieve all public data from GitHub, just that using GitHub search is not the right way to do this.

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 09:11):

In the case of RedMonk, I don't know how they do it, because if they only care about the number of repositories, there might be an easy way (and GitHub search might be a way, as it will give you the total number of results even if you only get access to the first 1000 results).

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 09:12):

FWIW, there is something called GHTorrent which archives most of GitHub activity data for research purposes.

view this post on Zulip Michael Soegtrop (Apr 13 2021 at 09:12):

@Théo Zimmermann : well I guess you can do bisection tricks - say combine the query with a project creation date, name, ... so that the number of results gets below 1000 and keep bisecting. I guess by project creation date would work well.

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 09:12):

There are a few other similar datasets.

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 09:13):

Michael Soegtrop said:

Théo Zimmermann : well I guess you can do bisection tricks - say combine the query with a project creation date, name, ... so that the number of results gets below 1000 and keep bisecting. I guess by project creation date would work well.

Indeed, this is the kind of tricks I used to look for community organization in my PhD thesis.

view this post on Zulip Michael Soegtrop (Apr 13 2021 at 09:14):

So you already have the scripts :-)

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 09:14):

See the explanations at the end of 7.4.4 in https://www.irif.fr/_media/users/theo/memoirthesis.pdf.

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 09:14):

And all my scripts are available at https://github.com/Zimmi48/thesis

view this post on Zulip Théo Zimmermann (Apr 13 2021 at 09:15):

In this case, this is the one: https://mybinder.org/v2/gh/Zimmi48/thesis/master?filepath=notebooks%2Fcommunity-orgs%2FCommunity_organizations.ipynb

view this post on Zulip Clément Pit-Claudel (Apr 14 2021 at 07:37):

We can also just ask people we know at MSR now that Microsoft owns Github. They might be able to give us the data for research purposes?

view this post on Zulip Théo Zimmermann (Apr 14 2021 at 07:40):

Just so you know, GitHub is pretty open to the use of their data for research purposes and it's already used a lot in many empirical software engineering papers. Besides, GH Torrents, there are a few other compiled GitHub datasets hosted externally.


Last updated: Aug 14 2022 at 12:03 UTC