I think it may be time to document and perform some definitive ecosystem measurements, so it becomes less subjective.
Yes, I also think that this is called for. I think for a small number of languages, like Coq, HOL/Isabelle and Lean, it would be feasible to also take language-specific ecosystem information into account - something that is not possible when comparing hundreds of languages.
So how do we split up the work?
I would start by just trying to figure out what data sources to use. Quite a few big projects are not on GitHub.
For Coq how about inspecting the Coq opam archive? Are you aware of substantial developments which are not in opam?
union of extra-dev and released? Even then, tons and tons of Coq code will be missing
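(For concreteness: counting that union mechanically is easy once you have a local clone of coq/opam-coq-archive. A minimal Python sketch, assuming the archive's usual `released/packages` and `extra-dev/packages` layout; the function names are made up:)

```python
from pathlib import Path

def package_names(repo_root, section):
    """Names of package directories under <section>/packages
    in a local clone of coq/opam-coq-archive."""
    pkg_dir = Path(repo_root) / section / "packages"
    if not pkg_dir.is_dir():
        return set()
    return {p.name for p in pkg_dir.iterdir() if p.is_dir()}

def ecosystem_union(repo_root):
    """Union of released and extra-dev: a rough upper bound
    on what the opam archive knows about."""
    return package_names(repo_root, "released") | package_names(repo_root, "extra-dev")
```

e.g. `len(ecosystem_union("opam-coq-archive"))` after cloning https://github.com/coq/opam-coq-archive - but as said, that still misses everything never submitted.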
as I think we have discussed before, there is no obvious incentive to add a project to released, and few know about extra-dev
another angle is looking at formally published and grey literature that references Coq artifacts.
So you start with conference proceedings and crawl through the papers? I wonder how this could be done without getting too expensive.
at least this is reproducible over time, given keywords and publication databases and crawling procedure
GitHub crawling has similar issues, but is another option (also GitLab crawling?)
to take one completely random example, here is a repo with ~14k LOC of Coq code not in opam (despite all its dependencies having releases in released): https://github.com/Blaisorblade/dot-iris (ping @Paolo Giarrusso) :wink:
Indeed - it even has an opam package (https://github.com/Blaisorblade/dot-iris/blob/master/opam) and a Zenodo artifact (there's a badge in the README!), so I could probably add it, except I wouldn't think it _should_ be in an opam repository.
if there’s a place appropriate for “Coq artifacts of random POPL/... papers”, I can submit it there.
in fact, the amount of packaging work is likely overkill, compared to predecessor work (https://github.com/TiarkRompf/minidot and https://github.com/samuelgruetter/dot-calculus) — PL formalizations seem intrinsically next to impossible to reuse for typical PL work, since usually no two papers work on the same syntax (CompCert and VST being exceptions)
the opam repo is not only for reuse, but also for studying ecosystem size, machine learning datasets, etc. There is no special section/category for POPL artifacts; all Coq code releases can be added (Computer Science/Semantics and Compilation/Semantics would probably be the appropriate category for dot-iris)
but as I suspected it would, this illustrates my point about released being a small fraction of the full Coq ecosystem
if we were indexed by dblp and had an ISSN and had refereeing of submissions and a citation style, I believe things would be different...
_I_ would fear for opam if it had to scale to so many packages, but I don't mind. But even for Iris, https://github.com/coq/opam-coq-archive/tree/master/released/packages has a subset of https://gitlab.mpi-sws.org/iris/opam/-/tree/master/packages (only iris, iris-string-ident and stdpp, plus autosubst), which is itself a subset of the Iris ecosystem.
I'd suggest fixing the policies on https://github.com/coq/opam-coq-archive, except there's nothing on this question.
crawling the literature seems easier, but you'd need to do it for Isabelle as well?
fix what policies? The dblp + reviewing thing will not happen. If something in this direction happened, it would be an overlay of opam-coq-archive, like we want the Platform to be
nah, coq-released could just say these packages are welcome. Unlikely to make much difference.
I mean, everybody is working in their small domain, so if we say <domain X> is welcome to submit packages, that will be a long list
the thing about Isabelle is that most code outside of the AFP is effectively dead and not likely to be revived. To my knowledge, the documentation on how to port from, say, Isabelle-2010 to Isabelle-2020 is marginal
"packages are welcome even for paper artifacts" or "for formalizations that aren't meant to be depended upon".
OTOH, if we're talking of paper artifacts, "dead" does not seem a real problem, as long as VMs exist.
well, the project is dead in the sense that it will (likely) not be updated by its maintainer/creator. There is a different kind of dead, which might be: "impossible to use/inspect".
Hence the VMs. If you can't just install Isabelle 2010 (?)
for the released repo, probably something more like this?
We welcome pull requests for all Coq-related packages that are compatible with a released version of Coq. Besides libraries of general interest, this also includes paper artifacts and other specialized formalizations that are not expected to be immediately reusable by others.
@Paolo Giarrusso so would that be enough to entice you to submit?
sounds good; my track record is bad, but I'll try to submit this weekend. I'd maybe test that next on @Ralf Jung or @Robbert Krebbers to submit their other packages (they at least have opam packages).
I guess I should also add some fluff about how the archive should represent and enable the diversity of the Coq ecosystem, etc., etc.
to get significantly more submissions, you might need some call to the community — would getting more packages help survival/maintenance of Coq? That'd be a great reason
in essence, this is about funding, competitiveness, enabling research and so on. Not exactly survival, but that would be a good story.
your lack of packages is killing Coq development!
Makes sense. But while the "we welcome" paragraph is fine for its repo, wouldn't you also want a more explicit "call to action for the community" elsewhere (discourse? coq-club? TYPES?)? Unless there's "sensitive" reasons not to :-).
(I'd consider also involving other communities to get a count people would consider fair — also not my call, and I don't have skin in this game these days)
if we do some "call for packages", this will be a separate thing from ecosystem measurements
cynically, what is needed for successful call is (1) leveraging famous people's names and (2) leveraging social proof
do what FAMOUS CS PROFESSOR, WINNER OF AWARD does, submit YOUR package to the archive for the 2013 WINNER OF ACM SOFTWARE SYSTEM AWARD, the NOBEL PRIZE OF SOFTWARE
Coming late to the discussion, but indeed the opam repository is also there for paper artifacts I think
Indeed, as we continuously test that packages still compile, and fix the Coq version dependency if needed, it can be viewed as a way to provide a VM for packages
To make sure that something which built at some point will still build in a few years
One can view the opam repository as a "library" or "museum" to preserve proofs, in addition to a way to share active developments, I guess
Probably this should be advertised more in this sense, then
I plan to do such advertising in the Coq Platform soon. The idea is that the Coq Platform allows installing old versions of Coq and Platform components in parallel with a current one (in a separate opam switch). The main point of this is to make it easier to dig into older artifacts and/or port them to the latest version of Coq. I plan to go back to when I started with the Coq Windows installer (8.6).
(and Coq Platform is fully opam based - which is also advertised :-)
In the future the Coq Platform might become a common base for paper artifacts.
Michael Soegtrop said:
For Coq how about inspecting the Coq opam archive? Are you aware of substantial developments which are not in opam?
To add a data point: most of the projects I work with are not on opam; it's probably on the order of hundreds of thousands of lines of Coq code. But they're all on GitHub.
@Clément Pit-Claudel : can you post 10 or 20 links to projects which are in GitHub but not in opam? It would be interesting to see if GitHub language queries find them all.
Do you think we should work harder to bring these projects into opam?
@Michael Soegtrop: A GitHub search query will never give you back more than 1000 results. So I doubt you can find them all with that approach.
Do you think we should work harder to bring these projects into opam?
IMHO a serious issue with having opam as a central reference for all Coq code is that too many people are just not interested in making releases of their projects.
Couldn't we automate their distribution as opam packages in a specific repo, "coq-bazar" or similar? For open-source projects on GitHub I guess there is no IP issue.
But finding out which Coq version to use might be a challenge
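Such a "coq-bazar" generator might look roughly like this - a Python sketch where every field value (the maintainer string, the build commands, and especially the Coq version bounds) is a placeholder guess; as said, determining the real version bounds, e.g. from the project's CI config, is the hard part:

```python
# Template for a minimal opam 2.0 package file; {{ }} are literal opam braces.
OPAM_TEMPLATE = """\
opam-version: "2.0"
synopsis: "{synopsis}"
maintainer: "coq-bazar (automated)"
homepage: "https://github.com/{owner}/{repo}"
dev-repo: "git+https://github.com/{owner}/{repo}.git"
build: [make "-j%{{jobs}}%"]
install: [make "install"]
depends: [
  "coq" {{>= "{coq_min}" & < "{coq_max}"}}
]
"""

def make_opam(owner, repo, synopsis, coq_min="8.12", coq_max="8.14~"):
    """Render a minimal opam file for a GitHub-hosted Coq project.
    The version bounds are a caller-supplied guess, not detected."""
    return OPAM_TEMPLATE.format(owner=owner, repo=repo, synopsis=synopsis,
                                coq_min=coq_min, coq_max=coq_max)
```

The assumed Makefile-based build is also a guess; real projects would need per-project build commands.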
@Théo Zimmermann : there are programming language usage statistics from RedMonk (https://redmonk.com/sogrady/2021/03/01/language-rankings-1-21/) - they do it somehow in this way. I think I once saw the scripts online ... Coq is at about the 20% quantile on GitHub and the 40% quantile on Stack Overflow.
Oh, I wasn't saying that you cannot retrieve all public data from GitHub, just that using GitHub search is not the right way to do this.
In the case of RedMonk, I don't know how they do it, because if they only care about the number of repositories, there might be an easy way (and GitHub search might be a way, as it will give you the total number of results even if you only get access to the first 1000 results).
FWIW, there is something called GHTorrent which archives most of GitHub activity data for research purposes.
@Théo Zimmermann : well I guess you can do bisection tricks - say combine the query with a project creation date, name, ... so that the number of results gets below 1000 and keep bisecting. I guess by project creation date would work well.
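The bisection can be sketched independently of any particular GitHub client: treat `count(a, b)` as a stand-in for the `total_count` of a search like `language:coq created:a..b` (which the API reports even though only the first 1000 hits are retrievable), and keep splitting date windows until each is small enough. A rough sketch, not tied to a real API:

```python
from datetime import date, timedelta

def bisect_ranges(count, start, end, limit=1000):
    """Split [start, end] into creation-date windows with <= limit results each.
    count(a, b) stands in for the total_count of a query such as
    `language:coq created:a..b`."""
    n = count(start, end)
    if n <= limit or start == end:
        return [(start, end, n)]
    mid = start + timedelta(days=(end - start).days // 2)
    return (bisect_ranges(count, start, mid, limit)
            + bisect_ranges(count, mid + timedelta(days=1), end, limit))
```

Each returned window can then be paged through normally, and the per-window counts sum to the overall total.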
There are a few other similar datasets.
Michael Soegtrop said:
Théo Zimmermann : well I guess you can do bisection tricks - say combine the query with a project creation date, name, ... so that the number of results gets below 1000 and keep bisecting. I guess by project creation date would work well.
Indeed, this is the kind of tricks I used to look for community organization in my PhD thesis.
So you already have the scripts :-)
See the explanations at the end of 7.4.4 in https://www.irif.fr/_media/users/theo/memoirthesis.pdf.
And all my scripts are available at https://github.com/Zimmi48/thesis
In this case, this is the one: https://mybinder.org/v2/gh/Zimmi48/thesis/master?filepath=notebooks%2Fcommunity-orgs%2FCommunity_organizations.ipynb
We can also just ask people we know at MSR now that Microsoft owns GitHub. They might be able to give us the data for research purposes?
Just so you know, GitHub is pretty open to the use of their data for research purposes and it's already used a lot in many empirical software engineering papers. Besides GHTorrent, there are a few other compiled GitHub datasets hosted externally.
Last updated: Dec 07 2023 at 06:38 UTC