I have a strange effect. The following file:
https://gitlab.inria.fr/gappa/gappa/-/archive/f53e105cd73484fc76eb58ba24ead73be502c608.tar.gz
seems to have changed its hash code. When I download it with wget and run openssl dgst -sha512 on the download, I used to get
b2c04d87b502fcab24573b6030e1ba3e2bd9cc7ae719367d12785e8bd91e43001f8d64e70d5ae515ddd9ec636d9c1ce89b54b18ae0796955b5c97b71fee5c957
but now I get
22b82333d0e135578843dcb0740a68a364a21c24dc061e9645ace2353fb8a9141722e5b4691d32d6c433064e3a393ee7ee44435ecd3de936ecb4901498f539de
I am quite sure about this because I now get CI errors from it (the old hash is in the opam package for gappa) and this used to work. E.g. the opam package with the old hash passed opam's CI, which I think is not possible when the hash is wrong.
If someone has a copy of the file f53e105cd73484fc76eb58ba24ead73be502c608.tar.gz (running the coq platform v8.13 script should download it but might delete it during a cleanup) with the old hash code (starting with b2c) around, please send it to me. I would like to understand how this could happen.
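For anyone who wants to check a local copy against the two digests above, here is a minimal Python sketch that should be equivalent to running openssl dgst -sha512 on the file. The function names are made up for illustration, and the path passed in is whatever the download was saved as.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives are not loaded at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with an expected hex string (case-insensitive)."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

Calling matches_expected on the downloaded tar.gz with the digest starting with b2c would tell whether the copy is the old one.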
It might just be due to an update of git on the server. For example, up to a few months ago, git archive was incorrectly parsing the compression level on the command line (e.g., -13 was parsed as -3).
Yes, I thought about this, but if something like this could happen, it should have shown up in opam before, since most opam packages reference tar.gz files with a sha hash sum. So I would really like to analyze the root cause. It might also be that I made a mistake, but then I don't understand how this could have slipped through all the layers of CI we have (opam, Coq Platform, ...).
I just grepped the whole released archive, and there are only 4 versions (out of 1750) that download an archive pointed to by a sha sum. And they were all uploaded by you. So, "most opam packages" is actually a very subjective notion here.
I guess we are not talking about the same thing here. Pretty much all of the release coq opam files (I count 1376 out of 1411) refer to a tar.gz file and use a checksum (either md5, sha256 or sha512). What I am talking about is that this checksum changes although the .tar.gz URL remained the same.
I guess you are referring to URLs being themselves commit hash references. I agree that this is bad practice - I sometimes do this when I need a patch on top of a tag or maybe in error - but this doesn't have anything to do with the problem at hand.
It has everything to do with the problem at hand. You are complaining that https://gitlab.inria.fr/gappa/gappa/-/archive/f53e105cd73484fc76eb58ba24ead73be502c608.tar.gz, which is a file referred to by a commit hash, has a different checksum than before. This file is generated on the fly by the server, so it depends on the version of gitlab, git archive, zlib, tar, and a few other packages. So, you cannot expect its checksum to stay the same over time, even if the files it contains have not changed.
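The general effect is easy to reproduce in isolation. The following Python sketch compresses the same bytes twice with different gzip settings (compression level and header timestamp) and compares checksums. It only illustrates that identical contents can yield different archive bytes; it makes no claim about what the GitLab server actually changed, and the payload is made up.

```python
import gzip
import hashlib

# Stand-in for an unchanged source tree (hypothetical payload).
payload = b"identical source tree contents\n" * 200

# Same input, different compression level and gzip header timestamp.
archive_a = gzip.compress(payload, compresslevel=9, mtime=0)
archive_b = gzip.compress(payload, compresslevel=1, mtime=1)

# The decompressed contents are byte-for-byte identical ...
same_content = gzip.decompress(archive_a) == gzip.decompress(archive_b)
# ... but the compressed files, and hence their checksums, are not.
same_checksum = (
    hashlib.sha512(archive_a).hexdigest() == hashlib.sha512(archive_b).hexdigest()
)

print("same decompressed content:", same_content)
print("same archive checksum:", same_checksum)
```

Here the contents compare equal while the checksums differ, which is the situation a pinned sha512 in an opam file cannot tolerate.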
So your assumption is that if I use a tag instead of a hash the file is not created on the fly? Say
https://github.com/math-comp/math-comp/archive/mathcomp-1.12.0.tar.gz
Do you have some evidence for this assumption? It is imaginable that github stores the archives for tags, but I don't find it very plausible that there is a difference between using a tag and a commit hash.
I understand that there are some references to apparently static tar files, e.g.
https://gforge.inria.fr/frs/download.php/file/38383/coquelicot-3.2.0.tar.gz
but I would say this is a minority. Most opam packages (in Coq released and opam main) refer to git tar files for tags.
My assumption is that archives for tags are cached, and archives for random commits are not (or with a very short lifetime).
This is possible, but I still don't buy this. E.g. it is a good question if such caches would survive a server reconfiguration which changes the cache contents.
Anyway, I will review all packages with hashes to see if they can be replaced with tags.
Re "majority of packages", is it valid to use GitHub release tarballs in opam?
I suppose at least GitHub releases are tested to preserve their hash at the byte level, since pretty much the entire Internet relies on that, but I’d never wondered.
Hard to tell if releases, tags and commits are handled differently. E.g. the URLs of releases are no different from the URLs of tags. Maybe github does something special like storing release tarballs, maybe they have CI to test that the hashes stay constant, maybe it is just luck that nothing happened so far.
Paolo Giarrusso said:
I suppose at least GitHub releases are tested to preserve their hash at the byte level, since pretty much the entire Internet relies on that, but I’d never wondered.
Is that so? I would have thought otherwise. For instance, as far as I know, Npm.js does not bother with the checksums of Github releases, since the source files are not coming from there anyway. Same for Debian, Fedora, and so on. Even for Opam, the source files are now hosted on Opam's servers. (So, you are actually checking that the server is sending you the original file, not that the current one still has the same checksum.)
@Guillaume Melquiond : Unfortunately for Windows compatibility reasons I am bound to opam 2.0.7 with the platform, which seems to download the original sources. Do you know in which version the caching you mentioned was introduced?
It has been there for a very long time. (Note that I am talking about the official Opam server, not the one we use for Coq. We have not enabled the cache for it.) For example, the file you were talking about should be available as https://opam.ocaml.org/cache/sha512/b2/b2c04d87b502fcab24573b6030e1ba3e2bd9cc7ae719367d12785e8bd91e43001f8d64e70d5ae515ddd9ec636d9c1ce89b54b18ae0796955b5c97b71fee5c957
or something akin to it.
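Judging only from the URL quoted above, the cache path appears to be the hash algorithm, then the first two hex characters of the digest, then the full digest. A small Python sketch under that assumption (the helper name and the layout are inferred from this single example, not taken from opam documentation):

```python
# Base URL taken from the message above.
OPAM_CACHE_BASE = "https://opam.ocaml.org/cache"


def opam_cache_url(hex_digest, algo="sha512"):
    """Build a cache URL as <base>/<algo>/<first two hex chars>/<digest>.

    Assumption: the layout is inferred from one example URL and may not
    hold for every entry or every opam server.
    """
    hex_digest = hex_digest.strip().lower()
    if len(hex_digest) < 2:
        raise ValueError("digest too short")
    return f"{OPAM_CACHE_BASE}/{algo}/{hex_digest[:2]}/{hex_digest}"
```

Passing the old gappa digest (the one starting with b2c) reproduces the URL given above.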
@Guillaume Melquiond : cool thanks!
The difference between the old and new tar.gz file for gappa is that the old one does contain .gitignore files, the new one doesn't. Otherwise they are content-wise identical - hard to tell if e.g. the compression is different. The two URLs are:
new: https://gitlab.inria.fr/gappa/gappa/-/archive/f53e105cd73484fc76eb58ba24ead73be502c608.tar.gz
old: https://opam.ocaml.org/cache/sha512/b2/b2c04d87b502fcab24573b6030e1ba3e2bd9cc7ae719367d12785e8bd91e43001f8d64e70d5ae515ddd9ec636d9c1ce89b54b18ae0796955b5c97b71fee5c957
Any idea what might have caused this? As you said a change in the INRIA gitlab server config?
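A comparison like the one described above can be sketched in Python with the standard tarfile module. This is an illustrative reconstruction, not the method actually used here; the function names are hypothetical, and dropping the first path component assumes each archive wraps its files in a single top-level directory.

```python
import io
import tarfile


def member_names(archive_bytes, strip_components=1):
    """Return the set of regular-file paths in a tar.gz given as bytes.

    Assumption: the first `strip_components` path components are a
    wrapper directory and are dropped before comparing.
    """
    names = set()
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")[strip_components:]
            if parts and member.isfile():
                names.add("/".join(parts))
    return names


def diff_archives(old_bytes, new_bytes):
    """Return (paths only in old, paths only in new), each sorted."""
    old, new = member_names(old_bytes), member_names(new_bytes)
    return sorted(old - new), sorted(new - old)
```

Applied to the two downloads, the first list would be expected to contain the .gitignore files and the second to be empty. Comparing file contents as well would need an extra pass over the common members.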
Not sure. I added .gitignore to .gitattributes recently. Theoretically, git archive is supposed to read the .gitattributes file from the given commit. But, who knows what Gitlab does? Perhaps it uses the file from the default branch if the given commit does not provide it. Or perhaps it is using info/attributes instead of .gitattributes, which would certainly make things simpler for bare repositories.
I see - interesting. Well at least we have an explanation for why this happened to gappa and no other repo, although as you say one wouldn't expect that changing .gitattributes would affect previous commits. Do you think we should file a bug report to gitlab or the INRIA gitlab maintainers?
P.S.: I swapped the does / doesn't contain and had the wrong old URL in my previous post - I edited to be correct.
Last updated: Jun 03 2023 at 05:01 UTC