Stream: Coq devs & plugin devs

Topic: fiat crypto broken


view this post on Zulip Gaëtan Gilbert (Dec 02 2021 at 16:14):

https://gitlab.com/coq/coq/-/jobs/1842886515 and several other jobs
cc @Jason Gross

view this post on Zulip Jason Gross (Dec 02 2021 at 16:25):

The error message is

COQNATIVE /builds/coq/coq/_build_ci/fiat_crypto/rupicola/bedrock2/bedrock2/src/bedrock2Examples/lightbulb.vo
ocamlopt.opt got signal and exited
Error: Native compiler exited with status 2 (in case of stack overflow,
       increasing stack size (typicaly with "ulimit -s") often helps)

I guess the fix is to increase the stack size? (Native compilation works fine on our CI, I believe.) Once https://github.com/mit-plv/bedrock2/issues/205 is resolved, we'll stop depending on this file, and then this won't be an issue (but I'm not sure how long that'll take)

view this post on Zulip Pierre-Marie Pédrot (Apr 22 2022 at 11:12):

It seems that fiat-crypto is broken on master, see e.g. https://gitlab.com/coq/coq/-/jobs/2356260838

view this post on Zulip Pierre-Marie Pédrot (Apr 22 2022 at 11:12):

It dies with

File "src/ExtractionOCaml/saturated_solinas.ml", line 1:
Error: The file /root/.opamcache/4.13.0+flambda/lib/ocaml/unix.cmxa is not a compilation unit description
Command exited with non-zero status 2

view this post on Zulip Pierre-Marie Pédrot (Apr 22 2022 at 11:12):

any idea about what's going on? (cc @Jason Gross )

view this post on Zulip Jason Gross (Apr 22 2022 at 11:27):

We invoke ocamlfind ocamlopt -package unix -linkpkg -w -20 -g -o because we need Unix.gettimeofday. This sounds like the opam distribution is broken?

view this post on Zulip Guillaume Melquiond (Apr 22 2022 at 11:44):

This error message can only occur if the OCaml magic number is incorrect. So, three possibilities:

view this post on Zulip Jason Gross (Apr 22 2022 at 13:26):

We have

library:ci-fiat_crypto:
  extends: .ci-template-flambda
  needs:
  - build:edge+flambda
  - library:ci-coqprime
  - plugin:ci-bignums
  - plugin:ci-rewriter

# We cannot use flambda due to
# https://github.com/ocaml/ocaml/issues/7842, see
# https://github.com/coq/coq/pull/11916#issuecomment-609977375
library:ci-fiat_crypto_ocaml:
  extends: .ci-template
  needs:
  - build:edge+flambda
  - library:ci-coqprime
  - plugin:ci-bignums
  - plugin:ci-rewriter
  - library:ci-fiat_crypto

So it does seem that somehow we're crossing flambda vs not?

view this post on Zulip Jason Gross (Apr 22 2022 at 13:28):

(The ci-fiat_crypto_ocaml is just invoking ocaml on some ml files that were produced by the fiat crypto step, so what should happen is that we do all the Coq stuff on edge+flambda, upload the artifacts, and then in library:ci-fiat_crypto_ocaml, set up the system to have the fiat-crypto generated files around, and also to have the non-flambda edge ocaml being the default/installed one.)

view this post on Zulip Guillaume Melquiond (Apr 22 2022 at 13:37):

Indeed, the fact that one extends ci-template-flambda while the other extends ci-template seems a bit dubious. (But I am not sure it suffices to explain the issue.)

view this post on Zulip Guillaume Melquiond (Apr 22 2022 at 13:48):

So, yes, it suffices to explain the issue. The job is run inside the 4.05.0 switch, but it uses the libraries from the 4.13.0+flambda switch.

view this post on Zulip Guillaume Melquiond (Apr 22 2022 at 13:57):

More precisely, Fiat Crypto asks Coq for OCAMLFIND. So, Coq answers with the one from the 4.13.0+flambda switch, which is obviously incompatible with the compiler from the current switch, i.e., 4.05.0.

view this post on Zulip Jason Gross (Apr 22 2022 at 13:57):

Ah, i guess that's a change I made recently. I'll go back to using the one in path.

view this post on Zulip Guillaume Melquiond (Apr 22 2022 at 13:59):

(deleted)

view this post on Zulip Jason Gross (Apr 22 2022 at 14:05):

https://github.com/coq/coq/pull/15940

view this post on Zulip Gaëtan Gilbert (Oct 05 2022 at 18:14):

@Jason Gross lightbulb is back, eg https://gitlab.com/coq/coq/-/jobs/3117535414

view this post on Zulip Gaëtan Gilbert (Oct 05 2022 at 18:16):

is running native for fiat-crypto really useful? I propose we disable it and save some runner time

view this post on Zulip Emilio Jesús Gallego Arias (Oct 05 2022 at 23:06):

The setup mixing the two compiler switches is actually not supported, what we have now is pretty brittle IMHO

view this post on Zulip Jason Gross (Oct 06 2022 at 00:38):

fiat-crypto explicitly invokes native_compute in some places where it is much faster (2x? 10x? I don't quite recall) than vm_compute. I've opened https://github.com/mit-plv/bedrock2/issues/282, I think the solution we want is to build bedrock2_ex with native compute on-demand. Can you remind me what has to be done to suppress coqnative from our side, only on bedrock2?

view this post on Zulip Jason Gross (Oct 06 2022 at 00:39):

@Emilio Jesús Gallego Arias if you're referring to depending on different compilers for fiat_crypto and fiat_crypto_ocaml, that's not related to this issue.

view this post on Zulip Jason Gross (Oct 06 2022 at 00:40):

Alternatively, making all of fiat-crypto be native-compute on demand seems fine

view this post on Zulip Michael Soegtrop (Oct 06 2022 at 08:35):

Interestingly it does work in Coq Platform, which does not enable native compute ...

view this post on Zulip Karl Palmskog (Oct 06 2022 at 08:37):

doesn't native_compute just fall back on vm_compute or similar if native_compute is not enabled?

view this post on Zulip Gaëtan Gilbert (Oct 06 2022 at 08:38):

yes

view this post on Zulip Gaëtan Gilbert (Oct 23 2022 at 15:44):

recent example https://gitlab.com/coq/coq/-/jobs/3214365920

COQC src/Curves/Weierstrass/AffineProofs.v
Finished transaction in 0.812 secs (0.804u,0.007s) (successful)
Command terminated by signal 9
src/Curves/Weierstrass/AffineProofs.vo (real: 70.57, user: 2.35, sys: 0.33, mem: 1336288 ko)
Makefile.coq:792: recipe for target 'src/Curves/Weierstrass/AffineProofs.vo' failed

view this post on Zulip Jason Gross (Oct 23 2022 at 16:28):

Why is it being mem-killed at 1336288 ko?

view this post on Zulip Jason Gross (Oct 23 2022 at 16:29):

Maybe it's a problem with par:, try setting COQEXTRAFLAGS="-async-proofs-j 1" in dev/ci/ci-fiat_crypto.sh?

view this post on Zulip Guillaume Melquiond (Oct 24 2022 at 06:26):

Jason Gross said:

Why is it being mem-killed at 1336288 ko?

One of the heuristics of the OOM killer is that each child process contributes half its footprint toward the footprint of its parent process. So, if the main Coq process uses 1GB and there are three children of 2GB each, then the main process actually weights 4GB. So, it will be the first to die, even if it isn't the worst offender. (But, in a sense, it is the worst offender. It should never have created such large children in the first place. The kernel can be quite vindictive when memory becomes sparse.)

view this post on Zulip Théo Zimmermann (Oct 26 2022 at 10:15):

FTR I've recently seen a different type of failure in fiat-crypto:

File "./src/Curves/Weierstrass/AffineProofs.v", line 49, characters 4-13:
Error:  (in proof commutative_group): Attempt to save an incomplete proof

Command exited with non-zero status 1
src/Curves/Weierstrass/AffineProofs.vo (real: 74.58, user: 128.66, sys: 1.81, mem: 1424652 ko)

(https://github.com/coq/coq/runs/9102466669)

view this post on Zulip Jason Gross (Oct 26 2022 at 12:21):

@Théo Zimmermann That is another avatar of the same issue (but is possibly worth separately reporting as a bug in Coq?). When a par: worker process is OOM-killed (sometimes? always?), without the main Coq process being OOM-killed, the main Coq process believes the worker completed successfully but did not solve any goals. Hence the Qed fails with "Attempt to save an incomplete proof".

view this post on Zulip Jason Gross (Oct 26 2022 at 12:21):

I'll report this

view this post on Zulip Jason Gross (Oct 26 2022 at 12:25):

Reported at https://github.com/coq/coq/issues/16732


Last updated: Feb 05 2023 at 23:30 UTC