Hi folks, Kate (of OCaml fame) has asked whether we could patch Coq to support OCaml 5.0. I think it is pretty easy; we need to handle:
- CObj.set_tag
- the module (or even nativelib, pick your poison), using a dune rule for OCaml 5.0 so Coq actually compiles
I don't have time this week, but it seems OCaml 5 is being released in two weeks, so help is welcome.
See also https://github.com/coq/coq/pull/15494
there are some other issues as well: https://github.com/coq/coq/projects/55
Just passing on, Kate (mentioned in this topic before, one of the OCaml opam repo maintainers) offered to help with OCaml 5.0 support.
if we want 5.0 support ASAP, maybe https://github.com/coq/coq/issues/13940 should be marked priority: high?
any news on that? The final release is on Friday
latest news seems to be here: https://github.com/coq/coq/wiki/Coq-Call-2022-11-30
OCaml 5.0.
- no-naked-pointers not an issue anymore if we disable native compilation
- retargeting native-compilation to use malfunction might be the way to go
maybe something to add to here: https://github.com/coq/coq/wiki/Coq-Call-2022-12-14
I think that Coq master almost compiles with OCaml 5.0, you just have to deactivate the deprecation warnings.
we should streamline the configure checks and such a bit, but it's reasonable to believe it will reach 8.17, I think
I'm afraid that current Coq + OCaml 5.0 is giving a consistent SEGV that I'm unable to debug; will try a different path though.
it works for me locally, though
Can you try dune build -p coq-core,coq-stdlib,coq?
The SEGV only shows up for me when compiling in release mode
See the CI in https://github.com/coq/coq/pull/15494
Seems related to some very tricky GC / memory behavior, because I can only get it on CI or in batch mode; running the faulty command by itself doesn't trigger the problem
(btw you can remove the commit that tweaks ephemerons in that PR, because it's useless now, and in addition it's modifying the API)
It's already reverted on that tree
I'm checking if that commit could be the culprit of the SEGV
ah, but just remove it from the list of commits then
To answer @kit-ty-kate: Coq's main branch should now compile with OCaml 5.0, the work we discussed was completed. How well it runs is a different story, it is basically untested. We should do a bench to start with.
Pierre-Marie Pédrot said:
ah, but just remove it from the list of commits then
Yup, I will. I wanted to keep it in the history while the PR is a draft, just to be able to inspect that change should we eventually need it.
Yup, still booms:
File "theories/dune", line 3104, characters 1-1198:
3104 | (rule
3105 | (targets ConstructiveRcomplete.glob ConstructiveRcomplete.vos
3106 | ConstructiveRcomplete.vo)
......
3114 | ../.././Reals/Cauchy/PosExtra.vo
3115 | ../.././Reals/Cauchy/ConstructiveExtra.vo ../.././QArith/Qminmax.vo)
3116 | (action (run coqc -boot -R ../../. Coq -I ../../../plugins/ltac -I ../../../plugins/derive -I ../../../plugins/btauto -I ../../../plugins/nsatz -I ../../../plugins/ring -I ../../../plugins/ssr -I ../../../plugins/firstorder -I ../../../plugins/syntax -I ../../../plugins/rtauto -I ../../../plugins/funind -I ../../../plugins/micromega -I ../../../plugins/ssrmatching -I ../../../plugins/cc -I ../../../plugins/extraction -I ../../../plugins/ltac2 -w +default -q -w -deprecated-native-compiler-option -native-compiler off %{dep:ConstructiveRcomplete.v})))
(cd _build/default/theories/Reals/Cauchy && ../../../../install/default/bin/coqc -boot -R ../../. Coq -I ../../../plugins/ltac -I ../../../plugins/derive -I ../../../plugins/btauto -I ../../../plugins/nsatz -I ../../../plugins/ring -I ../../../plugins/ssr -I ../../../plugins/firstorder -I ../../../plugins/syntax -I ../../../plugins/rtauto -I ../../../plugins/funind -I ../../../plugins/micromega -I ../../../plugins/ssrmatching -I ../../../plugins/cc -I ../../../plugins/extraction -I ../../../plugins/ltac2 -w +default -q -w -deprecated-native-compiler-option -native-compiler off ConstructiveRcomplete.v)
Command got signal SEGV.
I cut and pasted the command, can't reproduce :S
I'm gonna run that in a loop
and add some mem pressure
ah, I just managed to get a SIGSEGV
is it related to your PR? When I tweaked the few places needed to make it compile with 5.0, it worked on the stdlib
I can reproduce the SIGSEGV with 8 cores running, but not when invoking the compilation by hand
not sure how to debug that though, I would need to hook gdb somehow
Can you check dmesg? Executables generated by OCaml 5 are much heavier wrt virtual memory, so perhaps your kernel is just dying under the load?
I'm getting stuff like that:
[23545.033006] coqc[186763]: segfault at 7f80ffc5f000 ip 000055996fad95f2 sp 00007ffce85fe070 error 6 in coqc_bin.exe[55996f3e3000+737000]
[23545.033020] Code: e0 03 48 29 42 08 48 8b 02 48 8b 4a 08 48 39 c1 0f 82 fd 26 00 00 41 8d 44 24 02 49 63 f4 48 98 48 c1 e0 0a 48 05 f7 00 00 00 <48> 89 01 48 8b 42 08 48 c7 40 10 05 00 00 00 49 63 17 48 8d 78 08
can I objdump the binary to see what lives at that address in some way?
(I'm getting a weird error when dumping the binary: "objdump: Erreur DWARF: numéro de ligne de section mutilé", i.e. a mangled line number section)
is the binary corrupted?
Just run gdb and type disas 0x7f80ffc5f000.
My mistake, it should be disas 0x000055996fad95f2.
"No function contains specified address."
maybe a dynlinked piece of code?
Make sure that /proc/sys/kernel/randomize_va_space is set to zero first, otherwise the addresses are completely random. (But it might also be some dynlinked code indeed.)
It is indeed annoying; on the other hand, reproduction is "deterministic" with dune build
I'm gonna change the build rules so we wrap every coqc call in gdb
and hope that I can get to a terminal
with randomize_va_space set to 0 I get the same error at the same place
but still no function at that address
ah, after a bit of linking it does work
and tadaaaa
it's coq_interprete from the VM
(how surprising)
Good that you got some info, Pierre-Marie; here's the patch I'm using, not sure it will be enough:
diff --git a/tools/dune_rule_gen/coq_rules.ml b/tools/dune_rule_gen/coq_rules.ml
index 9e4de8cb02..3737a33879 100644
--- a/tools/dune_rule_gen/coq_rules.ml
+++ b/tools/dune_rule_gen/coq_rules.ml
@@ -229,7 +229,7 @@ let module_rule ~(cctx : Context.t) coq_module =
let deps = cctx.async_deps @ deps in
(* Build rule *)
let vfile_base = Path.basename vfile in
- let action = Format.asprintf "(run coqc %s %%{dep:%s})" flags vfile_base in
+ let action = Format.asprintf "(run gdb --eval-command=run --batch --args coqc %s %%{dep:%s})" flags vfile_base in
let targets = Coq_module.obj_files ~tname ~rule coq_module in
let alias = if rule = Coq_module.Rule_type.Quick then Some "vio" else None in
{ Dune_file.Rule.targets; deps; action; alias }
not sure I can disassemble it on sight, but I can give the dump of that function on my machine
Got this on my side
Program received signal SIGSEGV, Segmentation fault.
0x000055555627ac6a in coq_interprete (coq_pc=0x555556ac5620, coq_accu=93825014219600, coq_atom_tbl=140737074123144, coq_global_data=93825014797848, coq_env=140702641348552, coq_extra_args=93825014715928, coq_extra_args@entry=0) at coq_interp.c:811
811 Coq_alloc_small(accu, nfuncs * 3 + nvars, Closure_tag);
the VM closure code strikes again :shrug:
same place on my side, we're onto something
actually no, I misread: mine is failing on
784 Coq_alloc_small(accu, 2 + nvars, Closure_tag);
but that's a pretty close issue
I'm hoping to get more info
Could you look for caml_process_pending_actions_exn in the file and add Setup_for_gc on the line before and Restore_after_gc after?
(The comments in the OCaml interpreter state that "we could get rid of that", but perhaps they are a bit optimistic.)
nope, still a SIGSEGV
the corresponding OCaml code for setup / restore does a lot more stuff though, does this matter?
Not getting any luck reproducing this ...
It does not matter. The OCaml interpreter has to do more work because it needs to safely be able to interrupt itself in case of thread yields. In our case, we are sure that the execution will resume from the point it stopped. (If we ever allow several vm_compute to run concurrently on the same domain through effect scheduling, we might have to reconsider.)
hmmm, I blindly copied the macro for Alloc_small from the OCaml master branch and I don't get any more SIGSEGV
diff --git a/kernel/byterun/coq_interp.c b/kernel/byterun/coq_interp.c
index 0bc79b71ac..847839c65f 100644
--- a/kernel/byterun/coq_interp.c
+++ b/kernel/byterun/coq_interp.c
@@ -115,8 +115,13 @@ if (sp - num_args < coq_stack_threshold) { \
#define Restore_after_caml_call coq_env = *sp++;
#if OCAML_VERSION >= 50000
-#define Coq_alloc_small(result, wosize, tag) Alloc_small(result, wosize, tag, \
- { Setup_for_gc; caml_process_pending_actions(); Restore_after_gc; })
+#define Enter_gc(dom_st, wosize) do { \
+ Setup_for_gc; \
+ Alloc_small_enter_GC_flags(CAML_DO_TRACK | CAML_FROM_CAML, \
+ dom_st, wosize); \
+ Restore_after_gc; \
+ } while (0)
+#define Coq_alloc_small(result, wosize, tag) Alloc_small(result, wosize, tag, Enter_gc)
#else
#define Coq_alloc_small(result, wosize, tag) Alloc_small(result, wosize, tag)
#endif
But CAML_FROM_CAML is plain wrong in our case.
just reporting random fiddling with code I do not understand
You should at least use CAML_FROM_C.
one last round of compilation before I have to go
still no SIGSEGV with CAML_FROM_C
Ok, finally got a usable core dump (30 GiB though)
#0 0x000055555627ac6a in coq_interprete (coq_pc=0x555556ac5620, coq_accu=93825014219600, coq_atom_tbl=140737074123144, coq_global_data=93825014797848, coq_env=140702641353472,
coq_extra_args=93825014715928, coq_extra_args@entry=0) at coq_interp.c:811
#1 0x000055555627e6a4 in coq_interprete_ml (tcode=<optimised out>, a=1, t=140737074123144, g=93825014797848, e=140702710707168, ea=1) at coq_interp.c:2006
#2 <signal handler called>
#3 0x00005555560abe61 in camlVmsymtable__eval_to_patch_1655 () at kernel/vmsymtable.ml:325
#4 0x00005555560af828 in camlVconv__vm_conv_gen_1361 () at kernel/vmsymtable.ml:350
#5 0x00005555560afa26 in camlVconv__vm_conv_1593 () at kernel/vconv.ml:210
#6 0x00005555560ce231 in camlTypeops__check_cast_2116 () at kernel/typeops.ml:346
#7 0x00005555560d0072 in camlTypeops__execute_2688 () at kernel/typeops.ml:637
#8 0x00005555560d1e88 in camlTypeops__fun_3974 () at kernel/typeops.ml:772
#9 0x00005555561e94d2 in camlCArray__map_i_1162 () at clib/cArray.ml:518
#10 0x00005555560d1e0e in camlTypeops__execute_array_2691 () at kernel/typeops.ml:772
#11 0x00005555560d057c in camlTypeops__execute_2688 () at kernel/typeops.ml:594
#12 0x00005555560d048d in camlTypeops__execute_2688 () at kernel/typeops.ml:628
#13 0x00005555560d048d in camlTypeops__execute_2688 () at kernel/typeops.ml:628
#14 0x00005555560d048d in camlTypeops__execute_2688 () at kernel/typeops.ml:628
#15 0x00005555560d02c3 in camlTypeops__execute_2688 () at kernel/typeops.ml:610
#16 0x00005555560d02c3 in camlTypeops__execute_2688 () at kernel/typeops.ml:610
#17 0x00005555560d02c3 in camlTypeops__execute_2688 () at kernel/typeops.ml:610
#18 0x00005555560d03d2 in camlTypeops__execute_2688 () at kernel/typeops.ml:623
#19 0x00005555560d1e88 in camlTypeops__fun_3974 () at kernel/typeops.ml:772
#20 0x00005555561e94d2 in camlCArray__map_i_1162 () at clib/cArray.ml:518
#21 0x00005555560d1e0e in camlTypeops__execute_array_2691 () at kernel/typeops.ml:772
#22 0x00005555560d057c in camlTypeops__execute_2688 () at kernel/typeops.ml:594
#23 0x00005555560d03d2 in camlTypeops__execute_2688 () at kernel/typeops.ml:623
#24 0x00005555560d1e88 in camlTypeops__fun_3974 () at kernel/typeops.ml:772
#25 0x00005555561e94d2 in camlCArray__map_i_1162 () at clib/cArray.ml:518
#26 0x00005555560d1e0e in camlTypeops__execute_array_2691 () at kernel/typeops.ml:772
#27 0x00005555560d057c in camlTypeops__execute_2688 () at kernel/typeops.ml:594
#28 0x00005555560d02c3 in camlTypeops__execute_2688 () at kernel/typeops.ml:610
#29 0x00005555560d05fa in camlTypeops__execute_2688 () at kernel/typeops.ml:601
#30 0x00005555560d17cf in camlTypeops__build_one_branch_2803 () at kernel/typeops.ml:702
#31 0x00005555561e94d2 in camlCArray__map_i_1162 () at clib/cArray.ml:518
#32 0x00005555560d0d8a in camlTypeops__execute_2688 () at kernel/typeops.ml:707
#33 0x00005555560d048d in camlTypeops__execute_2688 () at kernel/typeops.ml:628
#34 0x00005555560d02c3 in camlTypeops__execute_2688 () at kernel/typeops.ml:610
#35 0x00005555560d02c3 in camlTypeops__execute_2688 () at kernel/typeops.ml:610
#36 0x00005555560d02c3 in camlTypeops__execute_2688 () at kernel/typeops.ml:610
#37 0x00005555560d02c3 in camlTypeops__execute_2688 () at kernel/typeops.ml:610
#38 0x00005555560d1fec in camlTypeops__infer_2976 () at kernel/typeops.ml:785
#39 0x00005555560d6b38 in camlTerm_typing__check_delayed_1855 () at kernel/term_typing.ml:285
#40 0x00005555560ef87c in camlSafe_typing__check_opaque_3370 () at kernel/safe_typing.ml:906
#41 0x0000555555cd1ea8 in camlOpaques__fun_1316 () at vernac/opaques.ml:97
#42 0x0000555555ca2845 in camlFuture__chain_800 () at vernac/future.ml:151
#43 0x0000555555ca2aab in camlFuture__chain_820 () at vernac/future.ml:163
#44 0x0000555555cd1cfa in camlOpaques__declare_defined_opaque_898 () at vernac/opaques.ml:96
#45 0x0000555555cf3fc5 in camlDeclare__declare_constant_inner_9266 () at vernac/declare.ml:441
#46 0x0000555555cf5540 in camlDeclare__declare_entry_core_inner_9343 () at vernac/declare.ml:658
#47 0x00005555561cbab5 in camlCList__map_i_730 () at clib/cList.ml:330
#48 0x0000555555cfeb60 in camlDeclare__finalize_proof_7042 () at vernac/declare.ml:2102
#49 0x0000555555d79447 in camlVernacinterp__fun_2186 () at vernac/declare.ml:2161
#50 0x00005555561d5a38 in camlNeList__map_head_370 () at clib/neList.ml:35
#51 0x0000555555d79379 in camlVernacinterp__interp_qed_delayed_1717 () at vernac/vernacinterp.ml:228
let me know, folks, if you need access or want me to run some gdb command
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055555627ac6a in coq_interprete (coq_pc=0x555556ac5620, coq_accu=93825014219600, coq_atom_tbl=140737074123144, coq_global_data=93825014797848, coq_env=140702641353472,
coq_extra_args=93825014715928, coq_extra_args@entry=0) at coq_interp.c:811
811 Coq_alloc_small(accu, nfuncs * 3 + nvars, Closure_tag);
So, from looking at the OCaml code, it seems that the assumption that running the garbage collector actually frees some memory no longer holds. (The following comment was removed: "When caml_garbage_collection returns, we assume there is enough space in the minor heap for the triggering allocation.") So, the garbage collector needs to be hammered in a loop until memory has actually been freed, which is the purpose of Alloc_small_enter_GC_flags.
doesn't seem either efficient or guaranteed to terminate to me, but well...
The good news is that it will automatically "solve" https://github.com/coq/coq/issues/13439 by infinite loop.
@Guillaume Melquiond are you writing the PR to fix the code?
I can, but only later today.
No hurry, it was just to be sure.
I was hoping for a bottom line on the OCaml 5.0 thing - are we going to enable that in opam and so on?
Karl Palmskog said:
I was hoping for a bottom line on the OCaml 5.0 thing - are we going to enable that in opam and so on?
Support seems mostly ready modulo testing; there are 2 remaining TODOs:
- coq-native must conflict with OCaml 5.0
- if nobody else will do it, I guess I can do the extraction fixes PR next week
@Emilio Jesús Gallego Arias so it is enough to make this package conflict with base-nnp / ocaml-option-nnpchecker? Is that the preferred way? https://github.com/ocaml/opam-repository/blob/master/packages/coq-native/coq-native.1/opam
Karl Palmskog said:
Emilio Jesús Gallego Arias so it is enough to make this package conflict with base-nnp / ocaml-option-nnpchecker? Is that the preferred way? https://github.com/ocaml/opam-repository/blob/master/packages/coq-native/coq-native.1/opam
That could work, yes; not sure what the official opam-repo policy is, but it seems like a good idea to do that for the 4.x compilers that don't support NP as well.
perhaps we can ping @kit-ty-kate for advice on what the "proper" way is to make the virtual coq-native package mutually exclusive with OCaml 5.
I've done the last two items in https://github.com/coq/coq/pull/15494 , so maybe we can consider merging it.
Still a small tweak seems needed in the source code tho.
yes, conflicting with base-nnp and ocaml-option-nnpchecker is indeed the right way to do this
https://github.com/ocaml/opam-repository/pull/22667
looks like after 15494 is merged, the "only" thing left is to get plugins in CI compatible (and I guess there may be library and/or extraction issues as well with some projects)
from the opam repo side, we can at least let people test coq.dev with OCaml 5 by removing the nnp conflict, and maybe add a CI job
@Théo Zimmermann, all the PRs needed for OCaml 5 support are now submitted and placed in the right milestone. They seem pretty minimal to me, so they should be fine for 8.17, but of course that's the RM's call.
this is probably known, but I just tried installing coq.dev from the core-dev opam repo with OCaml 5, and I get this error with coq-stdlib.dev:
dune build --root . theories_dune ltac2_dune
File "lib/control.ml", line 73, characters 8-19:
73 | Thread.exit ()
^^^^^^^^^^^
Error (alert deprecated): Thread.exit
Use 'raise Thread.Exit' instead.
make: *** [Makefile:123: .dune-stamp] Error 1
Will this be fixed by something like https://github.com/coq/coq/pull/15494 or am I missing something?
Seems like your OPAM file is bogus (building with -warn-error)
Can you point out which opam file you are using, concretely?
Sorry, never mind what I said; yes, this needs a patch for the bootstrap
let me submit it
https://github.com/coq/opam-coq-archive/blob/master/core-dev/packages/coq-stdlib/coq-stdlib.dev/opam
and yes it is fixed in the CI PR
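(For reference, the substitution the deprecation alert asks for looks roughly like the sketch below; the function name is made up for illustration and this is not Coq's actual lib/control.ml code.)
(* Sketch only: in OCaml 5, Thread.exit () triggers a deprecation alert and
   the suggested replacement is to raise the Thread.Exit exception instead.
   [terminate_worker] is an illustrative name, not a Coq function. *)
let terminate_worker () =
  (* previously: Thread.exit () *)
  raise Thread.Exit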
@Karl Palmskog https://github.com/coq/coq/pull/17010
thanks for noticing that bit, it was the only fix not submitted as a separate PR
thanks, I'll test everything again after it (hopefully) gets merged
tried PR 17010 and started playing around with a bunch of plugins and extraction projects. Everything seems to work so far; all I got were some additional deprecation warnings for the extraction projects.
now coq.dev from the core-dev opam repo installs fine on 5.0.0! :tada:
(I hope that 8.17 backports are still on the table.)
If master works then I believe we can run a bench
@Karl Palmskog all the PRs currently merged in master and targeting 8.17 (I think this is the case for all OCaml 5-compatibility PRs) have been backported to v8.17, except coq/coq#17010, which is in the process of being backported (but given that it just fixes a warning, I don't think it is required for compatibility). So in principle, coq.8.17.dev should be as compatible with OCaml 5 as coq.dev right now.
without 17010 the installation of coq.8.17.dev will break
OK. Expect this PR to be pushed to v8.17 by the end of today.
so maybe you can ping me on that backport PR and I will update coq.8.17.dev in the opam archive when it is merged
I can ping you when the PR is backported. There is no "backport PR", FTR, but you can use https://github.com/coq/coq/projects/56 or https://coq.zulipchat.com/#narrow/stream/240008-GitHub-notifications/topic/Coq.20Push.20Notifications to track what is backported.
OK, then I guess you'll either ping me, or I check later and do the opam archive PR and ping you.
maybe a good idea to add an OCaml 5 job in opam archive at the same time.
@Karl Palmskog OK, that last PR is backported now.
looks like we are open for business (have working CI job) now in testing packages with 8.17+rc1 and OCaml 5.0.0 in the opam archive
the bench job honestly looks a bit disheartening. Based on this, most Coq projects would be better off staying on OCaml 4 if they could (no Coq project in CI saw an improvement, and most are 10-20% slower).
Or am I reading it wrong? Will stuff like flambda, etc., change the picture?
Don't you want flambda for both OCaml versions? 10-20% is the speedup I'd expect from it
Sorry: it's just in the right ballpark, but it could be tons of other things...
I was thinking there was a chance that flambda could affect 4 and 5 differently.
(skims log) I see, that's comparing 4 and 5 both without flambda. Yeah, "both with flambda" would be more interesting.
looks like 4.14.1 is also out, maybe this is what we will be using for Coq for the next few years? https://github.com/ocaml/ocaml/releases/tag/4.14.1
another theory: the heap tuning stuff in Coq might be off since they seemingly changed memory management quite a lot.
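(For context, the kind of heap tuning meant here is adjusting OCaml's GC parameters at startup; a rough sketch follows, with illustrative values rather than Coq's actual settings.)
(* Illustrative sketch only: tune the OCaml GC at startup. The values below
   are made up for the example and are not Coq's actual settings. *)
let tune_gc () =
  Gc.set { (Gc.get ()) with
           Gc.minor_heap_size = 32 * 1024 * 1024;  (* in words *)
           Gc.space_overhead = 120 }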
Possible, and/or 5.0.0 still has immature performance but that'll be fixed soon.
https://ocaml.org/news/ocaml-5.0 has specifics on "immature performance" in the intro
hmm, the only thing I see on performance is:
OCaml 5 as a language is fully compatible with OCaml 4 down to the performance characteristics of your programs. In other words, any code that works with OCaml 4 should work the same with OCaml 5.
which, if we take the bench at face value, is not the case here in the Coq sphere
has anybody opened an OCaml issue on the topic? We have some data points with the bench.
Consequently, OCaml 5.0.0 is expected to be a more experimental version of OCaml than the usual OCaml releases. [...]
the performance of ephemerons is currently (and temporarily) strongly degraded.
I guess I could do it [open an issue], but maybe some Coq core dev with connections to the OCaml team would be a better choice
does Coq even use ephemerons?
seems so, but I'd really rather defer to somebody with more clue: https://github.com/coq/coq/blob/110921a449fcb830ec2a1cd07e3acc32319feae6/clib/cEphemeron.ml wraps Ephemeron and that's used a few times https://github.com/coq/coq/search?q=cephemeron
CEphemeron is not quite ephemerons; rather, it's a weak table.
(Yes, this is basically the same thing, but the performance profile should be distinct.)
I don't think it would be enough to produce the differences we see in the bench. Rather, I'm siding with the people who believe there is some GC option that went awry.
do we want to summon someone from the OCaml team here maybe? Gabriel has commented before, and might at least advise on next steps.
I've forwarded the results to the multicore team in Slack
The lines with the biggest slowdowns in the bench are Qeds, which put heavy pressure on hashconsing. Given that we implement hashconsing with weak tables (and thus internally ephemerons), the slowdown might come from there.
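(For readers following along: hashconsing over weak tables looks roughly like the sketch below; this is illustrative standard-library code, not Coq's actual CEphemeron / hashcons implementation.)
(* Illustrative only: a hash-consing table built on the standard library's
   weak hash sets, the kind of structure under discussion here. *)
module StringHcons = Weak.Make (struct
  type t = string
  let equal = String.equal
  let hash = Hashtbl.hash
end)

let table = StringHcons.create 257

(* Return a canonical representative of [s]; equal strings become physically
   shared, and entries no longer referenced elsewhere can be collected. *)
let hcons (s : string) : string = StringHcons.merge table s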
My understanding is that OCaml 5, if you don't take advantage of parallelism, is expected to be slightly slower. Which makes sense, there is no miracle.
I wonder if things like CoqHammer become easier with OCaml 5.0. E.g. the reason it is currently not supported on Windows is that it relies on forks to run different reconstruction tactics in parallel starting from the current state and then kills all except the first successful one. Would OCaml 5 help with that?
I think you would need fine-grained thread support in Coq (application level) first?
but I think threads/tasks inside a language are generally considered much more portable/manageable than processes...
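(Sketch of what OCaml 5 offers at the language level for this kind of "try several candidates, keep the first success" pattern, using Domain-based parallelism; purely illustrative, it says nothing about whether Coq tactic state can safely be shared, and it waits for all candidates instead of cancelling the losers.)
(* Illustrative only: run each candidate in its own domain and keep the first
   success. A real implementation would also need cancellation and would cap
   the number of domains. *)
let first_success (candidates : (unit -> 'a option) list) : 'a option =
  candidates
  |> List.map (fun f -> Domain.spawn f)   (* start one domain per candidate *)
  |> List.map Domain.join                 (* wait for all of them to finish *)
  |> List.find_opt Option.is_some
  |> Option.join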
The reason CoqHammer forks is that it wants to explore from the current state in different directions. The par: support likely does something similar, so I guess there is some way to "fork the current state" at the thread level in Coq already. @Enrico Tassi: can you share your insights into this?
par: is processes too, not threads
to get threads into Coq like Isabelle, I think you'd have to do something equivalent to what Makarius did, e.g., in Shared-Memory Multiprocessing for Interactive Theorem Proving
most tactics don't touch global state so running tactics is almost thread safe, but there are exceptions, e.g.
- abstract (global env, nametab)
- with_strategy (implemented by globally setting and resetting the conversion strategy state)
- ring (some crazy global ref hack)
ah, but the plugins are a different story, no? So to be on the safe side, you'd probably want to mark tactics as parallel-safe or not
Enrico Tassi said:
My understanding is that OCaml 5, if you don't take advantage of parallelism, is expected to be slightly slower. Which makes sense, there is no miracle.
The release notes say: "OCaml 5 as a language is fully compatible with OCaml 4 down to the performance characteristics of your programs", so purely sequential programs are expected to be exactly as fast as before.
as per the bench, the performance is not even close to on par with regular 4.14 for Coq (some projects had >= 40% slowdown)
I see. Well, I would appreciate it if it were easier to write plugins like CoqHammer in a platform-independent way (since all the platform-dependent stuff ends up on my desk). If one moves to OCaml 5, one should put this on the agenda.
I think the OCaml multicore effort took 8 years of work, hopefully it might be less for Coq
IMHO good automation is essential for practical use of proof assistants and efficient platform independent development of plugins is essential for automation. So this is not just a "nice to have" thing.
I completely agree it should be a priority to get fine-grained thread parallelism into Coq. I tried to collect state of the art in proof assistant parallel processing in this ISSTA'18 paper. Some things may have happened since then, but I believe Lean 4 did not do proof-parallel processing, only file level
Rodolphe Lepigre said:
Enrico Tassi said:
My understanding is that OCaml 5, if you don't take advantage of parallelism, is expected to be slightly slower. Which makes sense, there is no miracle.
The release notes say: "OCaml 5 as a language is fully compatible with OCaml 4 down to the performance characteristics of your programs", so purely sequential programs are expected to be exactly as fast as before.
I don't recall my source, and I'm sure they did the best they could to be as fast as before, but adding features can hardly come at no cost. IMO 10% slower would still be more than acceptable (I'd pay much more for fair scheduling, which seems to be part of the deal here). Unfortunately the benchmarks are a bit worse than that (so far, tuning seems still in progress). But don't get me wrong, I think 5.0 is a very impressive piece of work.
even with the same GC approach in 4.14/5.0, typical projects get a slowdown of 15-30% in the latest bench. That's a pretty high current price for the possibility of Coq-multicore improvements in the future...
I'd be cautious about trying to infer anything useful from the current bench numbers; they are still far from final.
The branch goes back to default GC settings, I guess to measure a baseline. These settings are known to be much worse on 4.x, so looking at these numbers is surely misleading. I guess we will need to wait for @Emilio Jesús Gallego Arias to do his measurements and write a summary.
Yes, in particular the OCaml 5 GC still needs the porting of some nice 4.x GC features that indeed didn't make the cut for the 5.x release; I understand this is nothing intrinsic to multi-core, just that the rewrite didn't manage to catch up yet with 4.x development
The only baseline we have for Coq is the PLDI multicore paper
I mean the only baseline that has been properly verified.
which paper is that more precisely?
IIRC "Retrofitting Parallelism onto OCaml" and "Retrofitting Effect Handlers onto OCaml"
Coq benches are only in the latter (PLDI 2021) but indeed they are against stock OCaml
I guess you mean this and this
Emilio Jesús Gallego Arias said:
this is nothing intrinsic to multi-core
One of the fundamental design choices of Multicore OCaml was that memory allocation should not require any synchronization between domains (unless memory is exhausted). In particular, none of the OCaml 4 memory allocators (including the "best-fit" one used by Coq) are suitable for OCaml 5 since they require synchronization of their free lists (or equivalent). So, OCaml 5 comes with its own naive allocator. And it has to be naive since every domain is responsible for sweeping its own allocations, so it has to be able to track them cheaply. I am sure that the OCaml developers will be able to make it less naive as time goes, but this is not just a matter of porting a few changes from OCaml 4, it is a complete reimplementation.
I find it a bit difficult to reconcile this with the claim that "OCaml 5 as a language is fully compatible with OCaml 4 down to the performance characteristics of your programs". Seems like this is bound to lead to many letdowns (and criticism).
why didn't they just say something like: the performance will be worse than OCaml 4 in some cases [not just ephemerons], but is expected to improve with additional work on the GC?
I find the sentence a bit ambiguous, but even in the current state, I don't find the slowdown unacceptable for a first release. It is not that you can't use Coq if it is 20% slower, that is how I interpret the sentence.
IMO it is good you can finally run it, and do proper measurements. Nobody forces us to switch to 5.0 for the platform, for example, so they have time to improve it while we still get the same perf in "production".
agree with both: it's acceptable for a first release, and good enough for testing, and it'll disappoint users who ship it in production
Guillaume Melquiond said:
Emilio Jesús Gallego Arias said:
this is nothing intrinsic to multi-core
One of the fundamental design choices of Multicore OCaml was that memory allocation should not require any synchronization between domains (unless memory is exhausted). In particular, none of the OCaml 4 memory allocators (including the "best-fit" one used by Coq) are suitable for OCaml 5 since they require synchronization of their free lists (or equivalent). So, OCaml 5 comes with its own naive allocator. And it has to be naive since every domain is responsible for sweeping its own allocations, so it has to be able to track them cheaply. I am sure that the OCaml developers will be able to make it less naive as time goes, but this is not just a matter of porting a few changes from OCaml 4, it is a complete reimplementation.
Indeed I'm not sure the 4.x changes being ported to OCaml 5 (such as prefetching) will help a lot, but there is a backlog there, if I understood OCaml devs correctly, and as noted in the PR.
For now, the causes of slowdowns over 10% are not well understood, and indeed the first step is to check the backlog. And indeed, if you look at the latest results it does matter, though it is still not the root cause of the problem for the large slowdowns
OK, I think I now have a better intuition for some of the OCaml 5 bench results. Some projects like fourcolor have a much smaller slowdown than others in OCaml 5, and it isn't clear why. However, fourcolor happens to be one of the most file-level parallelizable projects out there (the other one is Verdi Raft), which makes it able to "catch up" in real time to OCaml 4 using several cores. So I'd conjecture that the OCaml 5 slowdown is consistently in the 30% range with no parallelism
Does that mean the slowdown is smaller in real time, but should be the same in user time?
that's what I see locally, at least. A significant increase in user time for all projects, but only a modest real-time increase for highly parallelizable projects
but more reliable user time measurements are hard to get for some reason, I guess
every measure so far is very unscientific. I hope to get access to a 16-core Ryzen 9 soon and will probably run some more informal but more robust benchmarking there
the bench runs with -j1 (except for the stdlib) so I don't understand why file level parallelism matters
The observation wasn't only based on the bench. But, for example, this bench took around 10 min for Verdi Raft. You'd need a super fast single core to do the whole of Verdi Raft in 10 min; CI with multicore regularly takes 14 min.
I don't understand your argument, that bench certainly ran with -j1 (you can see the command opam ran in https://gitlab.com/coq/coq/-/jobs/3496212740/artifacts/file/_bench/logs/coq-verdi-raft.NEW.1.1.perf)
CI only uses -j2 and also includes some deps in the job (see https://github.com/coq/coq/blob/master/dev/ci/ci-verdi_raft.sh)
also 5.0 is about intra-file parallelism, not inter-file; how would inter-file parallelism change?
(and the coq version benched does not exploit the new threading capabilities)
on my current 2019-era Ryzen running around 3.5 GHz, time make -j1 for Verdi Raft (not including its dependencies) takes around 16 min on OCaml 5 Coq. So if we take those bench numbers at face value, it doesn't seem plausible they are actually done with a single core.
the bench machines are https://www.intel.com/content/www/us/en/products/sku/191043/intel-xeon-e2246g-processor-12m-cache-3-60-ghz/specifications.html
The numbers in https://gitlab.com/coq/coq/-/jobs/3496212740 claim to be about user time, so they can't be affected by this "scheduling" theory
OTOH, PMP claimed the slowdown affects mostly Qed times, which matter differently for different projects.
Also I wouldn't bet on different machines having the same slowdown: if those Qed are memory-bound, a better memory subsystem might lessen the impact.
In general, a single-core program could use multiple cores for parallel GC, and JVMs have that option; that might explain how j1 isn't really single-threaded. But user time would account for that, and Guillaume's explanation suggests that effect should not matter, by design?
So, I did some perf-testing. The slowdown seems to come for the most part from dynamic linking. The code was rewritten for OCaml 5 (presumably so that plugins can be loaded concurrently from any domain) and it is now quadratic in the number of plugins. Since Coq is .cmxs-heavy, if you have lots of small .v files, performance crashes.
This does not seem to correspond to the per-line diff from the bench, though. Did you test with native-compiler enabled? If so this would explain why cmxs become an issue.
Forget about it, I just remembered that native compilation is disabled in Coq + OCaml 5.0.
Still, slowdowns are concentrated in Qed lines, and those don't perform dynlink unless the native compiler is on.
Most .cmxs are loaded through the prelude. I doubt the bench measures the time needed to load the prelude.
Stephen Dolan confirms my understanding of the code: "The trouble is that the OCaml 5 dynlink logic is fundamentally O(n^2) as it rebuilds the entire frame descriptor table each time, while the OCaml 4 version is O(n) as it mutates the existing one in-place."
Indeed the per-line diffs are not observing the prelude loading (but the total time does). But still, the number of dynlinked cmxs is not that big (I know that because these days the debugger stops on every dynlink), so even a quadratic blowup shouldn't be that terrible.
Apparently, it is quadratic in the number of modules rather than plugins.
Ah. That's definitely worse, indeed.
Quoting again: "It's much worse: it rebuilds the table once per module contained in each loaded .cmxs, because it needs a populated table between the initialisation of any two modules."
any hope to do better? I don't think Coq is the only OCaml development relying on dynlink
In fact, the problem was originally detected with Frama-C: "the startup time for Frama-C is about 1.2s longer, which, for short benchmarks, can increase execution by 300% or 400%."
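(A crude way to observe this outside of Coq would be a micro-benchmark along the lines of the sketch below; the plugin file names are placeholders, and the quadratic-vs-linear behaviour is the one described in the quotes above.)
(* Illustrative micro-benchmark sketch: time the loading of many small
   plugins through Dynlink. File names are placeholders. *)
let () =
  let t0 = Unix.gettimeofday () in
  for i = 1 to 200 do
    Dynlink.loadfile (Printf.sprintf "plugin_%03d.cmxs" i)
  done;
  Printf.printf "loaded 200 plugins in %.3fs\n" (Unix.gettimeofday () -. t0)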
I know that because these days the debugger stops on every dynlink
put set break_on_load off in your $HOME/.ocamldebug
we can see the increase in eg https://coq.gitlab.io/-/coq/-/jobs/3496212740/artifacts/_bench/html/coq-hott/theories/Basics/Notations.v.html and https://coq.gitlab.io/-/coq/-/jobs/3496212740/artifacts/_bench/html/coq-hott/theories/Basics/Overture.v.html
(hott is -noinit so the declare ml module commands are in the files and appear in the line diff)
Nice find @Guillaume Melquiond ; indeed what I call "document creation time" is often far from trivial; I measure it with coq-lsp.
Maybe we should add a patch so we emit a time sentence for the prelude loading? That's quite easy to do.
Just wrap the call to Stm.new_doc
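(Something along these lines; a sketch only, since Stm.new_doc's actual arguments are not shown in this thread and doc_args is a placeholder.)
(* Sketch: wrap a call and report how long it took. *)
let time_call label f x =
  let t0 = Unix.gettimeofday () in
  let result = f x in
  Printf.eprintf "%s took %.3fs\n%!" label (Unix.gettimeofday () -. t0);
  result

(* e.g. let doc = time_call "prelude (Stm.new_doc)" Stm.new_doc doc_args *)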
PR done.
Do we gain something by wrapping all plugins? I guess in that case each plugin is one module, but I'm not sure
that's not how wrapping works
and we already wrap plugins
so wrapping is not using -pack?
correct
coq_makefile does; funnily, dune does not. I'm wondering if there is an option for it, since it could mitigate this problem.
While there is only one module left after packing, there are still as many frame tables as in the unpacked case (and in fact, one more, for the packed module itself). In other words, packing is really just a concatenation of modules nowadays. So, it would not help reducing the number of frame tables (which is the actual thing that matters for dynlink performance).