I made a change that replaced associative lists with tries and FMaps, and Recursive Extraction
shot up from around 20s to around 90--95 minutes. How do I debug this? (cc @Pierre-Marie Pédrot ) (Code here)
Could you do an strace before and after? They should have around the same lines. If they do then we know extraction isn't messing around finding files.
We can also drop some Printfs in extraction to find out what it is spending it's time on.
... or perf the extraction process, you'll get a much clearer picture.
Usually extraction is slow because of the kind of optimizations it performs, but I haven't checked yet on this one.
I'll try perfing it shortly, but note that this is with Global Unset Extraction Optimize.
Should I just do perf record -g
on coqc?
Indeed. The fact this happens even with optimization unset makes it even more interesting, also.
(I was not able to compile locally your branch with coq master, is it expected?)
(ah, no, it was me not realizing this dev had recursive submodules and I didn't update them accordingly)
The top of perf report
on the fast (20s) dev:
Samples: 102K of event 'cpu-clock:uhpppH', Event count (approx.): 25656500000
Children Self Command Shared Object Symbol
+ 57.57% 0.49% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__fun_3448
+ 41.45% 0.03% coqc coqc [.] camlPp__prlist_sep_lastsep_606
+ 41.43% 0.10% coqc coqc [.] camlStdlib__list__map_233
+ 41.12% 0.02% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Common__pp_tuple_1590
+ 36.24% 0.49% coqc coqc [.] camlCList__map_802
+ 33.40% 3.58% coqc coqc [.] camlCList__map_loop_795
+ 24.25% 0.17% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__extract_maybe_term_3305
+ 22.30% 0.03% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Common__pp_ocaml_gen_3427
+ 21.79% 0.13% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__f_3431
+ 21.65% 0.03% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__pp_one_pat_2041
+ 21.22% 0.15% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Common__pp_ocaml_local_3405
+ 20.11% 0.00% coqc coqc [.] camlPp__prvecti_with_sep_647
+ 20.04% 0.02% coqc coqc [.] camlStdlib__array__mapi_175
+ 19.97% 0.01% coqc coqc [.] camlPp__fun_1318
+ 19.91% 0.02% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__fun_3715
+ 19.24% 2.73% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Common__clash_3361
+ 19.12% 0.47% coqc coqc [.] camlStdlib__list__fold_left_272
+ 18.55% 0.27% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__extract_cons_app_3309
+ 18.16% 0.00% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__ppl_2955
+ 18.14% 0.00% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__prlist_sep_nonempty_2942
+ 17.93% 0.00% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__pp_decl_2648
+ 17.70% 0.00% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__pp_function_2043
+ 17.20% 0.11% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__extract_term_3304
+ 17.08% 0.00% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__pp_structure_elem_2905
+ 17.03% 0.00% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__try_pp_structure_elem_2934
+ 16.96% 0.00% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Ocaml__pp_module_expr_2906
+ 14.37% 0.00% coqc coqc [.] camlFlags__with_modified_ref_inner_288
+ 14.21% 0.00% coqc coqc [.] camlVernacinterp__interp_gen_1036
+ 14.16% 2.78% coqc coqc [.] camlStdlib__map__mem_288
vs, on the first 60 seconds of the slow dev (recording data for more than that is too much for perf report
on my laptop):
Samples: 232K of event 'cpu-clock:uhpppH', Event count (approx.): 58173250000
Children Self Command Shared Object Symbol
+ 94.81% 1.09% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__extract_type_2708
+ 94.81% 0.24% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__extract_type_app_2709
+ 94.81% 0.21% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__fun_4736
+ 94.59% 0.54% coqc coqc [.] camlStdlib__list__fold_right_278
+ 92.04% 0.00% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__extract_type_cons_2713
+ 91.45% 0.00% coqc [unknown] [.] 0x0000000000000001
+ 90.76% 0.00% coqc [unknown] [.] 0x000000000000000d
+ 90.00% 0.00% coqc [unknown] [.] 0x8d493672087e3b4d
+ 57.03% 14.33% coqc coqc [.] camlConstr__map_with_binders_2478
+ 56.35% 10.53% coqc coqc [.] camlCArray__map_1659
+ 55.52% 0.41% coqc coqc [.] camlStdlib__array__map_154
+ 55.25% 0.20% coqc coqc [.] camlConstr__map_branches_with_binders_2270
+ 55.04% 0.64% coqc coqc [.] camlConstr__map_under_context_with_binders_2260
+ 49.55% 0.32% coqc coqc [.] camlRetyping__sort_family_of_1961
+ 38.41% 0.18% coqc coqc [.] camlRetyping__concl_of_arity_1040
+ 23.74% 0.34% coqc coqc [.] camlReductionops__red_of_state_red_3904
+ 19.93% 0.02% coqc coqc [.] camlReductionops__aux_3127
+ 11.28% 0.54% coqc coqc [.] camlRetyping__type_of_1541
+ 8.90% 0.31% coqc coqc [.] camlEnviron__lookup_constant_key_1446
+ 8.89% 0.70% coqc coqc [.] camlStdlib__array__copy_116
+ 8.41% 0.08% coqc coqc [.] camlRetyping__type_of_global_reference_knowing_parameters_1543+ 8.38% 1.17% coqc coqc [.] caml_array_sub
+ 8.19% 0.43% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Mlutil__expand_2536
+ 7.74% 0.28% coqc coqc [.] camlInductive__type_of_inductive_gen_1677
+ 7.32% 0.25% coqc coqc [.] camlInductive__instantiate_universes_1657
+ 6.68% 0.47% coqc extraction_plugin.cmxs [.] camlExtraction_plugin__Extraction__mlt_env_2714
+ 6.03% 0.22% coqc coqc [.] camlInductive__make_1626
+ 6.01% 0.14% coqc coqc [.] camlRetyping__anomaly_on_error_483
+ 5.49% 0.56% coqc coqc [.] camlStdlib__map__find_opt_280
what file is this exactly?
Flame graph for the fast one:
perf-extract-unsaturated-solinas-fast.svg
what file is this exactly?
src/ExtractionOCaml/unsaturated_solinas.v
Invocation was
COQPATH="$(pwd)/coqprime/src:$(pwd)/rupicola/bedrock2/deps/coqutil/src:$(pwd)/rupicola/src:$(pwd)/rupicola/bedrock2/bedrock2/src:$(pwd)/rupicola/bedrock2/compiler/src:$(pwd)/rupicola/bedrock2/deps/riscv-coq/src:$(pwd)/rewriter/src" command time -f "src/ExtractionOCaml/unsaturated_solinas.ml (real: %e, user: %U, sys: %S, mem: %M ko)" perf record -g -o /tmp/perf-extract-unsaturated-solinas-timeout-60.data timeout 60 "coqc" -q -w +implicit-core-hint-db,+implicits-in-term,+non-reversible-notation,+deprecated-intros-until-0,+deprecated-focus,+unused-intro-pattern,+variable-collision,+omega-is-deprecated,+deprecated-instantiate-syntax,+non-recursive,unsupported-attributes -w -notation-overridden,-undeclared-scope,-deprecated-hint-rewrite-without-locality,-deprecated-hint-constr,-fragile-hint-constr,-native-compiler-disabled,-ambiguous-paths,-masking-absolute-name "-w" "-deprecated-native-compiler-option" "-native-compiler" "ondemand" -R src Crypto src/ExtractionOCaml/unsaturated_solinas.v
I'm not used to perfing fiat-crypto too much but right now I'm seeing very suspicious stuff in the profile, before reaching that file
specifically, Coq passing its time in the generation of inductive schemes
in src/Assembly/Symbolic.v to be precise
Plausibly this is because we have an inductive with almost 30 constructors in Symbolic.v, and decidable equality schemes are cubic last I checked?
way worse than cubic if I read correctly
Which part of Symbolic.v are you seeing as slow? Do you see the same issues in
Require Import Coq.NArith.NArith.
Definition idx := N.
Local Set Decidable Equality Schemes.
Definition symbol := N.
Class OperationSize := operation_size : N.
Section S.
Implicit Type s : OperationSize.
Variant op := old s (_:symbol) | const (_ : Z) | add s | addcarry s | subborrow s | addoverflow s | neg s | shl s | shr s | sar s | rcr s | and s | or s | xor s | slice (lo sz : N) | mul s | set_slice (lo sz : N) | selectznz | iszero (* | ... *)
| addZ | mulZ | negZ | shlZ | shrZ | andZ | orZ | xorZ | addcarryZ s | subborrowZ s.
End S.
? (All of the scheme generation feels relatively fast to me, though (on the branch but not on master) functor application is kind-of slow.)
btw, if you want a target that makes just one of the extracted things and its dependencies and no more (sorry for being so late giving this to you), you can run make src/ExtractionOCaml/unsaturated_solinas.ml SKIP_BEDROCK2=1
Here's the flamegraph for the 60s of the slow extraction:
perf-extract-unsaturated-solinas-slow-timeout-60.svg
Pierre-Marie Pédrot said:
way worse than cubic if I read correctly
https://github.com/coq/coq/issues/8517 says cubic based on measurements I took a while ago
Oh, wait, no, that just says that you told me it was cubic
There are two different issues at play here with scheme generation.
it's essentially cubic in the number of constructors, but that's not the problem here
rather, the algorithm to compute the dependencies on other inductive scheme is pretty much arbitrary
I suspect that in this case it's just crawling through huge terms, including proofs
Which inductive scheme generation is slow in Symbolic.v? I don't recall any non-proof non-functor hotspots in that file
I interrupted this file after waiting for more than 40 minutes
Seems to be
Record symbolic_state := { dag_state :> dag ; symbolic_reg_state :> reg_state ; symbolic_flag_state :> flag_state ; symbolic_mem_state :> mem_state }.
I confirm that unsetting the equality scheme generation flag makes this instantatenous
That is very strange. In https://github.com/mit-plv/fiat-crypto/commit/3f55b840657cc952007744a2c5ba38e25e17469a I added a bunch of functor instantiations to Symbolic.v which bumped the time on my machine from 15s to 25s and from 673MB RAM to 1.3GB RAM. This is on something between Coq 8.11 and 8.15. You can check our CI for timing data of this file across the versions of Coq (the end of each build step displays a table of timing information). For it to be taking 40 minutes on your machine means that something very strange is going on...
The commit message for https://github.com/mit-plv/fiat-crypto/commit/5342ad973e465642645f25b072cca4051f86291e, btw, includes a table which says how long it should take to build each file
@Pierre-Marie Pédrot any updates, or insights about why extract_type
might be taking so long? (Should I open a Coq issue?)
@Pierre-Marie Pédrot Here's a reduced example showing that Recursive Extraction
is exponential: https://github.com/coq/coq/issues/16172 The hotspot in perf is different, though, not sure what to make of that.
wdym? extract_type is the leading issue in you perf profile
(I haven't managed to compile fiat-crypto with your patch yet, I can still reproduce the Symbolic.v explosion and it kills my laptop after some time)
per line timings for symbolic.v in https://gitlab.com/coq/coq/-/jobs/2396228741 show nothing interesting
there are no recent ones because fiat-crypto is broken in the bench snce last month
Last updated: Sep 30 2023 at 06:01 UTC