Stream: Coq users

Topic: Using profiler completely changes performance profile?


view this post on Zulip MackieLoeffel (Feb 26 2024 at 08:56):

Hi, I started playing around with the profiler in Coq 8.19.0 in RefinedC and I noticed that enabling the profiler can completely change the performance profile of coqc. Is this expected or is there anything one can do about this? As it currently stands, this makes the profiler basically useless for me.

Concretely, I encountered the following example.
Without profiling:

        coqc examples/scheduler/src/proofs/fdsched/generated_proof_fds_init.{glob,vo}
Finished transaction in 1.256 secs (1.256u,0.s) (successful)
Finished transaction in 10.321 secs (10.3u,0.02s) (successful)
Finished transaction in 3.149 secs (3.146u,0.002s) (successful)
Finished transaction in 0.381 secs (0.381u,0.s) (successful)
Finished transaction in 1.366 secs (1.364u,0.001s) (successful)
Finished transaction in 0.481 secs (0.481u,0.s) (successful)
Finished transaction in 1.129 secs (1.129u,0.s) (successful)
    34.38s  99%

With profiling:

        coqc examples/scheduler/src/proofs/fdsched/generated_proof_fds_init.{glob,vo}
Finished transaction in 1.322 secs (1.255u,0.067s) (successful)
Finished transaction in 12.491 secs (12.071u,0.419s) (successful)
Finished transaction in 3.789 secs (3.622u,0.166s) (successful)
Finished transaction in 0.596 secs (0.559u,0.036s) (successful)
Finished transaction in 22.411 secs (18.686u,3.72s) (successful)
Finished transaction in 0.597 secs (0.554u,0.043s) (successful)
Finished transaction in 18.288 secs (15.266u,3.019s) (successful)
    80.50s  99%

Note that the 5ht and 7th Time commands get an order of magnitude slower from profiling.

For reference, this is the file that is compiled: (If someone is interested in replicating this locally, let me know.)

From refinedc.typing Require Import typing.
From refinedc.examples.scheduler.src.fdsched Require Import generated_code.
From refinedc.examples.scheduler.src.fdsched Require Import generated_spec.
From refinedc.examples.scheduler.src.fdsched Require Import message_extra.
From caesium Require Import builtins_specs.
From refinedc.examples.scheduler.src.fdsched Require Import priority_extra.
From refinedc.examples.scheduler.src.fdsched Require Import scheduler_extra.
From refinedc.examples.scheduler.src.fdsched Require Import fdsched_extra.
From refinedc.examples.scheduler.src.fdsched Require Import fdsched_extra.
From refinedc.typing Require Import malloc.
From refinedc.examples.scheduler.src.fdsched Require Import fdsched_extra.
Set Default Proof Using "Type".


(* Generated from [examples/scheduler/src/fdsched.c]. *)
Section proof_fds_init.
  Context `{!typeG Σ} `{!globalG Σ}.
  Context `{!PacketArrivals}.
  Context `{!mallocG Σ}.

  (* Typing proof for [fds_init]. *)
  Lemma type_fds_init (global_nop_callback global_prio_level_init : loc) :
    global_nop_callback  global_nop_callback @ function_ptr type_of_nop_callback -
    global_prio_level_init  global_prio_level_init @ function_ptr type_of_prio_level_init -
    typed_function (impl_fds_init global_nop_callback global_prio_level_init) type_of_fds_init.
  Proof.
    Local Open Scope printing_sugar.
    start_function "fds_init" ([fds_refn p]) => arg_fds local_prio local_typ.
    prepare_parameters (fds_refn p).
    split_blocks ((
      <[ "#4" :=
         priority : nat,
        local_typ  uninit (it_layout i32) 
        local_prio  (priority @ (int (i32))) 
        arg_fds  (p @ (&own (struct (struct_fd_scheduler) [@{type} uninit (mk_array_layout i32 16) ; (0) @ (int (u64)) ; struct (struct_npfp_scheduler) [@{type} array (layout_of struct_callback) (replicate (Z.to_nat num_msg_types) 0%nat `at_type` callback_t) ; array_p (struct_message_queue) (replicate priority (@nil message_data) `at_type`  message_queue) (Z.to_nat num_priorities) ; (replicate (Z.to_nat num_priorities) false) @ (prio_bitmap_t) ] ]))) 
        priority <= num_priorities
    ]> $
      <[ "#1" :=
         i : nat,
        local_prio  uninit (it_layout i32) 
        local_typ  (i @ (int (i32))) 
        arg_fds  (p @ (&own (struct (struct_fd_scheduler) [@{type} uninit (mk_array_layout i32 16) ; (0) @ (int (u64)) ; struct (struct_npfp_scheduler) [@{type} array_p (layout_of struct_callback) (replicate i 0%nat `at_type` callback_t) (Z.to_nat num_msg_types) ; uninit (mk_array_layout struct_message_queue (Z.to_nat num_priorities)) ; (replicate (Z.to_nat num_priorities) false) @ (prio_bitmap_t) ] ]))) 
        i <= num_msg_types
    ]> $
      
    )%I : gmap label (iProp Σ)) (
      @nil Prop
    ).
    - Time repeat liRStep; liShow.
      all: print_typesystem_goal "fds_init" "#0".
    - Time repeat liRStep; liShow.
      all: print_typesystem_goal "fds_init" "#4".
    - Time repeat liRStep; liShow.
      all: print_typesystem_goal "fds_init" "#1".
    Unshelve. all: unshelve_sidecond; sidecond_hook; prepare_sideconditions; normalize_and_simpl_goal; try solve_goal; unsolved_sidecond_hook.
    + Time by apply list_subequiv_split; solve_goal.
    + Time have Hprio: priority = (Z.to_nat num_priorities); solve_goal.
    + Time by apply list_subequiv_split; solve_goal.
    + Time have Hi: i = (Z.to_nat num_msg_types); solve_goal.
    all: print_sidecondition_goal "fds_init".
    Unshelve. all: try done; try apply: inhabitant; print_remaining_shelved_goal "fds_init".
  Qed.
End proof_fds_init.

view this post on Zulip Janno (Feb 26 2024 at 09:33):

https://github.com/coq/coq/issues/18489 and https://github.com/coq/coq/pull/18483 are probably relevant here

view this post on Zulip Gaëtan Gilbert (Feb 26 2024 at 09:39):

yes, you can try eg env COQ_PROFILE_COMPONENTS=command,parse_command coqc -profile ...
the profile will still have all components in the "subtimes" annotation but only command/parse_command (and process/init) will have full events which AFAICT is the part that gets really slow

view this post on Zulip Gaëtan Gilbert (Feb 26 2024 at 09:39):

then you can look at the "subtimes" for possibly interesting components to re-enable

view this post on Zulip MackieLoeffel (Feb 26 2024 at 10:40):

Thanks, this makes the two runs indeed a lot closer. Is there a good way to visualize the subtimes, e.g. in https://www.speedscope.app/ ? Or do I have to look at the json manually?

view this post on Zulip Gaëtan Gilbert (Feb 26 2024 at 10:52):

https://ui.perfetto.dev/

view this post on Zulip MackieLoeffel (Feb 26 2024 at 13:25):

Thanks! Quick follow up, is there documentation somewhere what the components mean?

E.g. in my example, I have the following:

Conversion 5.29E+05 us, 41427 calls
intern 1.44E+03 us, 76 calls
interp 1.06E+07 us, 1 call
pretyping 5.06E+06 us, 8644 calls
synterp 2.15 us, 1 call
typeclass search 4.22E+06 us, 547 calls
unfreeze_full_state 5.96 us, 1 call
unification 4.53E+05 us, 61975 calls

I can guess what typeclass search, unification and Conversion are, but what are pretyping and interp?

view this post on Zulip Gaëtan Gilbert (Feb 26 2024 at 13:38):

there's no documentation, they're pretty adhoc
you can grep the code of coq for NewProfile.profile "$thestring"

intepreting a user term eg 0 + 0 has 2 phases:

executing a command is divided in 2 phases (ignoring STM nonsense)

view this post on Zulip MackieLoeffel (Feb 26 2024 at 13:39):

Makes sense, thanks for the explanation!


Last updated: Jun 23 2024 at 04:03 UTC