Stream: Coq devs & plugin devs

Topic: STM woes


view this post on Zulip Matthieu Sozeau (May 24 2022 at 08:14):

I'm hitting very strange failures in the test-suite related to stm / -async-proofs on in this PR on using a new algorithm to check universes https://github.com/coq/coq/pull/16022/checks?check_run_id=6564335792
What are common sources of incompatibility with the async-mode? I suppose it can be related to the use of side-effects (here we use a marking scheme for nodes in a graph, much like in acyclicGraph.ml) ? @Enrico Tassi maybe you have some idea?

view this post on Zulip Gaëtan Gilbert (May 24 2022 at 08:35):

do you mean https://gitlab.com/coq/coq/-/jobs/2496277480 ? similar errors happen randomly with async, I don't know why

view this post on Zulip Matthieu Sozeau (May 24 2022 at 08:45):

Yes I mean that.

view this post on Zulip Matthieu Sozeau (May 24 2022 at 08:46):

Also, I cannot reproduce it locally

view this post on Zulip Gaëtan Gilbert (May 24 2022 at 09:07):

rerunning got another similar error https://gitlab.com/coq/coq/-/jobs/2497747558#L2480

view this post on Zulip Pierre-Marie Pédrot (May 24 2022 at 09:20):

This is a typical issue with schedule order of universe constraints.

view this post on Zulip Pierre-Marie Pédrot (May 24 2022 at 09:20):

Depending on the order in which the constraints are added this might change unification.

view this post on Zulip Pierre-Marie Pédrot (May 24 2022 at 09:20):

(it has been happening on master since the dawn of the async test)

view this post on Zulip Gaëtan Gilbert (May 24 2022 at 09:24):

that makes no sense, the error in the rerun is

Error: Illegal application:
The term "@eq" of type "forall A : Type, A -> A -> Prop"
cannot be applied to the term
 "A" : "Set"
This term has type "Set" which should be coercible to
"Type".

when is Set not <= the Type of eq? note eq is not template poly

view this post on Zulip Pierre-Marie Pédrot (May 24 2022 at 09:27):

This precise error is quite reproducible, it keeps happening.

view this post on Zulip Enrico Tassi (May 24 2022 at 10:24):

I can only agree with what was written here.

view this post on Zulip Enrico Tassi (May 24 2022 at 10:24):

I have no other insight

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:03):

If this error happens it seems it would mean there are cases where some check happen between the time some universe u is added to the graph and the time we add the Set <= u constraint. Does this mean that we should enter a kind of critical section when we do that to be safe?

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:05):

I guess I don't have (anymore, if ever) a good mental model of how the STM deals with delaying of universe checking

view this post on Zulip Enrico Tassi (May 25 2022 at 12:06):

My understanding what that the constraint was hardcoded. Is it not true?
I don't think that, in the presence of worker delegation, there is any race.
But data is marshalled, maybe some code uses == too much (wild guess).

view this post on Zulip Enrico Tassi (May 25 2022 at 12:06):

I mean Unmarshal(Marshal(Set)) != Set

view this post on Zulip Enrico Tassi (May 25 2022 at 12:06):

or the like

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 12:07):

Matthieu Sozeau said:

If this error happens it seems it would mean there are cases where some check happen between the time some universe u is added to the graph and the time we add the Set <= u constraint. Does this mean that we should enter a kind of critical section when we do that to be safe?

nothing about this error makes sense to me
consider we are in some random file
Init.Logic has already been required and its contraints are present in the graph

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:07):

The constraint is hardcoded indeed, each time we introduce a universe we add a constraint

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:07):

But it's also true that the graph uses == a lot

view this post on Zulip Enrico Tassi (May 25 2022 at 12:07):

Is the algorithm use == is a non-opportunistic way?

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:07):

Yes

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:08):

We use == on canonical nodes everywhere.

view this post on Zulip Enrico Tassi (May 25 2022 at 12:08):

I mean, even conversion does use ==, but also tries = in case data was unmarshaled (from a vo)

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 12:09):

canonical nodes don't get marshalled do they?

view this post on Zulip Enrico Tassi (May 25 2022 at 12:09):

Do you re-canonicalize after unmarshaling?

view this post on Zulip Enrico Tassi (May 25 2022 at 12:09):

(the STM does not, but maybe you did after Require, hence the bug)

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:10):

The environment contains a UGraph, so I suppose it gets marshalled and unmarshalled, no?

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 12:10):

canonical nodes only live in acyclic graph
we don't marshal that

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 12:10):

do we marshal the env?

view this post on Zulip Enrico Tassi (May 25 2022 at 12:11):

no, we don't

view this post on Zulip Enrico Tassi (May 25 2022 at 12:11):

we re-build it

view this post on Zulip Enrico Tassi (May 25 2022 at 12:11):

(but the STM does marshal it)

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 12:11):

the env is in the summary so I guess it gets marshalled for async
but I don't see how that leads to issues in the graph, I'm not convinced that it's related

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 12:12):

canonical_nodes are obtained by doing a Map.find such that physical equality should still be fine on them

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:13):

Wait, I don't understand your answer Enrico, we don't marshal it but the STM does marshal it?

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 12:13):

we don't marshal it for .vo but we do for async

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:14):

Ok

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:17):

It's true that canonical nodes are the result of a Map.find so the functions should be stable even if the exact pointers change, as we don't maintain canonical_node adresses anywhere else, I think.

view this post on Zulip Enrico Tassi (May 25 2022 at 12:17):

Indeed, the STM marshals the summary

view this post on Zulip Enrico Tassi (May 25 2022 at 12:18):

Sorry if my guess is off. But one way of not finding something that is always there is to search for it in a wrong way (eg == instead of =)

view this post on Zulip Enrico Tassi (May 25 2022 at 12:18):

I can't imagine any other source for this bug

view this post on Zulip Matthieu Sozeau (May 25 2022 at 12:19):

Yep, I guess we should check that

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 12:21):

it's a pretty rare bug tbh, in 20 pages of pipelines (15 / page) I found just https://gitlab.com/coq/coq/-/jobs/2428121408, some timeouts and a couple segfaults (https://gitlab.com/coq/coq/-/jobs/1995832064 and https://gitlab.com/coq/coq/-/jobs/1981695288)

view this post on Zulip Enrico Tassi (May 25 2022 at 12:21):

We could also print a more informative message (e.g. the whole graph, so that we see if it empty or not). I mean, if I'm right the constraint would be displayed, but not found

view this post on Zulip Enrico Tassi (May 25 2022 at 12:22):

I know it is rare, but maybe it's because it is that weird that I do remember it ;-)

view this post on Zulip Enrico Tassi (May 25 2022 at 12:22):

This is also why I did not sped a lot of time looking in to it

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 19:17):

I have invented perhaps my greatest hack yet "ultradebug" https://github.com/coq/coq/commit/9f452746efc6698cde6d5ae4580e8dd3aa53b782
it dumps vernacstate + the current command to a file when a command fails
ran it over 70 times until I got a reproduction https://gitlab.com/SkySkimmer/coq/-/jobs/2506693702 (error in https://github.com/coq/coq/blob/f2031737b6ffed2a687f2bb8b27a12924f96619e/theories/Numbers/DecimalNat.v#L86 while checking the type) (50 times in this pipeline but there were a couple before it)
the first attempt to use it locally failed and I almost gave up, but I found that Future.chain forces and raises on NotHere
commenting out that force worked and I could reproduce in the debugger
as we might expect it fails in Reduction.check_leq on Set and Coq.Init.Logic.9 (also known as eq.u0)
the universe graph looks normal

(ocd) p univs
univs: UGraph.t =
  Set < Coq.setoid_ring.BinList.1
      < Coq.Numbers.BinNums.1
      < Coq.Init.Datatypes.1
...
      < Coq.Init.Logic.9
...

finally we reach https://github.com/coq/coq/blob/f2031737b6ffed2a687f2bb8b27a12924f96619e/lib/acyclicGraph.ml#L650
and repr_node u (where u is Set) produces....

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 19:17):

  {canon = <abstr>; ltle = <abstr>; gtge = <abstr>;
   rank = 1000000; klvl = 0; ilvl = -2568; status = Visited}

VISITED?!

view this post on Zulip Maxime Dénès (May 25 2022 at 19:54):

Ah ah, nice methodology

view this post on Zulip Matthieu Sozeau (May 25 2022 at 20:17):

Nice hack ! So the (mutable) marks are causing the problem here, but in principle we can not run concurrently two check_leq's (or one check_leq and one enforce) on the same graph, or can we?

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 20:22):

even if we did I don't see how the Visited would escape to be marshalled

view this post on Zulip Enrico Tassi (May 25 2022 at 20:24):

Maybe I have an idea, hang tight

view this post on Zulip Enrico Tassi (May 25 2022 at 20:26):

So master has a graph (which I now learn has a mutable part). Master decides to delegate a proof (and an env containing that graph). A thread takes care of delegation (spawning, marshalling, receiving the result). The thread does not run immediately (but has a pointer to the right graph). The main thread continues on the document, does more type checking and that involves this test which modifies the graph in place. The thread is scheduled, and marshalls to the worker a graph with Visited. The worker does some type checking under a wrong invariant.

view this post on Zulip Enrico Tassi (May 25 2022 at 20:29):

Is this plausible? (I don't really know this check_leq code)

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 20:30):

plausible imo

view this post on Zulip Enrico Tassi (May 25 2022 at 20:33):

A way to check the hypothesis:

I've no clue if a better fix is possible, I don't know if the invariant can be weakened and the algorithm made "concurrent".

view this post on Zulip Enrico Tassi (May 25 2022 at 20:34):

In any case, kudos to you Ultradebugger

view this post on Zulip Enrico Tassi (May 25 2022 at 20:35):

So after all the CI-job would have discovered this problem when the new algorithm was introduced.. if only it was there back then.

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 20:40):

with the randomness I don't know if we would have known what change caused the failures

view this post on Zulip Enrico Tassi (May 25 2022 at 20:46):

Sure this is really nasty bug, but if you managed to fiund the issue now you would have also found it back then, maybe even without ultradebug. If a job starts to fail in a crazy way, even only once a day, well you don't have many merged PRs / changes to double check.

view this post on Zulip Enrico Tassi (May 25 2022 at 20:48):

When the job was finally put in place, we had a few recurring issues. Together with Maxime we fixed those we could reproduce, but this one (or the segfaults in reals/) were out of my league.

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 21:00):

it's closer to once a month than once a day though

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 21:01):

IIRC

view this post on Zulip Gaëtan Gilbert (May 25 2022 at 21:06):

PS the future in the proof_object has name "unnamed" if that helps tell what's going on with workers

view this post on Zulip Enrico Tassi (May 25 2022 at 21:27):

I'll keep this riddle for Monday ;-)


Last updated: Dec 07 2023 at 14:02 UTC