I'm hitting very strange failures in the test-suite related to stm / -async-proofs on
in this PR on using a new algorithm to check universes https://github.com/coq/coq/pull/16022/checks?check_run_id=6564335792
What are common sources of incompatibility with the async-mode? I suppose it can be related to the use of side-effects (here we use a marking scheme for nodes in a graph, much like in acyclicGraph.ml) ? @Enrico Tassi maybe you have some idea?
do you mean https://gitlab.com/coq/coq/-/jobs/2496277480 ? similar errors happen randomly with async, I don't know why
Yes I mean that.
Also, I cannot reproduce it locally
rerunning got another similar error https://gitlab.com/coq/coq/-/jobs/2497747558#L2480
This is a typical issue with schedule order of universe constraints.
Depending on the order in which the constraints are added this might change unification.
(it has been happening on master since the dawn of the async test)
that makes no sense, the error in the rerun is
Error: Illegal application:
The term "@eq" of type "forall A : Type, A -> A -> Prop"
cannot be applied to the term
"A" : "Set"
This term has type "Set" which should be coercible to
"Type".
when is Set not <= the Type of eq? note eq is not template poly
This precise error is quite reproducible, it keeps happening.
I can only agree with what was written here.
I have no other insight
If this error happens it seems it would mean there are cases where some check happen between the time some universe u is added to the graph and the time we add the Set <= u constraint. Does this mean that we should enter a kind of critical section when we do that to be safe?
I guess I don't have (anymore, if ever) a good mental model of how the STM deals with delaying of universe checking
My understanding what that the constraint was hardcoded. Is it not true?
I don't think that, in the presence of worker delegation, there is any race.
But data is marshalled, maybe some code uses == too much (wild guess).
I mean Unmarshal(Marshal(Set)) != Set
or the like
Matthieu Sozeau said:
If this error happens it seems it would mean there are cases where some check happen between the time some universe u is added to the graph and the time we add the Set <= u constraint. Does this mean that we should enter a kind of critical section when we do that to be safe?
nothing about this error makes sense to me
consider we are in some random file
Init.Logic has already been required and its contraints are present in the graph
The constraint is hardcoded indeed, each time we introduce a universe we add a constraint
But it's also true that the graph uses == a lot
Is the algorithm use ==
is a non-opportunistic way?
Yes
We use ==
on canonical nodes everywhere.
I mean, even conversion does use ==
, but also tries =
in case data was unmarshaled (from a vo)
canonical nodes don't get marshalled do they?
Do you re-canonicalize after unmarshaling?
(the STM does not, but maybe you did after Require, hence the bug)
The environment contains a UGraph, so I suppose it gets marshalled and unmarshalled, no?
canonical nodes only live in acyclic graph
we don't marshal that
do we marshal the env?
no, we don't
we re-build it
(but the STM does marshal it)
the env is in the summary so I guess it gets marshalled for async
but I don't see how that leads to issues in the graph, I'm not convinced that it's related
canonical_nodes are obtained by doing a Map.find such that physical equality should still be fine on them
Wait, I don't understand your answer Enrico, we don't marshal it but the STM does marshal it?
we don't marshal it for .vo but we do for async
Ok
It's true that canonical nodes are the result of a Map.find so the functions should be stable even if the exact pointers change, as we don't maintain canonical_node
adresses anywhere else, I think.
Indeed, the STM marshals the summary
Sorry if my guess is off. But one way of not finding something that is always there is to search for it in a wrong way (eg == instead of =)
I can't imagine any other source for this bug
Yep, I guess we should check that
it's a pretty rare bug tbh, in 20 pages of pipelines (15 / page) I found just https://gitlab.com/coq/coq/-/jobs/2428121408, some timeouts and a couple segfaults (https://gitlab.com/coq/coq/-/jobs/1995832064 and https://gitlab.com/coq/coq/-/jobs/1981695288)
We could also print a more informative message (e.g. the whole graph, so that we see if it empty or not). I mean, if I'm right the constraint would be displayed, but not found
I know it is rare, but maybe it's because it is that weird that I do remember it ;-)
This is also why I did not sped a lot of time looking in to it
I have invented perhaps my greatest hack yet "ultradebug" https://github.com/coq/coq/commit/9f452746efc6698cde6d5ae4580e8dd3aa53b782
it dumps vernacstate + the current command to a file when a command fails
ran it over 70 times until I got a reproduction https://gitlab.com/SkySkimmer/coq/-/jobs/2506693702 (error in https://github.com/coq/coq/blob/f2031737b6ffed2a687f2bb8b27a12924f96619e/theories/Numbers/DecimalNat.v#L86 while checking the type) (50 times in this pipeline but there were a couple before it)
the first attempt to use it locally failed and I almost gave up, but I found that Future.chain forces and raises on NotHere
commenting out that force worked and I could reproduce in the debugger
as we might expect it fails in Reduction.check_leq on Set
and Coq.Init.Logic.9
(also known as eq.u0)
the universe graph looks normal
(ocd) p univs
univs: UGraph.t =
Set < Coq.setoid_ring.BinList.1
< Coq.Numbers.BinNums.1
< Coq.Init.Datatypes.1
...
< Coq.Init.Logic.9
...
finally we reach https://github.com/coq/coq/blob/f2031737b6ffed2a687f2bb8b27a12924f96619e/lib/acyclicGraph.ml#L650
and repr_node u
(where u
is Set
) produces....
{canon = <abstr>; ltle = <abstr>; gtge = <abstr>;
rank = 1000000; klvl = 0; ilvl = -2568; status = Visited}
VISITED?!
Ah ah, nice methodology
Nice hack ! So the (mutable) marks are causing the problem here, but in principle we can not run concurrently two check_leq
's (or one check_leq
and one enforce
) on the same graph, or can we?
even if we did I don't see how the Visited would escape to be marshalled
Maybe I have an idea, hang tight
So master has a graph (which I now learn has a mutable part). Master decides to delegate a proof (and an env containing that graph). A thread takes care of delegation (spawning, marshalling, receiving the result). The thread does not run immediately (but has a pointer to the right graph). The main thread continues on the document, does more type checking and that involves this test which modifies the graph in place. The thread is scheduled, and marshalls to the worker a graph with Visited. The worker does some type checking under a wrong invariant.
Is this plausible? (I don't really know this check_leq code)
plausible imo
A way to check the hypothesis:
I've no clue if a better fix is possible, I don't know if the invariant can be weakened and the algorithm made "concurrent".
In any case, kudos to you Ultradebugger
So after all the CI-job would have discovered this problem when the new algorithm was introduced.. if only it was there back then.
with the randomness I don't know if we would have known what change caused the failures
Sure this is really nasty bug, but if you managed to fiund the issue now you would have also found it back then, maybe even without ultradebug. If a job starts to fail in a crazy way, even only once a day, well you don't have many merged PRs / changes to double check.
When the job was finally put in place, we had a few recurring issues. Together with Maxime we fixed those we could reproduce, but this one (or the segfaults in reals/) were out of my league.
it's closer to once a month than once a day though
IIRC
PS the future in the proof_object has name "unnamed" if that helps tell what's going on with workers
I'll keep this riddle for Monday ;-)
Last updated: Dec 07 2023 at 14:02 UTC