I'm hitting very strange failures in the test-suite related to stm / `-async-proofs on`

in this PR, which uses a new algorithm to check universes: https://github.com/coq/coq/pull/16022/checks?check_run_id=6564335792

What are common sources of incompatibility with async mode? I suppose it could be related to the use of side effects (here we use a marking scheme for nodes in a graph, much like in acyclicGraph.ml)? @Enrico Tassi, maybe you have some idea?

do you mean https://gitlab.com/coq/coq/-/jobs/2496277480 ? similar errors happen randomly with async, I don't know why

Yes I mean that.

Also, I cannot reproduce it locally

rerunning got another similar error https://gitlab.com/coq/coq/-/jobs/2497747558#L2480

This is a typical issue with the scheduling order of universe constraints.

Depending on the order in which the constraints are added this might change unification.

(it has been happening on master since the dawn of the async test)

that makes no sense, the error in the rerun is

```
Error: Illegal application:
The term "@eq" of type "forall A : Type, A -> A -> Prop"
cannot be applied to the term
"A" : "Set"
This term has type "Set" which should be coercible to
"Type".
```

when is Set not <= the Type of eq? note eq is not template poly

This precise error is quite reproducible, it keeps happening.

I can only agree with what was written here.

I have no other insight

If this error happens, it seems it would mean there are cases where some check happens between the time some universe u is added to the graph and the time we add the Set <= u constraint. Does this mean that we should enter a kind of critical section when we do that, to be safe?

I guess I don't have (anymore, if ever) a good mental model of how the STM deals with delaying of universe checking

My understanding was that the constraint was hardcoded. Is that not true?

I don't think that, in the presence of worker delegation, there is any race.

But data is marshalled, maybe some code uses == too much (wild guess).

I mean `Unmarshal(Marshal(Set)) != Set`

or the like
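Enrico's `Unmarshal(Marshal(Set)) != Set` concern can be demonstrated in a few lines. This is a generic OCaml sketch, not Coq code: a Marshal round-trip preserves structural equality (`=`) but breaks physical equality (`==`) on boxed values.

```ocaml
(* Generic sketch (not Coq code): a Marshal round-trip rebuilds the
   heap structure, so structural equality (=) survives but physical
   equality (==) does not for boxed values. *)
let roundtrip (x : 'a) : 'a =
  Marshal.from_string (Marshal.to_string x []) 0

let () =
  let set = Some 0 in            (* stand-in for a boxed universe level *)
  let set' = roundtrip set in
  assert (set' = set);           (* same contents *)
  assert (not (set' == set))     (* different pointer after unmarshalling *)
```

So any lookup that relies on `==` alone can fail to find a value that is structurally present after unmarshalling.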

Matthieu Sozeau said:

> If this error happens, it seems it would mean there are cases where some check happens between the time some universe u is added to the graph and the time we add the Set <= u constraint. Does this mean that we should enter a kind of critical section when we do that, to be safe?

nothing about this error makes sense to me

consider we are in some random file

Init.Logic has already been required and its constraints are present in the graph

The constraint is hardcoded indeed, each time we introduce a universe we add a constraint

But it's also true that the graph uses == a lot

Does the algorithm use `==` in a non-opportunistic way?

Yes

We use `==` on canonical nodes everywhere.

I mean, even conversion does use `==`, but it also tries `=` in case data was unmarshalled (from a vo)

canonical nodes don't get marshalled do they?

Do you re-canonicalize after unmarshaling?

(the STM does not, but maybe you did after Require, hence the bug)

The environment contains a UGraph, so I suppose it gets marshalled and unmarshalled, no?

canonical nodes only live in acyclic graph

we don't marshal that

do we marshal the env?

no, we don't

we re-build it

(but the STM does marshal it)

the env is in the summary so I guess it gets marshalled for async

but I don't see how that leads to issues in the graph, I'm not convinced that it's related

canonical_nodes are obtained by doing a Map.find, so physical equality should still be fine on them

Wait, I don't understand your answer Enrico, we don't marshal it but the STM does marshal it?

we don't marshal it for .vo but we do for async

Ok

It's true that canonical nodes are the result of a Map.find, so the functions should be stable even if the exact pointers change, as we don't maintain `canonical_node` addresses anywhere else, I think.
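As a sanity check of that claim, here is a minimal OCaml sketch with a plain stdlib Map (not the actual graph code): `Map.find` returns the stored value without copying, so repeated lookups of the same key are physically equal within one map.

```ocaml
(* Minimal sketch with a stdlib Map (not the actual graph code):
   Map.find returns the value as stored, without copying, so two
   lookups of the same key are physically equal (==). *)
module IMap = Map.Make (Int)

let () =
  let node = ref "canonical" in          (* stand-in for a canonical_node *)
  let m = IMap.add 0 node IMap.empty in
  let a = IMap.find 0 m in
  let b = IMap.find 0 m in
  assert (a == b)                        (* same pointer within one map *)
```

This only holds as long as the map itself is not rebuilt from a marshalled copy; after unmarshalling, the stored pointers are new ones.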

Indeed, the STM marshals the summary

Sorry if my guess is off. But one way of not finding something that is always there is to search for it in a wrong way (eg == instead of =)

I can't imagine any other source for this bug

Yep, I guess we should check that

it's a pretty rare bug tbh, in 20 pages of pipelines (15 / page) I found just https://gitlab.com/coq/coq/-/jobs/2428121408, some timeouts and a couple segfaults (https://gitlab.com/coq/coq/-/jobs/1995832064 and https://gitlab.com/coq/coq/-/jobs/1981695288)

We could also print a more informative message (e.g. the whole graph, so that we see whether it is empty or not). I mean, if I'm right the constraint would be displayed, but not found.

I know it is rare, but maybe it's because it is that weird that I do remember it ;-)

This is also why I did not spend a lot of time looking into it

I have invented perhaps my greatest hack yet "ultradebug" https://github.com/coq/coq/commit/9f452746efc6698cde6d5ae4580e8dd3aa53b782

it dumps vernacstate + the current command to a file when a command fails

ran it over 70 times until I got a reproduction https://gitlab.com/SkySkimmer/coq/-/jobs/2506693702 (error in https://github.com/coq/coq/blob/f2031737b6ffed2a687f2bb8b27a12924f96619e/theories/Numbers/DecimalNat.v#L86 while checking the type) (50 times in this pipeline but there were a couple before it)

the first attempt to use it locally failed and I almost gave up, but I found that Future.chain forces and raises on NotHere

commenting out that force worked and I could reproduce in the debugger

as we might expect, it fails in Reduction.check_leq on `Set` and `Coq.Init.Logic.9` (also known as eq.u0)

the universe graph looks normal

```
(ocd) p univs
univs: UGraph.t =
Set < Coq.setoid_ring.BinList.1
< Coq.Numbers.BinNums.1
< Coq.Init.Datatypes.1
...
< Coq.Init.Logic.9
...
```

finally we reach https://github.com/coq/coq/blob/f2031737b6ffed2a687f2bb8b27a12924f96619e/lib/acyclicGraph.ml#L650

and `repr_node u`

(where `u`

is `Set`

) produces....

```
{canon = <abstr>; ltle = <abstr>; gtge = <abstr>;
rank = 1000000; klvl = 0; ilvl = -2568; status = Visited}
```

VISITED?!

Ah ah, nice methodology

Nice hack! So the (mutable) marks are causing the problem here, but in principle we cannot run two `check_leq`'s (or one `check_leq` and one `enforce`) concurrently on the same graph, or can we?

even if we did I don't see how the Visited would escape to be marshalled

Maybe I have an idea, hang tight

So master has a graph (which I now learn has a mutable part). Master decides to delegate a proof (and an env containing that graph). A thread takes care of delegation (spawning, marshalling, receiving the result). The thread does not run immediately (but has a pointer to the right graph). The main thread continues on the document, does more type checking, and that involves this test which modifies the graph in place. The thread is scheduled, and marshals to the worker a graph with Visited. The worker does some type checking under a wrong invariant.

Is this plausible? (I don't really know this check_leq code)

plausible imo
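The scenario can be illustrated with a toy mutable node (made-up types, not the real acyclicGraph.ml ones): if the snapshot is marshalled while a traversal's mark is set, the `Visited` status escapes into the copy even though the original graph is cleaned up afterwards.

```ocaml
(* Toy illustration of the suspected race (made-up types, not the real
   acyclicGraph.ml ones). A traversal marks a node, the delegation
   thread marshals a snapshot at that moment, and the mark escapes
   into the copy even though the original is later restored. *)
type status = NoMark | Visited
type node = { name : string; mutable status : status }

let snapshot (n : node) : node =
  Marshal.from_string (Marshal.to_string n []) 0

let () =
  let n = { name = "Set"; status = NoMark } in
  n.status <- Visited;               (* main thread: mid-traversal mark *)
  let copy = snapshot n in           (* delegation thread marshals now *)
  n.status <- NoMark;                (* traversal restores the invariant *)
  assert (n.status = NoMark);        (* master's graph looks fine again *)
  assert (copy.status = Visited)     (* but the worker sees a marked node *)
</antml>```

Note the marshalled copy is independent of the original, so restoring the mark on master cannot fix the worker's snapshot.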

A way to check the hypothesis:

- unmark (restore the invariants) in the worker on reception (hopefully easy to code and test, ugly but still)
- copy on master (but the STM does some work not to copy eagerly the data of tasks... that is a larger change)
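The first option could look roughly like this (a sketch with made-up types, a list standing in for the graph's node table, and a hypothetical `unmark` helper; the real code would iterate over acyclicGraph.ml's internal structure):

```ocaml
(* Sketch of option 1, with made-up types: after unmarshalling a graph
   on the worker side, reset every node's mark before running any
   check, restoring the no-mark invariant. *)
type status = NoMark | Visited
type node = { name : string; mutable status : status }
type graph = node list   (* stand-in for the real node table *)

(* Hypothetical helper: would run on worker reception, before checks. *)
let unmark (g : graph) : unit =
  List.iter (fun n -> n.status <- NoMark) g

let () =
  let g = [ { name = "Set"; status = Visited };
            { name = "Coq.Init.Logic.9"; status = NoMark } ] in
  unmark g;
  assert (List.for_all (fun n -> n.status = NoMark) g)
```

This is ugly in that it papers over the race rather than removing it, but it is local to the worker and easy to test, which matches the trade-off described above.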

I've no clue if a better fix is possible, I don't know if the invariant can be weakened and the algorithm made "concurrent".

In any case, kudos to you Ultradebugger

So, after all, the CI job would have discovered this problem when the new algorithm was introduced... if only it had been there back then.

with the randomness I don't know if we would have known what change caused the failures

Sure, this is a really nasty bug, but if you managed to find the issue *now*, you would have also found it back then, maybe even without ultradebug. If a job starts to fail in a crazy way, even only once a day, well, you don't have many merged PRs / changes to double check.

When the job was finally put in place, we had a few recurring issues. Together with Maxime we fixed those we could reproduce, but this one (or the segfaults in reals/) were out of my league.

it's closer to once a month than once a day though

IIRC

PS: the future in the proof_object has the name "unnamed", if that helps tell what's going on with workers

I'll keep this riddle for Monday ;-)

Last updated: Aug 03 2024 at 00:02 UTC