Stream: Coq devs & plugin devs

Topic: Deadlock in OCaml runtime


view this post on Zulip Lasse (Mar 12 2021 at 18:37):

Hi All,
I'm doing some experiments with timeout functions, and I'm encountering yet another really strange bug. Previously, the timeout facility had many problems, but they have all been fixed (most recently https://github.com/coq/coq/pull/13377). Those bugs caused the timeout not to be triggered, letting the tactic run forever. However, now I am encountering a problem where Coq completely deadlocks (the process enters a sleeping state and never exists). This is pretty crazy, because as far as I know, Coq is not multithreaded. This makes me think there is a bug in the OCaml runtime.

Steps to reproduce: Compile the following script under Coq master. After 5 to 10 minutes (on my machine), the coqc process will enter a sleeping state.

Ltac loop t := case t; loop t.
Lemma test (P : Prop) (b : bool) : P.
do 1000 ((timeout 1 loop b) || idtac "hi").
Qed.

I ran the process under strace, of which the tail is as follows:

getitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}) = 0
rt_sigaction(SIGALRM, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fdea5efc960}, {sa_handler=0x55be595567b0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fdea5efc960}, 8) = 0
write(1, "hi\n", 3hi
)                     = 3
getitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}) = 0
rt_sigaction(SIGALRM, {sa_handler=0x55be595567b0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fdea5efc960}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fdea5efc960}, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=1, tv_usec=0}}, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}) = 0
futex(0x55be5b56d450, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
futex(0x55be5b56d450, FUTEX_WAIT_PRIVATE, 2, NULL

The code waits for a mutex (the last fuxtex call), and never returns from that. I cannot find any mutex-calling code in Coq. OCaml's runtime does use mutexes, which makes me think the runtime is somehow screwing up in handling the SIGALRM signal. There are some vaguely related recent PR's in the ocaml repo recently: https://github.com/ocaml/ocaml/pull/9846 and https://github.com/ocaml/ocaml/pull/9722

Do any of you know how to proceed with this? Do you concur that this is likely an OCaml issue, and should I escalate it there?

view this post on Zulip Enrico Tassi (Mar 12 2021 at 18:49):

Frankly I don't know what to suggest. Do you really really really need to use timeout?

view this post on Zulip Enrico Tassi (Mar 12 2021 at 18:49):

Because any boring fuel-based piece of code will surely work. It will suck, but will work.

view this post on Zulip Lasse (Mar 12 2021 at 18:55):

In a normal situation, I would indeed avoid timeout. However, deciding whether a tactic terminates or not is undecidable, and for Tactician I'm dealing with black-box tactics that I cannot just modify to use fuel. The only way I can think of to use fuel is to built this into the proof monad somehow. But that would be a rather invasive change.

This all is very crazy to me. The timeout function is not some super complicated function, and in any other programming language things like this just work. Surely other OCaml programs also use timeouts, and signals etc right (like the async library for example)?

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:01):

I don't think it would be crazy to add a timeout facility to the proof monad. It's a monad, you check the time in bind... Of course nested timeouts are harder, but a global one would be simple. @Pierre-Marie Pédrot what do you think?

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:02):

expect catastrophic perfs

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:03):

My little exprerience with these libraries is that they don't work reliably.

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:03):

@Lasse it's not as if I didn't comment in the PR at the time that this code was surely wrong

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:04):

Well, in this case it might be that the error is in the OCaml runtime, but still

view this post on Zulip Lasse (Mar 12 2021 at 19:04):

@Pierre-Marie Pédrot I know, but I can not see how this code could possibly cause a mutex-lock to never return.

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:05):

If you can emulate it as an OCaml standalone file it might be a good thing to report it on the OCaml bugtracker

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:06):

The Ltac code looks simple enough to be emulated by a simple OCaml script

view this post on Zulip Lasse (Mar 12 2021 at 19:06):

Yeah, I've been considering doing that, but this might be pretty difficult. Might give it a try anyway

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:06):

Ok, maybe checking a bool (is timeout set) at every bind is too much, but maybe at each tclTHEN it is not too much

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:06):

I mean, at each ;

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:07):

(also it's breaking referential transparence)

view this post on Zulip Lasse (Mar 12 2021 at 19:07):

@Enrico Tassi I don't think that is enough. For example, a rewrite ?H can also easily hang

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:07):

(but that's already the case with timeout alone)

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:07):

internally it uses ;

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:07):

or try

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:07):

we already have active polling as Control.check_for_interrupt

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:07):

we should probably rely on that

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:08):

@Lasse I am running this on my machine to see whether it is reproducible

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:08):

Seriously, timout is not a common thing. Even if you get 30% perf hit, I would not care

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:09):

Check for interrupt is also fine, if it works

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:09):

I confirm I can reproduce

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:10):

(under CoqIDE at least)

view this post on Zulip Lasse (Mar 12 2021 at 19:10):

Great, so it is not just me being crazy then...

view this post on Zulip Lasse (Mar 12 2021 at 19:11):

If all tactics that could potentially loop use tclTHEN under the covers, that would be fine with me. But I kind of suspect that there may also be other ways of creating a loop

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:13):

Yes, it can just take a lot of time in conversion or unification

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:13):

That's the reason we have polling for these functions

view this post on Zulip Lasse (Mar 12 2021 at 19:13):

I assume that this polling is for the purpose of the user Aborting?

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:14):

yes

view this post on Zulip Enrico Tassi (Mar 12 2021 at 19:14):

I'd say "interrupting Coq"

view this post on Zulip Lasse (Mar 12 2021 at 19:14):

Right, the big red button in Coqide

view this post on Zulip Pierre-Marie Pédrot (Mar 12 2021 at 19:15):

Lock reproduced in coqc as well

view this post on Zulip Enrico Tassi (Mar 12 2021 at 20:37):

That is interesting, since coqc should not really have any threads

view this post on Zulip Lasse (Mar 12 2021 at 20:37):

Exactly, that is why I'm thinking this must be an issue in the OCaml runtime.

view this post on Zulip Enrico Tassi (Mar 12 2021 at 20:39):

FTR, coqidetop has threads, there is one per proof worker reading on a socket.

view this post on Zulip Enrico Tassi (Mar 12 2021 at 20:39):

they are not "computing" threads, there is no parallelism

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:43):

Indeed, my first suggestion is to try to reproduce with the simple compiler

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:43):

that should be free of Thread, but I will verify.

view this post on Zulip Lasse (Mar 12 2021 at 20:43):

What do you mean with the simple compiler? Just coqc?

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:44):

Nope is a special compiler branch we have to actually remove the very tricky part that is the document manager [and as @Enrico Tassi pointed out sometimes uses Thread] from the equation

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:45):

I actually maybe upstream it soon, as an internal tool [it is very simple]

view this post on Zulip Enrico Tassi (Mar 12 2021 at 20:45):

coqc is still linked with the threads library, but should now spawn any thread

view this post on Zulip Lasse (Mar 12 2021 at 20:45):

Ah, okay that would indeed be useful

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:45):

https://github.com/ejgallego/coq/tree/simple_compiler

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:46):

here it is , but I need to rebase etc... don't have the time today

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:46):

likely it doesn't change anything but we can remove one variable

view this post on Zulip Lasse (Mar 12 2021 at 20:46):

Right, certainly good to try. Thanks

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:46):

Also try with OCaml 4.12

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:46):

there is a PR, but you should be able to compile just disabling warn-error

view this post on Zulip Lasse (Mar 12 2021 at 20:46):

Yeah, I wanted to do that, but I think Coq needs a patch right?

view this post on Zulip Lasse (Mar 12 2021 at 20:47):

Ah okay, will try

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:47):

let me rebase simple_compiler in a sec

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 20:58):

done

view this post on Zulip Lasse (Mar 12 2021 at 20:59):

Thanks! I'll hopefully try it out tomorrow. Should I just compile it and then run coqc as normal?

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 21:03):

Yup, tho coqc is more limited here

view this post on Zulip Emilio Jesús Gallego Arias (Mar 12 2021 at 21:03):

use make -f Makefile.dune world to compile

view this post on Zulip Lasse (Mar 15 2021 at 07:50):

I can still reproduce the issue with @Emilio Jesús Gallego Arias's simplified compiler. So I guess the next step will be to investigate the OCaml compiler. At least trying the newest version. If anyone has other ideas please let me know.

view this post on Zulip Lasse (Mar 15 2021 at 08:53):

I am now thinking that this is the result of a bug in glibc. See https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime/6542/ https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c8 https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800
Someone even verified this bug in TCLA+: https://probablydance.com/2020/10/31/using-tla-in-the-real-world-to-understand-a-glibc-bug/

This all sounds very similar to this issue here. Looks like not only OCaml is affected, but also Python .NET and more

view this post on Zulip Enrico Tassi (Mar 15 2021 at 09:01):

Nice finding! Apparently the bug is still open upstream... The option of using something like Control.check_for_interrupt seems to be the best one, at least you don't nee a glibc release in order to have your code work.

view this post on Zulip Lasse (Mar 15 2021 at 09:03):

Yes, ubuntu has patched glibc recently but the glibc project itself is still debating the patch. I'll first verify if this bug is indeed the culprit (by trying an older version of glibc that didn't have the bug.

view this post on Zulip Lasse (Mar 15 2021 at 20:12):

Mhm, I ran the script on a centos server with glibc 2.17 (which should be unaffected according to the bug reports), and I can still reproduce the problem. So this goes back to being mystery I guess.

view this post on Zulip Emilio Jesús Gallego Arias (Mar 15 2021 at 21:14):

@Lasse I've pushed a new version of the simple compiler that removes the need for thread really, there are still a few uses that could have weird effect, in particular Thread is called in exception handlers which could create a mess. I just pushed the code to the branch and didn't do an analysis as maybe we can test first and analyze later.

view this post on Zulip Lasse (Mar 16 2021 at 02:47):

@Emilio Jesús Gallego Arias With your new commit, I don't seem to be able to reproduce the problem anymore. Good call! Looking through the code you removed, the most suspect part is indeed in exinfo.ml. Looks like those mutexes are somehow deadlocking. I'll try to take a closer look tomorrow.

view this post on Zulip Emilio Jesús Gallego Arias (Mar 16 2021 at 14:27):

Yeah that mess with Exninfo was always very suspicious to me, didn't have the time study it tho, I did perform some large cleanup of exn in Coq but didn't reach that far.

view this post on Zulip Emilio Jesús Gallego Arias (Mar 16 2021 at 14:27):

Please open a bug as I'm sure other people will understand quickly was going on.

view this post on Zulip Lasse (Mar 17 2021 at 03:09):

See https://github.com/coq/coq/issues/13949


Last updated: Oct 21 2021 at 21:03 UTC