Hi All,
I'm doing some experiments with timeout functions, and I'm encountering yet another really strange bug. Previously, the timeout facility had many problems, but they have all been fixed (most recently https://github.com/coq/coq/pull/13377). Those bugs caused the timeout not to be triggered, letting the tactic run forever. However, now I am encountering a problem where Coq completely deadlocks (the process enters a sleeping state and never exists). This is pretty crazy, because as far as I know, Coq is not multithreaded. This makes me think there is a bug in the OCaml runtime.
Steps to reproduce: Compile the following script under Coq master. After 5 to 10 minutes (on my machine), the coqc
process will enter a sleeping state.
Ltac loop t := case t; loop t.
Lemma test (P : Prop) (b : bool) : P.
do 1000 ((timeout 1 loop b) || idtac "hi").
Qed.
I ran the process under strace
, of which the tail is as follows:
getitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}) = 0
rt_sigaction(SIGALRM, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fdea5efc960}, {sa_handler=0x55be595567b0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fdea5efc960}, 8) = 0
write(1, "hi\n", 3hi
) = 3
getitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}) = 0
rt_sigaction(SIGALRM, {sa_handler=0x55be595567b0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fdea5efc960}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fdea5efc960}, 8) = 0
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=1, tv_usec=0}}, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=0, tv_usec=0}}) = 0
futex(0x55be5b56d450, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0x55be5b56d450, FUTEX_WAIT_PRIVATE, 2, NULL
The code waits for a mutex (the last fuxtex
call), and never returns from that. I cannot find any mutex-calling code in Coq. OCaml's runtime does use mutexes, which makes me think the runtime is somehow screwing up in handling the SIGALRM
signal. There are some vaguely related recent PR's in the ocaml repo recently: https://github.com/ocaml/ocaml/pull/9846 and https://github.com/ocaml/ocaml/pull/9722
Do any of you know how to proceed with this? Do you concur that this is likely an OCaml issue, and should I escalate it there?
Frankly I don't know what to suggest. Do you really really really need to use timeout?
Because any boring fuel-based piece of code will surely work. It will suck, but will work.
In a normal situation, I would indeed avoid timeout
. However, deciding whether a tactic terminates or not is undecidable, and for Tactician I'm dealing with black-box tactics that I cannot just modify to use fuel. The only way I can think of to use fuel is to built this into the proof monad somehow. But that would be a rather invasive change.
This all is very crazy to me. The timeout function is not some super complicated function, and in any other programming language things like this just work. Surely other OCaml programs also use timeouts, and signals etc right (like the async
library for example)?
I don't think it would be crazy to add a timeout facility to the proof monad. It's a monad, you check the time in bind... Of course nested timeouts are harder, but a global one would be simple. @Pierre-Marie Pédrot what do you think?
expect catastrophic perfs
My little exprerience with these libraries is that they don't work reliably.
@Lasse it's not as if I didn't comment in the PR at the time that this code was surely wrong
Well, in this case it might be that the error is in the OCaml runtime, but still
@Pierre-Marie Pédrot I know, but I can not see how this code could possibly cause a mutex-lock to never return.
If you can emulate it as an OCaml standalone file it might be a good thing to report it on the OCaml bugtracker
The Ltac code looks simple enough to be emulated by a simple OCaml script
Yeah, I've been considering doing that, but this might be pretty difficult. Might give it a try anyway
Ok, maybe checking a bool (is timeout set) at every bind is too much, but maybe at each tclTHEN it is not too much
I mean, at each ;
(also it's breaking referential transparence)
@Enrico Tassi I don't think that is enough. For example, a rewrite ?H
can also easily hang
(but that's already the case with timeout alone)
internally it uses ;
or try
we already have active polling as Control.check_for_interrupt
we should probably rely on that
@Lasse I am running this on my machine to see whether it is reproducible
Seriously, timout is not a common thing. Even if you get 30% perf hit, I would not care
Check for interrupt is also fine, if it works
I confirm I can reproduce
(under CoqIDE at least)
Great, so it is not just me being crazy then...
If all tactics that could potentially loop use tclTHEN
under the covers, that would be fine with me. But I kind of suspect that there may also be other ways of creating a loop
Yes, it can just take a lot of time in conversion or unification
That's the reason we have polling for these functions
I assume that this polling is for the purpose of the user Aborting?
yes
I'd say "interrupting Coq"
Right, the big red button in Coqide
Lock reproduced in coqc as well
That is interesting, since coqc should not really have any threads
Exactly, that is why I'm thinking this must be an issue in the OCaml runtime.
FTR, coqidetop has threads, there is one per proof worker reading on a socket.
they are not "computing" threads, there is no parallelism
Indeed, my first suggestion is to try to reproduce with the simple compiler
that should be free of Thread
, but I will verify.
What do you mean with the simple compiler? Just coqc
?
Nope is a special compiler branch we have to actually remove the very tricky part that is the document manager [and as @Enrico Tassi pointed out sometimes uses Thread] from the equation
I actually maybe upstream it soon, as an internal tool [it is very simple]
coqc is still linked with the threads library, but should now spawn any thread
Ah, okay that would indeed be useful
https://github.com/ejgallego/coq/tree/simple_compiler
here it is , but I need to rebase etc... don't have the time today
likely it doesn't change anything but we can remove one variable
Right, certainly good to try. Thanks
Also try with OCaml 4.12
there is a PR, but you should be able to compile just disabling warn-error
Yeah, I wanted to do that, but I think Coq needs a patch right?
Ah okay, will try
let me rebase simple_compiler in a sec
done
Thanks! I'll hopefully try it out tomorrow. Should I just compile it and then run coqc
as normal?
Yup, tho coqc is more limited here
use make -f Makefile.dune world
to compile
I can still reproduce the issue with @Emilio Jesús Gallego Arias's simplified compiler. So I guess the next step will be to investigate the OCaml compiler. At least trying the newest version. If anyone has other ideas please let me know.
I am now thinking that this is the result of a bug in glibc
. See https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime/6542/ https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c8 https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800
Someone even verified this bug in TCLA+: https://probablydance.com/2020/10/31/using-tla-in-the-real-world-to-understand-a-glibc-bug/
This all sounds very similar to this issue here. Looks like not only OCaml is affected, but also Python .NET and more
Nice finding! Apparently the bug is still open upstream... The option of using something like Control.check_for_interrupt
seems to be the best one, at least you don't nee a glibc release in order to have your code work.
Yes, ubuntu has patched glibc recently but the glibc project itself is still debating the patch. I'll first verify if this bug is indeed the culprit (by trying an older version of glibc that didn't have the bug.
Mhm, I ran the script on a centos server with glibc 2.17 (which should be unaffected according to the bug reports), and I can still reproduce the problem. So this goes back to being mystery I guess.
@Lasse I've pushed a new version of the simple compiler that removes the need for thread really, there are still a few uses that could have weird effect, in particular Thread
is called in exception handlers which could create a mess. I just pushed the code to the branch and didn't do an analysis as maybe we can test first and analyze later.
@Emilio Jesús Gallego Arias With your new commit, I don't seem to be able to reproduce the problem anymore. Good call! Looking through the code you removed, the most suspect part is indeed in exinfo.ml
. Looks like those mutexes are somehow deadlocking. I'll try to take a closer look tomorrow.
Yeah that mess with Exninfo was always very suspicious to me, didn't have the time study it tho, I did perform some large cleanup of exn in Coq but didn't reach that far.
Please open a bug as I'm sure other people will understand quickly was going on.
See https://github.com/coq/coq/issues/13949
Last updated: Dec 06 2023 at 13:01 UTC