Hi, given a char Stream.t
, I'd like to parse a vernac sentence and get back both the AST and the list of consumed tokens. Does anybody know how to do that?
Maybe @Enrico Tassi already told me how to do so, but I forgot :(
What I suggested to do is to take a string s, parse it, get the last loc from the AST, chop the string there, call the lexer on the substring. the parser only gives you the loc, AFAIK it does not keep consumed tokens.
trying to understand step by step: the parser takes a stream, not a string, doesn't it?
Sure, but you should have the originating string no?
in the doc manager
IIRC you already chop the text using the parser
what I suggested to do it lex again these strings
yes and no, because the stream also maintains the current position in the string
maybe there's a way to construct an already partially consumed stream, but I'm not sure
although yeah I could consume it by hand again
You could use the Stream API to get the end offset of last consumed token, not very different from the end-loc of the sentence
so the next question is: how to call the lexer?
IIRC there is a API to lex a string (it is used by the diffs IIRC)
I tried to find it, but not trivial to see in parsing/cLexer.mli
the doc does not seem to mention how to call the lexer...
maybe I'm just blind
ah yes, proof diffs seem to have something similar, but there's a comment saying that the author did not understand the side effects done by the lexer
also, the lexing function does not seem to return the locations, does it? I remember we struggled with it some time ago to implement parsing hacks.
printing/proof_diffs.ml tokenize_string
yes, it is not in CLexer... why would it be
the locs are part of the function returned by CLexer when you build the stream IIRC
see gramlib/plexing.mli tok_fun, it returns two things, the latter is a way to get the loc of a token
but the lexer of Coq keeps that info only locally, no?
see CLexer.func
Does it mean I should change Coq's lexer? define a new one?
Enrico Tassi said:
the locs are part of the function returned by CLexer when you build the stream IIRC
This is the part I don't see, sorry if I'm slow, it seemed to me CLexer.Lexer
was hiding these locations.
Ah sorry, now I see, it returns the locs separately as a tabulated function
I see that tokenize_string
saves and restores the lexer state at each call to the lexer, why is this required?
Ok, I managed to call the lexer, but I'm still stuck with the other problem: what I have initially is a stream, not a string. If I call the parser on it, the stream is mutated and I lost the parsed characters, so I can't retokenize...
I've no idea. To me the state of the lexer is the set of keywords... so it should not change...
Maxime Dénès said:
Ok, I managed to call the lexer, but I'm still stuck with the other problem: what I have initially is a stream, not a string. If I call the parser on it, the stream is mutated and I lost the parsed characters, so I can't retokenize...
Where are you? I mean, in the doc manager you also have the string no?
Are you trying to implement the API in the parser of Coq? Maybe there you are too low level to do as I suggested.
I am in the document manager indeed, which maintains the stream, not the string
I could switch to maintaining the string, but how do I do so while not rescanning the beginning each time I want to parse a new sentence?
(to be clear, the document manager does maintain the string of the whole text, but when parsing a sentence, then the next, it is the stream which is passed, so as not to rescan the text from the beginning)
I'm trying to not touch the parser of Coq ;)
Maxime Dénès said:
(to be clear, the document manager does maintain the string of the whole text, but when parsing a sentence, then the next, it is the stream which is passed, so as not to rescan the text from the beginning)
premature optimization? You can surely build a stream in o(1) out of a string an an offset. I don't know if there is already an API for that, but it is surelly possible.
You can copy the code from https://github.com/ocaml/ocaml/blob/trunk/stdlib/stream.ml#L166 and use you offset instead of 0
At some you'll have to tweak the parser / lexer interface if you want proper incremental parsing. IMO the easiest way is to follow Menhir's API.
SerAPI does include some code for tokenization, etc... it needs to use ugly hacks such as manually slicing the strings
Does Menhir give you access to the list of tokens that were parsed before stopping?
In fact it could be interesting to upstream some of that code
menhir gives the lexer control back using a continuation IIRC
keeping the list of tokens would be interesting indeed, but seems a bit orthongonal w.r.t parsing resumption
sertok
has to indeed perform a separate call to the lexer and the parser :S
We didn't make the Coq parser return the list of located tokens as the were afraid of the performance impact
But that functionality would be quite useful for UIs
I am not sure how merlin is handling lexbuf tho, which while imperative keeps the full buffer in the object [at least for strings
Enrico Tassi said:
premature optimization? You can surely build a stream in o(1) out of a string an an offset. I don't know if there is already an API for that, but it is surelly possible.
I don't think this is premature optimization, I just followed the interface of the parser (wasn't trying to optimize).
I tried your approach, but now I get wrong locations in ASTs generated by the parser (I guess it is based on Stream.count
). How to fix it?
Ok, I managed to do it with the same idea as sertok (use the regular parsing interface, and extract a substring to relex).
Beware that there seem to be some bugs w.r.t. the reported location by the parser/lexer, specially when notations are involved, nothing too serious but depending on your use case they could create a mess.
Emilio Jesús Gallego Arias said:
We didn't make the Coq parser return the list of located tokens as the were afraid of the performance impact
If you want some benchmarks of parsing, I can give you some benchmarks of parsing (+ typechecking). I have some benchmarks cut off at the point where either the parser or the underlying term AST stack overflows or something like that.
@Jason Gross Are the benchmarks publicly available?
I recently did some stress testing of a System F-based language I'm working on,
so I decided to try some of the examples on Coq too.
E.g. the following simple test stackoverflows:
echo "Definition foo :=\n" > coq_stress.v; perl -e 'print "let x := 0 in\n" x 6288' >> coq_stress.v; echo "x.\nCheck foo." >> coq_stress.v
ulimit -n 24000; coqc coq_stress.v
6288
is precisely the number of lines with let x := 0 in
when I start seeing
File "", line 1, characters 88046-88047:
Error: Stack overflow.
https://github.com/JasonGross/doctoral-thesis/blob/master/performance-experiments/make_destruct_nested_record.sh
https://github.com/JasonGross/doctoral-thesis/blob/master/performance-experiments-8-9/nested_lambda_same_name.sh
https://github.com/JasonGross/doctoral-thesis/blob/master/performance-experiments-8-9/nested_lambda_no_name.sh
https://github.com/JasonGross/doctoral-thesis/blob/master/performance-experiments-8-9/nested_lambda_different_name.sh
@Anton Trunov Where's the stack overflow? What happens if you fill in all the typing information rather than creating evars? (let x : nat := 0 in ...
)
It's hard to tell where the stackoverflow is happening exactly. Probably in the parser given that even Coq's calculation of line number is not correct here and the fact that the problem persists even if I remove Check foo
at the end. Adding : nat
does not change anything except the character count is obviously different now.
Thanks for the links, @Jason Gross
@Anton Trunov You can get a bit more information. With 8.11.1:
$ ulimit -n 24000; coqc -q -debug coq_stress.v
File "", line 1, characters 79348-79349:
Error:
Stack overflow.
frame @ file "toplevel/coqc.ml", line 67, characters 4-25
frame @ file "toplevel/coqc.ml", line 45, characters 2-81
frame @ file "list.ml", line 100, characters 12-15
frame @ file "toplevel/ccompile.ml", line 214, characters 2-39
frame @ file "toplevel/ccompile.ml", line 148, characters 18-89
frame @ file "toplevel/vernac.ml", line 170, characters 30-88
raise @ file "clib/exninfo.ml", line 65, characters 2-15
frame @ file "toplevel/vernac.ml", line 114, characters 6-19
frame @ file "toplevel/vernac.ml", line 97, characters 6-89
raise @ file "clib/exninfo.ml", line 65, characters 2-15
frame @ file "lib/flags.ml", line 17, characters 14-17
raise @ unknown
frame @ file "vernac/vernacstate.ml", line 22, characters 10-35
raise @ file "clib/exninfo.ml", line 65, characters 2-15
And the line information is correct, I believe; it reports the line number where the sentence begins, and this is saying that it occurs in the sentence starting on line 1. (There's probably a bug to be reported here, though.)
This deserves a double bug report, first for the lost backtrace, second for the stack overflow
the stack overflow happens in
Raised by primitive operation at file "list.ml", line 180, characters 12-23
Called from file "parsing/cLexer.ml", line 401, characters 25-47
Called from file "parsing/cLexer.ml", line 665, characters 19-30
Called from file "parsing/cLexer.ml", line 819, characters 26-48
Called from file "stream.ml", line 86, characters 21-35
Called from file "stream.ml" (inlined), line 95, characters 14-25
Called from file "gramlib/grammar.ml", line 1390, characters 10-26
Called from file "gramlib/grammar.ml", line 1202, characters 28-37
Called from file "gramlib/grammar.ml", line 1423, characters 27-36
Called from file "gramlib/grammar.ml", line 1202, characters 28-37
Called from file "gramlib/grammar.ml", line 1423, characters 27-36
Called from file "gramlib/grammar.ml", line 1139, characters 10-17
Called from file "gramlib/grammar.ml" (inlined), line 1267, characters 15-61
Called from file "gramlib/grammar.ml", line 1269, characters 23-37
Called from file "gramlib/grammar.ml", line 1274, characters 17-31
Called from file "gramlib/grammar.ml", line 1175, characters 21-30
Called from file "gramlib/grammar.ml", line 1423, characters 27-36
Called from file "gramlib/grammar.ml", line 1175, characters 21-30
Called from file "gramlib/grammar.ml", line 1224, characters 6-15
Called from file "gramlib/grammar.ml" (inlined), line 1198, characters 19-57
Called from file "gramlib/grammar.ml", line 1205, characters 29-43
Called from file "gramlib/grammar.ml", line 1224, characters 6-15
So indeed that's deep in the parser mutual loop
Thank you for looking into that. Reported as https://github.com/coq/coq/issues/12438
Last updated: Dec 06 2023 at 13:01 UTC