Parsing interface · Coq devs & plugin devs

Stream: Coq devs & plugin devs

Topic: Parsing interface

Maxime Dénès (May 27 2020 at 13:15):

Hi, given a char Stream.t, I'd like to parse a vernac sentence and get back both the AST and the list of consumed tokens. Does anybody know how to do that?

Maxime Dénès (May 27 2020 at 13:15):

Maybe @Enrico Tassi already told me how to do so, but I forgot :(

Enrico Tassi (May 27 2020 at 13:20):

What I suggested to do is to take a string s, parse it, get the last loc from the AST, chop the string there, call the lexer on the substring. the parser only gives you the loc, AFAIK it does not keep consumed tokens.

Maxime Dénès (May 27 2020 at 13:24):

trying to understand step by step: the parser takes a stream, not a string, doesn't it?

Enrico Tassi (May 27 2020 at 13:26):

Sure, but you should have the originating string no?
in the doc manager

Enrico Tassi (May 27 2020 at 13:26):

IIRC you already chop the text using the parser

Enrico Tassi (May 27 2020 at 13:26):

what I suggested to do it lex again these strings

Maxime Dénès (May 27 2020 at 13:26):

yes and no, because the stream also maintains the current position in the string

Maxime Dénès (May 27 2020 at 13:27):

maybe there's a way to construct an already partially consumed stream, but I'm not sure

Maxime Dénès (May 27 2020 at 13:27):

although yeah I could consume it by hand again

Enrico Tassi (May 27 2020 at 13:27):

You could use the Stream API to get the end offset of last consumed token, not very different from the end-loc of the sentence

Maxime Dénès (May 27 2020 at 13:28):

so the next question is: how to call the lexer?

Enrico Tassi (May 27 2020 at 13:28):

IIRC there is a API to lex a string (it is used by the diffs IIRC)

Maxime Dénès (May 27 2020 at 13:29):

I tried to find it, but not trivial to see in parsing/cLexer.mli

Maxime Dénès (May 27 2020 at 13:29):

the doc does not seem to mention how to call the lexer...

Maxime Dénès (May 27 2020 at 13:30):

maybe I'm just blind

Maxime Dénès (May 27 2020 at 13:33):

ah yes, proof diffs seem to have something similar, but there's a comment saying that the author did not understand the side effects done by the lexer

Maxime Dénès (May 27 2020 at 13:35):

also, the lexing function does not seem to return the locations, does it? I remember we struggled with it some time ago to implement parsing hacks.

Enrico Tassi (May 27 2020 at 13:35):

printing/proof_diffs.ml tokenize_string

Enrico Tassi (May 27 2020 at 13:36):

yes, it is not in CLexer... why would it be

Enrico Tassi (May 27 2020 at 13:36):

the locs are part of the function returned by CLexer when you build the stream IIRC

Enrico Tassi (May 27 2020 at 13:39):

see gramlib/plexing.mli tok_fun, it returns two things, the latter is a way to get the loc of a token

Maxime Dénès (May 27 2020 at 13:39):

but the lexer of Coq keeps that info only locally, no?

Maxime Dénès (May 27 2020 at 13:40):

see CLexer.func

Maxime Dénès (May 27 2020 at 13:40):

Does it mean I should change Coq's lexer? define a new one?

Maxime Dénès (May 27 2020 at 13:42):

Enrico Tassi said:

the locs are part of the function returned by CLexer when you build the stream IIRC

This is the part I don't see, sorry if I'm slow, it seemed to me CLexer.Lexer was hiding these locations.

Maxime Dénès (May 27 2020 at 13:44):

Ah sorry, now I see, it returns the locs separately as a tabulated function

Maxime Dénès (May 27 2020 at 13:47):

I see that tokenize_string saves and restores the lexer state at each call to the lexer, why is this required?

Maxime Dénès (May 27 2020 at 13:59):

Ok, I managed to call the lexer, but I'm still stuck with the other problem: what I have initially is a stream, not a string. If I call the parser on it, the stream is mutated and I lost the parsed characters, so I can't retokenize...

Enrico Tassi (May 27 2020 at 14:13):

I've no idea. To me the state of the lexer is the set of keywords... so it should not change...

Enrico Tassi (May 27 2020 at 14:14):

Maxime Dénès said:

Ok, I managed to call the lexer, but I'm still stuck with the other problem: what I have initially is a stream, not a string. If I call the parser on it, the stream is mutated and I lost the parsed characters, so I can't retokenize...

Where are you? I mean, in the doc manager you also have the string no?

Enrico Tassi (May 27 2020 at 14:14):

Are you trying to implement the API in the parser of Coq? Maybe there you are too low level to do as I suggested.

Maxime Dénès (May 27 2020 at 14:15):

I am in the document manager indeed, which maintains the stream, not the string

Maxime Dénès (May 27 2020 at 14:15):

I could switch to maintaining the string, but how do I do so while not rescanning the beginning each time I want to parse a new sentence?

Maxime Dénès (May 27 2020 at 14:16):

(to be clear, the document manager does maintain the string of the whole text, but when parsing a sentence, then the next, it is the stream which is passed, so as not to rescan the text from the beginning)

Maxime Dénès (May 27 2020 at 14:20):

I'm trying to not touch the parser of Coq ;)

Enrico Tassi (May 27 2020 at 14:30):

Maxime Dénès said:

(to be clear, the document manager does maintain the string of the whole text, but when parsing a sentence, then the next, it is the stream which is passed, so as not to rescan the text from the beginning)

premature optimization? You can surely build a stream in o(1) out of a string an an offset. I don't know if there is already an API for that, but it is surelly possible.

Enrico Tassi (May 27 2020 at 14:39):

You can copy the code from https://github.com/ocaml/ocaml/blob/trunk/stdlib/stream.ml#L166 and use you offset instead of 0

Emilio Jesús Gallego Arias (May 27 2020 at 14:40):

At some you'll have to tweak the parser / lexer interface if you want proper incremental parsing. IMO the easiest way is to follow Menhir's API.

SerAPI does include some code for tokenization, etc... it needs to use ugly hacks such as manually slicing the strings

Enrico Tassi (May 27 2020 at 14:41):

Does Menhir give you access to the list of tokens that were parsed before stopping?

Emilio Jesús Gallego Arias (May 27 2020 at 14:41):

In fact it could be interesting to upstream some of that code

Emilio Jesús Gallego Arias (May 27 2020 at 14:43):

menhir gives the lexer control back using a continuation IIRC

Emilio Jesús Gallego Arias (May 27 2020 at 14:43):

keeping the list of tokens would be interesting indeed, but seems a bit orthongonal w.r.t parsing resumption

Emilio Jesús Gallego Arias (May 27 2020 at 14:43):

sertok has to indeed perform a separate call to the lexer and the parser :S

Emilio Jesús Gallego Arias (May 27 2020 at 14:44):

We didn't make the Coq parser return the list of located tokens as the were afraid of the performance impact

Emilio Jesús Gallego Arias (May 27 2020 at 14:45):

But that functionality would be quite useful for UIs

Emilio Jesús Gallego Arias (May 27 2020 at 14:48):

I am not sure how merlin is handling lexbuf tho, which while imperative keeps the full buffer in the object [at least for strings

Maxime Dénès (May 31 2020 at 08:54):

Enrico Tassi said:

premature optimization? You can surely build a stream in o(1) out of a string an an offset. I don't know if there is already an API for that, but it is surelly possible.

I don't think this is premature optimization, I just followed the interface of the parser (wasn't trying to optimize).

I tried your approach, but now I get wrong locations in ASTs generated by the parser (I guess it is based on Stream.count). How to fix it?

Maxime Dénès (May 31 2020 at 10:45):

Ok, I managed to do it with the same idea as sertok (use the regular parsing interface, and extract a substring to relex).

Emilio Jesús Gallego Arias (May 31 2020 at 15:06):

Beware that there seem to be some bugs w.r.t. the reported location by the parser/lexer, specially when notations are involved, nothing too serious but depending on your use case they could create a mess.

Jason Gross (May 31 2020 at 23:38):

Emilio Jesús Gallego Arias said:

We didn't make the Coq parser return the list of located tokens as the were afraid of the performance impact

If you want some benchmarks of parsing, I can give you some benchmarks of parsing (+ typechecking). I have some benchmarks cut off at the point where either the parser or the underlying term AST stack overflows or something like that.

Anton Trunov (Jun 02 2020 at 14:45):

@Jason Gross Are the benchmarks publicly available?

I recently did some stress testing of a System F-based language I'm working on,
so I decided to try some of the examples on Coq too.

E.g. the following simple test stackoverflows:

echo "Definition foo :=\n" > coq_stress.v; perl -e 'print "let x := 0 in\n" x 6288' >> coq_stress.v; echo "x.\nCheck foo." >> coq_stress.v
ulimit -n 24000; coqc coq_stress.v

6288 is precisely the number of lines with let x := 0 in when I start seeing

File "", line 1, characters 88046-88047:
Error: Stack overflow.

Jason Gross (Jun 02 2020 at 19:32):

https://github.com/JasonGross/doctoral-thesis/blob/master/performance-experiments/make_destruct_nested_record.sh
https://github.com/JasonGross/doctoral-thesis/blob/master/performance-experiments-8-9/nested_lambda_same_name.sh
https://github.com/JasonGross/doctoral-thesis/blob/master/performance-experiments-8-9/nested_lambda_no_name.sh
https://github.com/JasonGross/doctoral-thesis/blob/master/performance-experiments-8-9/nested_lambda_different_name.sh

Jason Gross (Jun 02 2020 at 19:34):

@Anton Trunov Where's the stack overflow? What happens if you fill in all the typing information rather than creating evars? (let x : nat := 0 in ...)

Anton Trunov (Jun 02 2020 at 20:24):

It's hard to tell where the stackoverflow is happening exactly. Probably in the parser given that even Coq's calculation of line number is not correct here and the fact that the problem persists even if I remove Check foo at the end. Adding : nat does not change anything except the character count is obviously different now.

Anton Trunov (Jun 02 2020 at 20:29):

Thanks for the links, @Jason Gross

Jason Gross (Jun 02 2020 at 21:13):

@Anton Trunov You can get a bit more information. With 8.11.1:

$ ulimit -n 24000; coqc -q -debug coq_stress.v
File "", line 1, characters 79348-79349:
Error:
Stack overflow.
frame @ file "toplevel/coqc.ml", line 67, characters 4-25
frame @ file "toplevel/coqc.ml", line 45, characters 2-81
frame @ file "list.ml", line 100, characters 12-15
frame @ file "toplevel/ccompile.ml", line 214, characters 2-39
frame @ file "toplevel/ccompile.ml", line 148, characters 18-89
frame @ file "toplevel/vernac.ml", line 170, characters 30-88
raise @ file "clib/exninfo.ml", line 65, characters 2-15
frame @ file "toplevel/vernac.ml", line 114, characters 6-19
frame @ file "toplevel/vernac.ml", line 97, characters 6-89
raise @ file "clib/exninfo.ml", line 65, characters 2-15
frame @ file "lib/flags.ml", line 17, characters 14-17
raise @ unknown
frame @ file "vernac/vernacstate.ml", line 22, characters 10-35
raise @ file "clib/exninfo.ml", line 65, characters 2-15

Jason Gross (Jun 02 2020 at 21:15):

And the line information is correct, I believe; it reports the line number where the sentence begins, and this is saying that it occurs in the sentence starting on line 1. (There's probably a bug to be reported here, though.)

Emilio Jesús Gallego Arias (Jun 03 2020 at 00:11):

This deserves a double bug report, first for the lost backtrace, second for the stack overflow

Emilio Jesús Gallego Arias (Jun 03 2020 at 00:11):

the stack overflow happens in

Emilio Jesús Gallego Arias (Jun 03 2020 at 00:12):

Raised by primitive operation at file "list.ml", line 180, characters 12-23
Called from file "parsing/cLexer.ml", line 401, characters 25-47
Called from file "parsing/cLexer.ml", line 665, characters 19-30
Called from file "parsing/cLexer.ml", line 819, characters 26-48
Called from file "stream.ml", line 86, characters 21-35
Called from file "stream.ml" (inlined), line 95, characters 14-25
Called from file "gramlib/grammar.ml", line 1390, characters 10-26
Called from file "gramlib/grammar.ml", line 1202, characters 28-37
Called from file "gramlib/grammar.ml", line 1423, characters 27-36
Called from file "gramlib/grammar.ml", line 1202, characters 28-37
Called from file "gramlib/grammar.ml", line 1423, characters 27-36
Called from file "gramlib/grammar.ml", line 1139, characters 10-17
Called from file "gramlib/grammar.ml" (inlined), line 1267, characters 15-61
Called from file "gramlib/grammar.ml", line 1269, characters 23-37
Called from file "gramlib/grammar.ml", line 1274, characters 17-31
Called from file "gramlib/grammar.ml", line 1175, characters 21-30
Called from file "gramlib/grammar.ml", line 1423, characters 27-36
Called from file "gramlib/grammar.ml", line 1175, characters 21-30
Called from file "gramlib/grammar.ml", line 1224, characters 6-15
Called from file "gramlib/grammar.ml" (inlined), line 1198, characters 19-57
Called from file "gramlib/grammar.ml", line 1205, characters 29-43
Called from file "gramlib/grammar.ml", line 1224, characters 6-15

Emilio Jesús Gallego Arias (Jun 03 2020 at 00:15):

So indeed that's deep in the parser mutual loop

Anton Trunov (Jun 03 2020 at 12:07):

Thank you for looking into that. Reported as https://github.com/coq/coq/issues/12438

Last updated: Apr 19 2024 at 13:02 UTC