Hi, I experimented with adding a notion of glued token to gramlib so that we can force some sequences to be parsed without a space in between. Possible applications are `@qualid`, `t%scope`, `{|` or `[|`, which then no longer need to be lexed in one go. Conversely, we can force a space between some tokens, as e.g. in the application `f t`, with perspectives such as:
- `f@t` or `f @ t` as concatenation, but `f @qualid` as application to a reference with maximal implicit arguments deactivated;
- `f-1` or `f - 1` as subtraction, but `f -1` as application to the negative number `-1`.
This is still experimental and hackish. My message is to ask for preliminary feedback.
Currently, I do as follows:
- rules such as `c = term; "%"; key = GLUED IDENT -> ...`;
- a `glued` flag attached to each node (i.e. to `Stoken`, `Snterm`, etc.);
- `Rule.next` takes an extra optional boolean (`true` means glued, `false` means a space is required, `None` means it does not matter);
- `LStream.peek` compares the location of the peeked token with that of the previous one.

I like the "restrictions" for e.g. `@qualid`, but I'm not sure the extra allowances could not be confusing.
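The location-comparison check described above can be sketched as follows. This is a minimal illustrative model, not gramlib's actual OCaml API: `Token`, `glue_ok`, and the integer character offsets are hypothetical stand-ins for the locations the lexer records, and the tri-state boolean mirrors the one passed to `Rule.next`.

```python
# Hypothetical sketch of a glue check: compare the start of the peeked
# token with the end of the previous one.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str
    start: int  # character offset where the token begins
    stop: int   # character offset just past the token

def glue_ok(prev: Token, nxt: Token, glued: Optional[bool]) -> bool:
    """glued=True: tokens must touch; glued=False: a space is required;
    glued=None: spacing does not matter."""
    if glued is None:
        return True
    touching = nxt.start == prev.stop
    return touching if glued else not touching

# "t%scope": '%' must be glued to the term on its left,
# and the scope key must be glued to '%'.
t = Token("t", 0, 1)
percent = Token("%", 1, 2)
scope = Token("scope", 2, 7)
print(glue_ok(t, percent, True))       # True: '%' touches 't'
print(glue_ok(percent, scope, True))   # True: 'scope' touches '%'

spaced = Token("scope", 3, 8)          # as in "t% scope"
print(glue_ok(percent, spaced, True))  # False: glue violated
```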
Would gluing be opt-in, or could it replace existing keywords? Getting this, even without new language features, would be a nice win.
Either way, are you able to distinguish between the Ltac `foo ; [| bar]` and e.g. a notation like `Notation "[| P |]" := (only_provable P) (format "[| P |]").`, where `[|` is a glued token sequence?
I think scannerless GLR should be able to deal with this ambiguity; I wonder if/how much we'd lose here under LL(1). I guess my example might be easy, because Ltac expressions and terms don't overlap, while ambiguity in term notations would remain a problem that users must deal with. But hopefully this is still an improvement.
@Matthieu Sozeau: I mentioned `@` and `-` because their treatment is currently ad hoc and unsatisfactory. This is why I'm investigating whether another, simpler and maybe more natural model is possible.
Shortly: to disambiguate between `f (-1)` and the binary `f - 1`, we currently have to put `-1` at a level higher than application, which contradicts the intuition that `-1` is no less atomic than `1` or `0x1`, etc. Similarly for `f @qid`, which leads to adding `@qid` to the `constr` level in an ad hoc way. Also, the fact that `-1` is currently not atomic makes the introduction of negative numerals in custom notations ad hoc in turn.
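Under space-sensitive rules like the ones proposed above, the three spellings could be told apart purely from token adjacency. A toy sketch, with a hypothetical `classify` function standing in for the parser's decision (the spacing convention it encodes is the one from the proposal, not current Coq behavior):

```python
# Hypothetical sketch: classify "f", "-", "1" purely by spacing.
# Glued on both sides, or spaced on both sides: subtraction.
# '-' glued only to the number: application of f to the literal -1.
import re

def classify(src: str) -> str:
    m = re.fullmatch(r"f( ?)-( ?)1", src)
    if m is None:
        return "unrecognised"
    space_before = m.group(1) == " "
    space_after = m.group(2) == " "
    if space_before and not space_after:
        return "application: f (-1)"
    if space_before == space_after:
        return "subtraction: f - 1"
    return "unrecognised"  # "f- 1": neither reading applies

print(classify("f-1"))    # subtraction: f - 1
print(classify("f - 1"))  # subtraction: f - 1
print(classify("f -1"))   # application: f (-1)
```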
Ah, thank you Hugo for the explanation, now I get it. I'm just surprised by `f@t` being different from `f @t`. That's due to application not requiring a space, right?
I guess I'm so used to the current rules that it's hard to imagine another behavior, even if it might be more uniform and easier to learn.
> That's due to application not requiring a space, right?

Currently, application does not require a space. You can write `f#` for instance if `#` is a notation. To distinguish between `f@t` (the binary symbol `@`) and `f @t` (deactivation of maximal implicit arguments), yes, this would indeed mean expecting a space between the function and its argument.
That's sensible I think. Agda does it that way IIRC.
@Paolo Giarrusso:
> I think scannerless GLR should be able to deal with this ambiguity, I wonder if/how much we'd lose here under LL(1).
I'm not well aware of the literature, and I may say something wrong, but it seems to me that the question of being able to parse `[|` either as a single token or as two tokens (depending on the context) is related to the question of having separate phases for lexing and parsing, and in particular to not losing, at parsing time, the information that `[|` also has a valid tokenisation as two tokens.
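To illustrate the idea of not committing at lexing time, here is a toy sketch in Python (hypothetical, unrelated to gramlib's implementation): the lexer returns every tokenisation, splitting `[|` both ways, and a parser working on the resulting set could keep both readings alive until context selects one.

```python
# Hypothetical sketch: enumerate all tokenisations of a space-separated
# input, where '[|' may be read either as one glued token or as '[' '|'.
def tokenisations(src: str) -> list:
    results = [[]]
    for word in src.split():
        if word == "[|":
            # Fork every partial result on the two readings of '[|'.
            results = [r + alt for r in results
                       for alt in (["[|"], ["[", "|"])]
        else:
            results = [r + [word] for r in results]
    return results

# Both readings of the ambiguous example survive lexing; the parser
# (an Ltac bracket vs a term notation) would pick the valid one.
print(tokenisations("foo ; [| bar ]"))
# [['foo', ';', '[|', 'bar', ']'], ['foo', ';', '[', '|', 'bar', ']']]
```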
The question of moving to another parsing engine is a real one. In particular, GLR would be more powerful. I also know that there are parsing engines which merge the lexing and parsing phases. Do some of them combine all of GLR, lexing/parsing fusion and extensibility? Otherwise, the current parsing engine, despite its limitations, has the advantage of being relatively simple and fully under our control, and we can make it evolve according to what we need.
Last updated: Jun 09 2023 at 07:01 UTC