Hi, I experimented with adding a notion of glued token to gramlib so that we can force some sequences to be parsed without a space in between. Possible applications are `@qualid`, `t%scope`, `{|` or `[|`, which then no longer need to be lexed in one go. Conversely, we can force a space between some tokens, as e.g. in the application `f t`, with perspectives such as:
- `f@t` or `f @ t` as concatenation, but `f @qualid` as application to a reference with maximal implicit arguments deactivated;
- `f-1` or `f - 1` as subtraction, but `f -1` as application to the negative number `-1`.
This is still experimental and hackish. My message is to ask for preliminary feedback.
Currently, I do as follows:
- rules such as `c = term; "%"; key = GLUED IDENT -> ...`;
- a `glued` flag attached to each node (i.e. to `Stoken`, `Snterm`, etc.);
- `Rule.next` takes an extra optional boolean (`true` means glued, `false` means a space is required, `None` means it does not matter);
- `LStream.peek` compares the location of the peeked token with that of the previous one.

I like the "restrictions" for e.g. `@qualid`, but I'm not sure the extra allowances could not be confusing.
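The location-comparison check described above can be sketched as follows. This is a minimal illustrative model, not gramlib's actual OCaml API: `Token`, `glue_ok`, and the integer character offsets are hypothetical stand-ins for the locations the lexer records, and the tri-state boolean mirrors the one passed to `Rule.next`.

```python
# Hypothetical sketch of a glue check: compare the start of the peeked
# token with the end of the previous one.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str
    start: int  # character offset where the token begins
    stop: int   # character offset just past the token

def glue_ok(prev: Token, nxt: Token, glued: Optional[bool]) -> bool:
    """glued=True: tokens must touch; glued=False: a space is required;
    glued=None: spacing does not matter."""
    if glued is None:
        return True
    touching = nxt.start == prev.stop
    return touching if glued else not touching

# "t%scope": '%' must be glued to the term on its left,
# and the scope key must be glued to '%'.
t = Token("t", 0, 1)
percent = Token("%", 1, 2)
scope = Token("scope", 2, 7)
print(glue_ok(t, percent, True))       # True: '%' touches 't'
print(glue_ok(percent, scope, True))   # True: 'scope' touches '%'

spaced = Token("scope", 3, 8)          # as in "t% scope"
print(glue_ok(percent, spaced, True))  # False: glue violated
```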
Would gluing be opt-in, or could it replace existing keywords? Getting this, even without new language features, would be a nice win.
Either way, are you able to distinguish between the Ltac `foo ; [| bar]` and e.g. a notation like `Notation "[| P |]" := (only_provable P) (format "[| P |]").`, where `[|` is a glued token sequence?
I think scannerless GLR should be able to deal with this ambiguity; I wonder if/how much we'd lose here under LL(1). I guess my example might be easy, because Ltac expressions and terms don't overlap, while ambiguity in term notations would remain a problem that users must deal with. But hopefully this is still an improvement.
@Matthieu Sozeau: I mentioned `@` and `-` because their treatment is currently ad hoc and unsatisfactory. This is why I'm investigating whether another, simpler and maybe more natural model is possible.
Shortly: to disambiguate between `f (-1)` and the binary `f - 1`, we currently have to put `-1` at a level higher than application, which contradicts the intuition that `-1` is no less atomic than `1` or `0x1`, etc. Similarly for `f @qid`, which leads to adding `@qid` to the `constr` level in an ad hoc way. Also, the fact that `-1` is currently not atomic makes the introduction of negative numerals in custom notations ad hoc in turn.
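Under space-sensitive rules like the ones proposed above, the three spellings could be told apart purely from token adjacency. A toy sketch, with a hypothetical `classify` function standing in for the parser's decision (the spacing convention it encodes is the one from the proposal, not current Coq behavior):

```python
# Hypothetical sketch: classify "f", "-", "1" purely by spacing.
# Glued on both sides, or spaced on both sides: subtraction.
# '-' glued only to the number: application of f to the literal -1.
import re

def classify(src: str) -> str:
    m = re.fullmatch(r"f( ?)-( ?)1", src)
    if m is None:
        return "unrecognised"
    space_before = m.group(1) == " "
    space_after = m.group(2) == " "
    if space_before and not space_after:
        return "application: f (-1)"
    if space_before == space_after:
        return "subtraction: f - 1"
    return "unrecognised"  # "f- 1": neither reading applies

print(classify("f-1"))    # subtraction: f - 1
print(classify("f - 1"))  # subtraction: f - 1
print(classify("f -1"))   # application: f (-1)
```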
Ah, thank you Hugo for the explanation, now I get it. I'm just surprised by `f@t` being different from `f @t`. That's due to application not requiring a space, right?
I guess I'm so used to the current rules that it's hard to imagine another behavior, even if it might be more uniform and easier to learn.
> That's due to application not requiring a space, right?

Currently, application does not require a space. You can write `f#` for instance if `#` is a notation. To distinguish between `f@t` (the binary symbol `@`) and `f @t` (deactivation of maximal implicit arguments), yes, this would indeed mean expecting a space between the function and its argument.
That's sensible I think. Agda does it that way IIRC.
@Paolo Giarrusso:
> I think scannerless GLR should be able to deal with this ambiguity, I wonder if/how much we'd lose here under LL(1).
I'm not well aware of the literature, and I may say something wrong, but it seems to me that the question of being able to parse `[|` either as a single token or as two tokens (depending on the context) is related to the question of having separate phases for lexing and parsing, and in particular to not losing, at parsing time, the information that `[|` also has a valid tokenisation as two tokens.
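To illustrate the idea of not committing at lexing time, here is a toy sketch in Python (hypothetical, unrelated to gramlib's implementation): the lexer returns every tokenisation, splitting `[|` both ways, and a parser working on the resulting set could keep both readings alive until context selects one.

```python
# Hypothetical sketch: enumerate all tokenisations of a space-separated
# input, where '[|' may be read either as one glued token or as '[' '|'.
def tokenisations(src: str) -> list:
    results = [[]]
    for word in src.split():
        if word == "[|":
            # Fork every partial result on the two readings of '[|'.
            results = [r + alt for r in results
                       for alt in (["[|"], ["[", "|"])]
        else:
            results = [r + [word] for r in results]
    return results

# Both readings of the ambiguous example survive lexing; the parser
# (an Ltac bracket vs a term notation) would pick the valid one.
print(tokenisations("foo ; [| bar ]"))
# [['foo', ';', '[|', 'bar', ']'], ['foo', ';', '[', '|', 'bar', ']']]
```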
The question of moving to another parsing engine is a real one. In particular, GLR would be more powerful. I also know that there are parsing engines which merge the lexing and parsing phases. Do some of them combine all of GLR, lexing/parsing fusion and extensibility? Otherwise, the current parsing engine, despite its limitations, has the advantage of being relatively simple and fully under our control, and we can make it evolve according to what we need.
Last updated: Jun 09 2023 at 07:01 UTC