Stream: Coq users

Topic: Regex representation


view this post on Zulip Julin S (Oct 29 2022 at 05:27):

I was trying to represent regular expressions over strings as propositions.

Did this:

Require Import String.
Open Scope string.

Definition char := string.
Inductive re : Type :=
| Atom : char -> re
| Concat : re -> re -> re
| Alt : re -> re -> re
| Dot : re.

(* Indexing starts from zero. Empty string on invalid index. *)
Definition stridx (i:nat) (s:string) : char :=
  String.substring i 1 s.
Definition strdrop (n:nat) (s:string) : string :=
  String.substring n ((String.length s)-1) s.

Fixpoint reDenote (r:re) : string -> Prop :=
  match r with
  | Atom ch => fun s:string => stridx 0 s = ch
  | Dot => fun s:string => stridx 0 s = stridx 0 s
  | Concat r1 r2 => fun s:string =>
      ((reDenote r1) s) /\ ((reDenote r2) (strdrop 1 s))
  | Alt r1 r2 => fun s:string =>
      ((reDenote r1) s) \/ ((reDenote r2) s)
  end.

This looks okay:

Infix ":-" := Concat (at level 51, right associativity).

Compute reDenote ((Atom "h") :- (Atom "i") :- (Atom "y") :- (Atom "a")) "hiya".
(*
  = "h" = "h" /\ "i" = "i" /\ "y" = "y" /\ "a" = "a"
  : Prop
*)

But how can we add zero or more number of repetitions? Kleene star thing?

I was thinking to have a constructor like :

Star : re -> re

But how can we make it a proposition? I mean, it's like a possible infinite stream, but needn't be.

view this post on Zulip Julien Puydt (Oct 29 2022 at 05:38):

Shouldn't the Inductive re contain all building blocks for regex, including star and repetitions?

view this post on Zulip Julin S (Oct 29 2022 at 05:41):

Yeah, but I was stuck at Star. I got like:

Inductive re : Type :=
| Atom : char -> re
| Concat : re -> re -> re
| Alt : re -> re -> re
| Star : re -> re
| Dot : re
| Epsilon : re.

But couldn't get the translation to Prop right.

Fixpoint reDenote (r:re) : string -> Prop :=
  match r with
  | Atom ch => fun s:string => stridx 0 s = ch
  | Dot => fun s:string => stridx 0 s = stridx 0 s
  | Epsilon => fun s:string => True
  | Star r => fun s:string =>
      reDenote (Alt (Concat r (Star r)) Epsilon) s
  | Concat r1 r2 => fun s:string =>
      ((reDenote r1) s) /\ ((reDenote r2) (strdrop 1 s))
  | Alt r1 r2 => fun s:string =>
      ((reDenote r1) s) \/ ((reDenote r2) s)
  end.

The argument to fixpoint is not getting smaller for Star.

view this post on Zulip Julin S (Oct 29 2022 at 05:41):

Error:

Error:
Recursive definition of reDenote is ill-formed.
In environment
reDenote : re -> string -> Prop
r : re
r0 : re
s : string
Recursive call to reDenote has principal argument equal to
"Alt (Concat r0 (Star r0)) Epsilon" instead of
"r0".

view this post on Zulip Julien Puydt (Oct 29 2022 at 05:55):

I don't get what this function is supposed to do

view this post on Zulip Julien Puydt (Oct 29 2022 at 05:57):

Oh, it's supposed to check if the string satisfies the regex?

view this post on Zulip Julien Puydt (Oct 29 2022 at 05:58):

Then even without star, I don't think that worked : the Concat branch looks highly suspicious.

view this post on Zulip Julien Puydt (Oct 29 2022 at 05:59):

And by the way, if you're actually computing a truth value, it should definitely give a bool -- Prop is for more abstract matters.

view this post on Zulip Julien Puydt (Oct 29 2022 at 06:00):

As far as I know (I'm a mathematician, not a computer scientist), the usual way to go with regex is:

  1. compile the regex to some other form (finite automata) ;
  2. use the compiled form to check strings.

view this post on Zulip Julin S (Oct 29 2022 at 06:36):

Julien Puydt said:

Oh, it's supposed to check if the string satisfies the regex?

Yeah, that was what I was after.

view this post on Zulip Julin S (Oct 29 2022 at 06:37):

Julien Puydt said:

Then even without star, I don't think that worked : the Concat branch looks highly suspicious.

Why? Is it because of the strdrop?

view this post on Zulip Julin S (Oct 29 2022 at 06:38):

Julien Puydt said:

As far as I know (I'm a mathematician, not a computer scientist), the usual way to go with regex is:

  1. compile the regex to some other form (finite automata) ;
  2. use the compiled form to check strings.

If the regex can be made as a re value, it would be like having the automata, right?

view this post on Zulip Julien Puydt (Oct 29 2022 at 06:56):

The strdroponly lets a single character go... that means you're only able to recognize something trivial first then something else

view this post on Zulip Julien Puydt (Oct 29 2022 at 06:58):

Sure, you can do something like:

Definition re_check (r: re) (s: string) := automata_check (automata_of_re re) s.

but you'll still need the automata.

And the reason people don't do it like this is performance: if you want to check many strings with the same regex, you better not recompute the automata each time!

view this post on Zulip Pierre Castéran (Oct 29 2022 at 07:59):

If you want to define a proposition "the string s satisfies the regexp r", you may write an inductive definition, like:

Inductive reDenote : re -> string -> Prop :=
| Atom_denote ch : reDenote (Atom ch) (String ch EmptyString)
| Dot_denote  : forall ch, reDenote Dot (String ch EmptyString)
| Epsilon_denote  : reDenote Epsilon EmptyString
| Star_denote r : forall s,
    reDenote (Alt (Concat r (Star r)) Epsilon) s ->
    reDenote (Star r) s
| Concat_denote r1 r2: forall s1 s2 s, reDenote r1 s1 ->
                                       reDenote r2 s2 ->
                                       s = append s1 s2 ->
                                       reDenote (Concat r1 r2) s
| Alt_denote  r1 r2: forall s, (reDenote r1 s \/ reDenote r2 s) ->
                               reDenote (Alt r1 r2) s.

(to be verified and/or improved)

Then, you may define recognizers for reg-exp (maybe with deterministic automata) and prove they are correct w.r.t. this definition.

Apart from the fixpoint issue, I didn't understand some definitions like Definition char := string., and the clauses for Atom, Dot, Epsilonand Concat(problems with string length (should be 1 for Atom, Dot, 0 for Epsilon, ... ?)

view this post on Zulip Julien Puydt (Oct 29 2022 at 08:26):

@Pierre Castéran I don't get how forallappears in this context... I wouldn't have expected universal quantification there, but existential.

view this post on Zulip Pierre Castéran (Oct 29 2022 at 09:32):

Julien Puydt said:

Pierre Castéran I don't get how forallappears in this context... I wouldn't have expected universal quantification there, but existential.

Which variables do you would existentially quantify? For instance, s1and s2in Concat_denote? I think it would introduce in the definition some more connectives.

Although it doesn't prove the correctness of my definition, here is a small example.

Goal reDenote (Concat Dot (Atom  one))
              (String zero (String one  EmptyString)).
  econstructor 5.
  constructor 2.
  constructor 1.
  reflexivity.
Qed.

Btw, I like the Prolog style too.

(** reRest r s s':  one can  scan a prefix of s which matches r,
    and s' is the remaining suffix *)

Inductive reRest : re -> string -> string -> Prop :=
| reRest1 ch : forall s, reRest  (Atom ch) (String ch s) s
| reRest2  : forall ch s, reRest Dot (String ch s) s
| reRest3 : forall s, reRest Epsilon s s
| reRest4 r : forall s, reRest (Star r) s s
| reRest5 r : forall s s' s'', reRest r s s' ->
                           reRest (Star r) s' s'' ->
                           reRest (Star r) s s''
| reRest6 r1 r2:  forall s s1 s2 , reRest r1 s s1 ->
                                   reRest r2 s1 s2 ->
                                   reRest (Concat r1 r2) s s2

| reRest7 r1 r2 : forall s s', reRest r1 s s' ->
                               reRest (Alt r1 r2) s s'
| reRest8 r1 r2 : forall s s', reRest r2 s s' ->
                               reRest (Alt r1 r2) s s'.

Definition reDenote' r s := reRest r s EmptyString.

view this post on Zulip Julien Puydt (Oct 29 2022 at 09:45):

Yes, I would ; but I think I wasn't looking at it right : I considered that to prove reDenote (Concat r1 r2) s one needs to prove there exists s1 s2 ; while what you have expressed is that all those implications work -- you're not forcing to use all of them, any will do! So it's correct and I get your point.

view this post on Zulip Gaëtan Gilbert (Oct 29 2022 at 11:20):

remember the definition of exists:

Inductive ex (A:Type) (P:A -> Prop) : Prop :=
  ex_intro : forall x:A, P x -> ex A P.

ie forall in a constructor type makes the inductive an existential

view this post on Zulip Karl Palmskog (Oct 29 2022 at 11:21):

for validation that your regexp representation is correct, I recommend proving it equivalent to the one here: https://github.com/coq-community/reglang

view this post on Zulip Julin S (Oct 30 2022 at 15:56):

I was also trying to understand the regular expressions done in https://github.com/coq-contribs/regexp

I tried

Compute matches (And (Char "a") (Char "b")) "ab"%string.

But that gave false.

I thought matches (type: RegExp -> String -> bool) meant that it would take a regex and string and would say if the regex matched that string or not. Isn't that so?

view this post on Zulip Gaëtan Gilbert (Oct 30 2022 at 15:57):

you're thinking of Cat not And
https://github.com/coq-contribs/regexp/blob/master/Definitions.v#L2-L10

view this post on Zulip Julin S (Oct 30 2022 at 16:12):

Oh, right. :woozy_face:

view this post on Zulip Julin S (Nov 03 2022 at 10:15):

Pierre Castéran said:

Apart from the fixpoint issue, I didn't understand some definitions like Definition char := string.

I was trying to have a representation for a single character string (but of course with nothing to stop it from being more than one character). I didn't really know how to do it. I see that Ascii.ascii is better choice.

view this post on Zulip Julin S (Nov 03 2022 at 10:16):


And in

| Atom_denote ch : reDenote (Atom ch) (String ch EmptyString)
| Dot_denote  : forall ch, reDenote Dot (String ch EmptyString)

Is there any significance in having ch of Atom_denote 'inside' when compared to that of Dot_denote where it's 'outside'?

view this post on Zulip Julin S (Nov 03 2022 at 10:25):


And does having Concat_denote like:

| ConcatDe r1 r2: forall s1 s2:string,
       reDenote r1 s1 -> reDenote r2 s2
    -> reDenote (Concat r1 r2) (append s1 s2)

have any disadvantage?

view this post on Zulip Gaëtan Gilbert (Nov 03 2022 at 10:36):

Is there any significance in having ch of Atom_denote 'inside' when compared to that of Dot_denote where it's 'outside'?

absolutely 0 significance, it's just syntax sugar

view this post on Zulip Gaëtan Gilbert (Nov 03 2022 at 10:36):

And does having Concat_denote like:

as opposed to what?

view this post on Zulip Julin S (Nov 03 2022 at 10:42):

Gaëtan Gilbert said:

And does having Concat_denote like:

as opposed to what?

Instead of

| Concat_denote r1 r2: forall s1 s2 s, reDenote r1 s1 ->
                                       reDenote r2 s2 ->
                                       s = append s1 s2 ->
                                       reDenote (Concat r1 r2) s

view this post on Zulip Gaëtan Gilbert (Nov 03 2022 at 10:57):

not sure what the difference in practice would be
probably not very impactful?

view this post on Zulip Julin S (Nov 03 2022 at 11:16):

I got this:

Require Import Ascii String.

Inductive re : Set :=
| Atom: Ascii.ascii -> re
| Epsilon: re
| Dot: re
| Concat: re -> re -> re
| Alt: re -> re -> re
| Star: re -> re.

Inductive reDenote : re -> string -> Prop :=
| AtomDe: forall ch:Ascii.ascii,
    reDenote (Atom ch) (String ch EmptyString)
| EpsilonDe: reDenote Epsilon EmptyString
| DotDe: forall ch:Ascii.ascii,
    reDenote Dot (String ch EmptyString)
| ConcatDe r1 r2: forall s1 s2 s:string,
       reDenote r1 s1 -> reDenote r2 s2 -> s = append s1 s2
    -> reDenote (Concat r1 r2) s
| AltDe r1 r2: forall s:string,
       reDenote r1 s \/ reDenote r2 s
    -> reDenote (Alt r1 r2) s
| StarDe r: forall s:string,
       reDenote (Alt (Concat r (Star r)) Epsilon) s
    -> reDenote (Star r) s.

view this post on Zulip Julin S (Nov 03 2022 at 11:17):

I had been trying to prove that the string "a" matches the regex "a*" with:

Goal
  reDenote (Star (Atom "a"%char)) "a"%string.
Proof.
  refine (StarDe _ _ _).
  refine (AltDe _ _ _ _).
  left.
  refine (ConcatDe _ _ _ _ _ _ _ _).

But at this point, I got a variable named ?Goal:

3 focused subgoals
(shelved: 2) (ID 57)

  ============================
  reDenote (Atom "a") ?Goal

subgoal 2 (ID 59) is:
 reDenote (Star (Atom "a")) ?Goal0
subgoal 3 (ID 61) is:
 "a"%string = (?Goal ++ ?Goal0)%string

What does that mean? That the goal itself is still unknown?

view this post on Zulip Gaëtan Gilbert (Nov 03 2022 at 11:31):

it's an existential variable https://coq.github.io/doc/master/refman/language/extensions/evars.html#automatic-resolution-of-existential-variables

view this post on Zulip Julin S (Nov 04 2022 at 04:37):

Okay, I could complete the proof like this:

Goal
  reDenote (Star (Atom "a"%char)) "a"%string.
Proof.
  refine (StarDe _ _ _).
  refine (AltDe _ _ _ _).
  left.
  refine (ConcatDe _ _ _ _ _ _ _ _).
  - exact (AtomDe "a").
  - refine (StarDe _ _ _).
    refine (AltDe _ Epsilon ""%string _).
    right.
    exact EpsilonDe.
  - reflexivity.
Qed.

view this post on Zulip Paolo Giarrusso (Nov 04 2022 at 05:20):

What about starting with eapply StarDe ?

view this post on Zulip Pierre Castéran (Nov 04 2022 at 06:49):

In Prolog style:

(** R r s s':  one can  scan a prefix of s which matches r,
    and s' is the remaining suffix *)

Inductive R : re -> string -> string -> Prop :=
| Rat ch : forall s, R  (Atom ch) (String ch s) s
| Rdot  : forall ch s, R Dot (String ch s) s
| Reps : forall s, R Epsilon s s
| Rstar1 r : forall s s' s'', R r s s' ->
                           R (Star r) s' s'' ->
                           R (Star r) s s''
| Rstar0 r : forall s, R (Star r) s s
| Rconcat r1 r2:  forall s s1 s2 , R r1 s s1 ->
                                   R r2 s1 s2 ->
                                   R (Concat r1 r2) s s2

| Ralt1 r1 r2 : forall s s', R r1 s s' ->
                               R (Alt r1 r2) s s'
| Ralt2 r1 r2 : forall s s', R r2 s s' ->
                               R (Alt r1 r2) s s'.

Definition reDenote r s := R r s EmptyString.

Goal reDenote (Star (Atom "a"%char)) "a"%string.
  eapply Rstar1.
  apply Rat.
  apply Rstar0.
Qed.


Goal reDenote (Concat (Star (Atom  "1"%char)) (Atom "0"%char))  "110"%string .
  econstructor 6.
  econstructor 4.
  constructor 1.
  econstructor 4.
  constructor 1.
  constructor 5.
  constructor 1.
Qed.

view this post on Zulip Julin S (Nov 05 2022 at 07:04):

Paolo Giarrusso said:

What about starting with eapply StarDe ?

Using eapply certainly feels better, thanks.

Goal
  reDenote (Star (Atom "a"%char)) "a"%string.
Proof.
  eapply StarDe.
  eapply AltDe.
  left.
  eapply ConcatDe.
  - eapply AtomDe.
  - eapply StarDe.
    eapply AltDe.
    right.
    eapply EpsilonDe.
  - reflexivity.
Qed.

view this post on Zulip Julin S (Nov 05 2022 at 07:05):

Pierre Castéran said:

In Prolog style:

Goal reDenote (Concat (Star (Atom  "1"%char)) (Atom "0"%char))  "110"%string .
  econstructor 6.
  econstructor 4.
  constructor 1.
  econstructor 4.
  constructor 1.
  constructor 5.
  constructor 1.
Qed.

Is there a way to use constructor names instead of their number? Or is the number more convenient for ltac scripts?

view this post on Zulip Pierre Castéran (Nov 05 2022 at 07:47):

Yes, you can type eapply Rconcat instead of econstructor 6, or even design tactics for moving along the regular expression and the string.
I posted this definition because it shows an interesting way of using existential variables.
This technique is for instance illustrated in https://www.cs.unm.edu/~luger/ai-final2/CH8_Natural%20Language%20Processing%20in%20Prolog.pdf

If I remember well, this way of language processing was at the origin of the creation of the Prolog programming language by Alain Colmerauer.


Last updated: Feb 09 2023 at 00:03 UTC