Stream: Coq users

Topic: Utf8 strings in coq?


view this post on Zulip walker (Feb 26 2023 at 10:46):

In Coq.String.String, they say:

Strings that involve ascii characters of code >= 128 which are not part of a valid utf8 sequence of characters are not representable using the Coq string notation (use explicitly the String constructor with the ascii codes of the characters).

So how can we reason about utf8 strings (if posible at all) in coq ?

Ideally I have AST with identifiers from string, but instead I want to to work with rust strings, which is utf8 string.

The way I more from coq to rust is via this chain:
(Coq <---Extraction----> OCAML <---FFI---> Rust)

view this post on Zulip Paolo Giarrusso (Feb 26 2023 at 13:37):

No and yes: A Coq string is a sequence of bytes, and a UTF8 string is also a sequence of bytes. In some cases you can just ignore the UTF8-ness and just pass through the string. That won't work for other tasks, but can you leave those tasks to Rust in your case?

view this post on Zulip walker (Feb 26 2023 at 13:53):

but coq string is not really sequence of bytes this way, it is more of sequence of 7 bits, which makes me worried about what could go wrong if I pretend I am extracting to utf8-string ?

view this post on Zulip Gaëtan Gilbert (Feb 26 2023 at 13:57):

but coq string is not really sequence of bytes this way, it is more of sequence of 7 bits

this is a misunderstanding

view this post on Zulip walker (Feb 26 2023 at 14:01):

can you please explain more, It literally said that it cannot represent characters >= 128, that is 2**7 hence my deduction that it is sequence of 7 bits

view this post on Zulip Gaëtan Gilbert (Feb 26 2023 at 14:05):

that's just the notation

view this post on Zulip Gaëtan Gilbert (Feb 26 2023 at 14:06):

Require Import String.
Open Scope string_scope.

Check String (Ascii.Ascii true true true true true true true true) "".
(* not valid utf8: prints the same way as it was input *)

Check String (Ascii.Ascii true true false false false false true true)
  (String (Ascii.Ascii true false false true false true false true) "").
(* prints "é" *)

view this post on Zulip Julin S (Feb 26 2023 at 18:44):

I once tried to see how strings were like in coq and wrote this:

https://rawtext.club/~famubu/blog/coq-strings.html

Not sure how correct it is though.

(Plugging it here in case someone can tell more about it and/or spot errors).

view this post on Zulip Paolo Giarrusso (Feb 26 2023 at 19:01):

It's not only 255 — any code point > 127 must be encoded specially. Even the byte 128 has the first bit set: so it's not a UTF8 "full character", but it must be part of a multi-byte code point. ("Code point" is probably what you mean by "Unicode character", but the full story is more complex).

view this post on Zulip Paolo Giarrusso (Feb 26 2023 at 19:04):

ah, that's a link to a thread from you @Julin S ! https://coq.zulipchat.com/#narrow/stream/237977-Coq-users/topic/.E2.9C.94.20String.20representation.20of.20extended.20ASCII

view this post on Zulip Paolo Giarrusso (Feb 26 2023 at 19:07):

as a follow up on code points — Unicode is a complex beast, and https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ is a nice (and long!) read on some of the complexities :sweat_smile:

view this post on Zulip Paolo Giarrusso (Feb 26 2023 at 19:17):

Outside of the Unicode side, I think everything is correct, which is nice! Including the lack of C-style escapes.

However, you might want to read about String Notation in the manual... They should even allow you to implement escapes in your string notation, but I haven't tried!

view this post on Zulip walker (Feb 27 2023 at 10:08):

I see, this might be unfortunately one the things that feels like big problem in terms of interop between coq, ocaml an rust,
because many reasons among which I am not utf8 expert, I just used to use it like I use ascii and things just worked, now I need to know what I am really doing.


Last updated: Oct 13 2024 at 01:02 UTC