In Coq.Strings.String, they say:
Strings that involve ascii characters of code >= 128 which are not part of a valid utf8 sequence of characters are not representable using the Coq string notation (use explicitly the String constructor with the ascii codes of the characters).
So how can we reason about UTF-8 strings (if possible at all) in Coq?
Ideally I have an AST with identifiers as string, but instead I want to work with Rust strings, which are UTF-8 strings.
The way I move from Coq to Rust is via this chain:
(Coq <---Extraction----> OCaml <---FFI---> Rust)
No and yes: A Coq string is a sequence of bytes, and a UTF8 string is also a sequence of bytes. In some cases you can just ignore the UTF8-ness and just pass through the string. That won't work for other tasks, but can you leave those tasks to Rust in your case?
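To illustrate the "no and yes" split, here is a minimal Rust sketch (my own illustration, not from the thread) of the two kinds of tasks: byte-level work that can ignore UTF-8 entirely, and text-level work that must validate first. The byte values are assumed to be whatever the extracted Coq string produced.

```rust
fn main() {
    // Bytes as they might arrive from an extracted Coq string: "hié" in UTF-8.
    let bytes: Vec<u8> = vec![0x68, 0x69, 0xC3, 0xA9];

    // Byte-level task: length, equality, concatenation — no UTF-8 awareness needed.
    assert_eq!(bytes.len(), 4);

    // Text-level task: validate the bytes as UTF-8 before treating them as text.
    let s = std::str::from_utf8(&bytes).expect("not valid UTF-8");
    assert_eq!(s.chars().count(), 3); // 'h', 'i', 'é'
}
```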
but a Coq string is not really a sequence of bytes this way, it is more of a sequence of 7-bit values, which makes me worried about what could go wrong if I pretend I am extracting to a UTF-8 string?
but a Coq string is not really a sequence of bytes this way, it is more of a sequence of 7-bit values
this is a misunderstanding
can you please explain more? It literally says that it cannot represent characters >= 128, that is 2**7, hence my deduction that it is a sequence of 7-bit values
that's just the notation
Require Import String.
Open Scope string_scope.
Check String (Ascii.Ascii true true true true true true true true) "".
(* not valid utf8: prints the same way as it was input *)
Check String (Ascii.Ascii true true false false false false true true)
(String (Ascii.Ascii true false false true false true false true) "").
(* prints "é" *)
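On the Rust side, the same bytes behave exactly as the notation suggests — here is a small check (my own sketch, not from the thread) that the two-byte sequence from the second Coq example decodes as "é", while the all-ones byte 0xFF from the first example is rejected:

```rust
fn main() {
    // Ascii.Ascii lists bits least-significant first, so
    // (true true false false false false true true) is 0b11000011 = 0xC3
    // and (true false false true false true false true) is 0b10101001 = 0xA9.
    let s = String::from_utf8(vec![0xC3, 0xA9]).unwrap();
    assert_eq!(s, "é");

    // The first Coq example — the single byte 0xFF — is not valid UTF-8:
    assert!(String::from_utf8(vec![0xFF]).is_err());
}
```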
I once tried to see what strings were like in Coq and wrote this:
https://rawtext.club/~famubu/blog/coq-strings.html
Not sure how correct it is though.
(Plugging it here in case someone can tell more about it and/or spot errors).
This link seems dead: https://coq.gitlab.io/zulip-archive/stream/237977-Coq-users/topic/String.20representation.20of.20extended.20ASCII.html
While making an Ascii.ascii value, we can use all 8 bits to represent characters. [...] But the character with the 8-bit value 0xff (ie, 255) is incomplete in UTF-8 as characters which can be represented in just 1-byte start with 0 whereas 255's binary value starts with 1.
It's not only 255 — any code point > 127 must be encoded specially. Even the byte 128 has the first bit set: so it's not a UTF8 "full character", but it must be part of a multi-byte code point. ("Code point" is probably what you mean by "Unicode character", but the full story is more complex).
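To make that concrete, here is a small Rust demonstration (my own, not from the thread): the lone byte 128 (0x80) is a UTF-8 continuation byte and is invalid on its own, whereas the code point U+0080 is encoded as a two-byte sequence.

```rust
fn main() {
    // A lone byte 0x80 (128) has its high bit set: it is a continuation
    // byte and cannot stand alone in UTF-8.
    assert!(std::str::from_utf8(&[0x80]).is_err());

    // The code point U+0080 is instead encoded as the two bytes 0xC2 0x80.
    let mut buf = [0u8; 4];
    let encoded = '\u{80}'.encode_utf8(&mut buf);
    assert_eq!(encoded.as_bytes(), &[0xC2, 0x80]);
}
```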
ah, that's a link to a thread from you @Julin S ! https://coq.zulipchat.com/#narrow/stream/237977-Coq-users/topic/.E2.9C.94.20String.20representation.20of.20extended.20ASCII
as a follow up on code points — Unicode is a complex beast, and https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ is a nice (and long!) read on some of the complexities :sweat_smile:
Outside of the Unicode side, I think everything is correct, which is nice! Including the lack of C-style escapes.
However, you might want to read about String Notation in the manual... They should even allow you to implement escapes in your string notation, but I haven't tried!
I see, this might unfortunately be one of those things that feels like a big problem for interop between Coq, OCaml and Rust,
for many reasons, among them that I am not a UTF-8 expert. I just used to use it like ASCII and things just worked; now I need to know what I am really doing.
Last updated: Oct 13 2024 at 01:02 UTC