I was trying to understand how strings are represented in Coq. Since a string consists of a list of Ascii.ascii values, and each Ascii.ascii only has space for 8 bits, it can't represent most values in Unicode (right?).
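To see that concretely, Print shows the definition (the exact printed form may vary a bit between Coq versions):
Require Import Ascii.
Print Ascii.ascii.
(* roughly:
   Inductive ascii : Set := Ascii (b0 b1 b2 b3 b4 b5 b6 b7 : bool)
   i.e. exactly eight booleans, one per bit *)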
The representation of the character 'é', which has ASCII value 233, looks like this:
Unset Printing Notations.
Compute "é"%string.
(*
  = String (Ascii.Ascii true true false false false false true true)      (* 11000011 *)
      (String (Ascii.Ascii true false false true false true false true)   (* 10101001 *)
        EmptyString)
  : string
*)
Here, the representation of 'é' seems to consist of a concatenation of the letters 'Ã' and '©':
Ascii.Ascii true true false false false false true true
: 'Ã'
Ascii.Ascii true false false true false true false true
: '©'
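Checking the numeric values of those two bytes with Ascii.nat_of_ascii (from the standard library) gives:
Require Import Ascii.
Compute Ascii.nat_of_ascii (Ascii.Ascii true true false false false false true true).
(* = 195 : nat, i.e. 0xC3 *)
Compute Ascii.nat_of_ascii (Ascii.Ascii true false false true false true false true).
(* = 169 : nat, i.e. 0xA9 *)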
But then how would a string "Ã©" be made?
Compute "é"%string.
(*
= String (Ascii.Ascii true true false false false false true true) (* 11000011 *)
(String (Ascii.Ascii true true false false false false false true) (* 10000011 *)
(String
(Ascii.Ascii false true false false false false true true) (* 11000010 *)
(String
(Ascii.Ascii true false false true false true false true) (* 10101001 *)
EmptyString)))
: string
*)
This seems to consist of 4 characters instead of 2, and the first Ascii.ascii here has the same value as the first one in the 'é' example.
Why is the string represented like this? What rules govern the string building process?
Also, since an Ascii.Ascii has 8 bits available, and a character like é can itself be represented in 8 bits, why are two Ascii.Ascii used for it instead of just 1?
There are many ways to encode characters into bytes in general.
For instance, the letter é might have "value 233" in some character encoding, but it won't be ASCII, since that one only has character values up to 127.
I'd expect Coq to use the system encoding, which in your example could be UTF-8: that encoding can represent every Unicode value as a sequence of bytes, where each code point consumes 1 or more bytes.
Indeed, my system uses UTF-8 and gives the same result for é. Meanwhile, 233 encodes é in the Latin-1 (or ISO-8859-1) encoding, a common one in Western Europe before the switch to UTF-8.
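To sketch where the two bytes come from: é is code point U+00E9, i.e. 233, and in UTF-8 every code point between U+0080 and U+07FF is encoded as two bytes of the form 110xxxxx 10yyyyyy. Splitting 233 into its high 5 bits and low 6 bits is just a back-of-the-envelope check, nothing Coq-specific:
Require Import NArith.
(* high 5 bits of 233 go into 110xxxxx, low 6 bits into 10yyyyyy *)
Compute (N.lor 192 (N.shiftr 233 6), N.lor 128 (N.land 233 63))%N.
(* evaluates to (195, 169), i.e. (0xC3, 0xA9) -- the two bytes shown above *)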
The 2nd example seems to agree: your system (and Coq) is using UTF8 but you're decoding using Latin1. Or maybe Windows 1252 since it "extends" Latin1. All your characters appear in this Latin1 table: https://cs.stanford.edu/people/miles/iso8859.html.
I think Coq only understands Unicode; the system encoding matters only for your editor
Well, string is clearly using the UTF-8 encoding, which I'd presume must agree with the system encoding.
E.g. https://github.com/coq/coq/blob/351a58ef8549c21cc56df2cc308a9e2458dca80e/interp/notation.ml#L499: is_utf8 determines whether a string gets printed as a string notation or not. Compare:
Require Import String.
Open Scope string_scope.

Check String (Ascii.Ascii true true false false false false true true)      (* 11000011 *)
        (String (Ascii.Ascii true false false true false true false true)   (* 10101001 *)
          "").
(* "é" *)

(* invalid unicode: 10000000 and 10000001 are continuation bytes with no lead byte *)
Check String (Ascii.Ascii false false false false false false false true)
        (String (Ascii.Ascii true false false false false false false true) "").
(*
  String (Ascii.Ascii false false false false false false false true)
    (String (Ascii.Ascii true false false false false false false true) "")
*)
ah, so Coq straight out doesn't work with a non-UTF8 system encoding? Fun.
(tho I guess it's a reasonable compromise these days)
"ascii" should really not use that name though, they're just bytes
yep — string is also arguably misnamed, since it's really a sequence of bytes not of "characters"
e.g.
Compute (String.length "©").
= 2
: nat
not that the concept of "Unicode character" is really useful if you're trying to process Unicode text correctly (the closest thing is a "basic grapheme cluster" but it's complicated).
Paolo Giarrusso said:
yep — string is also arguably misnamed, since it's really a sequence of bytes not of "characters"
but string does not mean "sequence of characters", it means "something that works with the " syntax"
your example only shows that String.length could have a more explicit name (eg byte_length)
many languages' stdlibs distinguish "sequence of bytes" from "sequence of characters" and reserve "string" for the latter.
or at least, that's how things go on the JVM
OTOH, I'm not sure you need the distinction if you default to UTF8 everywhere, which is probably a good idea.
So does Coq always use the UTF-8 encoding for string values?
And could this module be relevant? https://coq.inria.fr/library/Coq.Unicode.Utf8.html
That module just defines some Unicode analogues of Coq notations — ≤ instead of <=.
https://github.com/coq/coq/blob/master/theories/Unicode/Utf8.v https://github.com/coq/coq/blob/master/theories/Unicode/Utf8_core.v
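e.g. (a minimal example):
Require Import Utf8.
Check (∀ n : nat, n ≤ n + 1).
(* ∀ and ≤ here are just notations for forall and le from those Utf8 files *)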
Oh.. Now it seems obvious. :sweat_smile: Should've taken a closer look at those files. Thanks.
But why does the representation of 'é' look the same as a concatenation of the representations of 'Ã' and '©', though? And why does it take two Ascii.ascii values to represent 'é' instead of one, even though its value is less than 256?
I guess this has something to do with the way UTF-8 works, but I couldn't figure it out.
And why does it take two Ascii.ascii values to represent 'é' instead of one even though its value is less than 256?
because its value is 2 bytes in unicode
unicode only coincides with ascii for values up to 127
the unicode of é is the same as the ascii of é
what you call unicode is UTF8, and ascii is probably ISO 8859-1 (or some variant)
unicode only coincides with ascii for values up to 127
what you call unicode is UTF8,
Oh.. That makes sense...
So, 'é' is 0xc3a9 in UTF-8, so it needs two bytes.
'Ã' is 0xc383 and '©' is 0xc2a9, so they need 2 bytes each as well.
So "Ã©" needs 2 + 2 = 4 bytes.
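(A quick way to double-check those byte counts, assuming the source file is saved as UTF-8, is String.length, since it counts bytes:)
Require Import String.
Open Scope string_scope.
Compute String.length "é".   (* = 2 : nat *)
Compute String.length "Ã©".  (* = 4 : nat *)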
I think I get it now. Thanks a lot everyone!
Julin S has marked this topic as resolved.
Does anyone know the part of the Coq reference manual that mentions strings being encoded in UTF-8 by default?
I've looked for it, and found nothing.
Also, I wonder whether it's even just a default: the code linked earlier seemed unconditional, and coqc does not have an option to change the encoding (tho it could in theory support reading the LC_... environment variables).
It'd probably be worth documenting this somewhere around https://coq.inria.fr/refman/language/core/basic.html#lexical-conventions. In particular, "Identifiers" already mentions support for Unicode letters (tho that's orthogonal to the file encoding):
unicode_letter non-exhaustively includes Latin, Greek, Gothic, Cyrillic, Arabic, Hebrew, Georgian, Hangul, Hiragana and Katakana characters, CJK ideographs, mathematical letter-like symbols and non-breaking space. unicode_id_part non-exhaustively includes symbols for prime letters and subscripts.
(I'd suspect, though I'm not sure, that unicode_letter means "whatever some Unicode version declares as a letter".)
I think string notations are not UTF-8, they are just the raw bytes between the opening " and the closing ".
however they only get printed as string notations when they're valid utf8
so it seems that Coq only _supports_ UTF8, even if lots of it might be somewhat encoding-agnostic (with no guarantees?)
Looks like a PR adding this precision to the documentation of String Notation
would be welcome.
I think it's rather that the feature does the minimum effort to work in the simplest case (ascii strings) and everything beyond that is an unintended side effect.
the code that Gaëtan pointed to supports UTF8, not just ASCII.