Stream: Coq users

Topic: ✔ String representation of extended ASCII


view this post on Zulip Julin S (Jan 02 2022 at 06:26):

I was trying to understand how strings are represented and coq and since a string consists of a list of Ascii.ascii values, and each Ascii.ascii only have space for 8 bits, it can't represent most values in unicode (right?).

Representation of the character 'é' which has ASCII' value 233, looks like

Unset Printing Notations.

Compute "é"%string.
(*
= String (Ascii.Ascii true true false false false false true true)         (* 11000011 *)
         (String (Ascii.Ascii true false false true false true false true) (* 10101001 *)
            EmptyString)
     : string *)

Here, representation of é seem to consist of a concatentation of the letters 'Ã' and '©'.

But then how would a string "é" be made?

Compute "é"%string.
(*
= String (Ascii.Ascii true true false false false false true true)           (* 11000011 *)
         (String (Ascii.Ascii true true false false false false false true)  (* 10000011 *)
            (String
               (Ascii.Ascii false true false false false false true true)    (* 11000010 *)
               (String
                  (Ascii.Ascii true false false true false true false true)  (* 10101001 *)
                  EmptyString)))
     : string
*)

This seem to consist of 4 characters instead of 2. And the first two Ascii.ascii are the same values.

Why is the string being like this? What rules govern the string building process?

Also, since Ascii.Ascii have 8 bits to be used and a character like é can be represented using 8 bits itself, why are two Ascii.Ascii used for that instead of just 1?

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 10:59):

there are many ways to encode characters into bytes in general.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:01):

For instance, letter é might have “value 233” in some character encoding, but it won’t be ASCII, since that one has only character values until 127.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:03):

I’d expect Coq to use the system encoding, which in your example could be using UTF-8: that encoding allows encoding all Unicode values into a sequence of bytes, where each code point consumes 1 or more bytes.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:08):

Indeed, my system uses UTF-8 and gives the same result for é. While 233 encodes é in the Latin-1 (or ISO-8859-1) encoding, a common one in Western Europe before the switch to UTF8.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:18):

The 2nd example seems to agree: your system (and Coq) is using UTF8 but you're decoding using Latin1. Or maybe Windows 1252 since it "extends" Latin1. All your characters appear in this Latin1 table: https://cs.stanford.edu/people/miles/iso8859.html.

view this post on Zulip Gaëtan Gilbert (Jan 02 2022 at 11:24):

I think coq only understands unicode, the system encoding matters for your editor only

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:25):

Well string is clearly using the UTF8 encoding, which I'd presume must agree with the system encoding

view this post on Zulip Gaëtan Gilbert (Jan 02 2022 at 11:26):

eg https://github.com/coq/coq/blob/351a58ef8549c21cc56df2cc308a9e2458dca80e/interp/notation.ml#L499 is_utf8 determines if a string gets printed as a string or not, compare

Require Import String.
Open Scope string_scope.

Check String (Ascii.Ascii true true false false false false true true)
      (String (Ascii.Ascii true false false true false true false true) "").
(* "é" *)

(* invalid unicode *)
Check String (Ascii.Ascii false false false false false false false true)
  (String (Ascii.Ascii true false false false false false false true) "").
(*
String (Ascii.Ascii false false false false false false false true)
  (String (Ascii.Ascii true false false false false false false true) "")
*)

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:27):

ah, so Coq straight out doesn't work with a non-UTF8 system encoding? Fun.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:28):

(tho I guess it's a reasonable compromise these days)

view this post on Zulip Gaëtan Gilbert (Jan 02 2022 at 11:29):

"ascii" should really not use that name though, they're just bytes

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:30):

yep — string is also arguably misnamed, since it's really a sequence of bytes not of "characters"

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:31):

e.g.

Compute (String.length "©").
     = 2
     : nat

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:33):

not that the concept of "Unicode character" is really useful if you're trying to process Unicode text correctly (the closest thing is a "basic grapheme cluster" but it's complicated).

view this post on Zulip Gaëtan Gilbert (Jan 02 2022 at 11:36):

Paolo Giarrusso said:

yep — string is also arguably misnamed, since it's really a sequence of bytes not of "characters"

but string does not mean "sequence of characters", it means "something that works with """
your example only shows that String.length could have a more explicit name (eg byte_length)

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:37):

many languages' stdlibs distinguish "sequence of bytes" from "sequence of characters" and reserve "string" for the latter.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:39):

or at least, that's how things go on the JVM

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:41):

OTOH, I'm not sure you need the distinction if you default to UTF8 everywhere, which is probably a good idea.

view this post on Zulip Julin S (Jan 02 2022 at 11:42):

So does coq always work using utf8 encoding for string values?

view this post on Zulip Julin S (Jan 02 2022 at 11:42):

And could this module be relevant? https://coq.inria.fr/library/Coq.Unicode.Utf8.html

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 11:48):

That module just defines some Unicode analogues of Coq notations — instead of <=.
https://github.com/coq/coq/blob/master/theories/Unicode/Utf8.v https://github.com/coq/coq/blob/master/theories/Unicode/Utf8_core.v

view this post on Zulip Julin S (Jan 02 2022 at 12:35):

That module just defines some Unicode analogues of Coq notations — instead of <=.

Oh.. Now it seems obvious. :sweat_smile: Should've taken a closer look at those files. Thanks.

view this post on Zulip Julin S (Jan 02 2022 at 12:39):

But why is the representation of 'é' looking same as a concatenation of the representations of 'Ã' and '©' though? And why does it take two Ascii.ascii values to represent 'é' instead of one even though its value is less than 256?

I guess this has something to do with the way utf8 works, but couldn't figure it out.

view this post on Zulip Gaëtan Gilbert (Jan 02 2022 at 13:25):

And why does it take two Ascii.ascii values to represent 'é' instead of one even though its value is less than 256?

because it's value is 2 bytes in unicode
unicode only coincides with ascii for values less than 127

view this post on Zulip Gaëtan Gilbert (Jan 02 2022 at 13:26):

the unicode of é is the same as the ascii of é

view this post on Zulip Pierre-Marie Pédrot (Jan 02 2022 at 13:28):

what you call unicode is UTF8, and ascii is probably ISO 8859-1 (or some variant)

view this post on Zulip Julin S (Jan 02 2022 at 13:37):

unicode only coincides with ascii for values less than 127

what you call unicode is UTF8,

Oh.. That makes sense...

So, 'é' is 0xc3a9 in utf8. So it needs two bytes.

They need 2 bytes as well.

So "é" needs 2 + 2 = 4 bytes.

I think I get it now. Thanks a lot everyone!

view this post on Zulip Notification Bot (Jan 02 2022 at 13:38):

Julin S has marked this topic as resolved.

view this post on Zulip Julin S (Jan 02 2022 at 16:02):

Does anyone know the part of the coq reference manual that mentions about strings being encoded in utf8 by default?

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 16:15):

I've looked for it, and found nothing.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 16:16):

Also, I wonder but it doesn't seem just a default — the code linked earlier seemed unconditional, and coqc does not have an option to change the encoding (tho it could in theory support reading the LC_... environment variables).

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 16:18):

It'd probably be worth documenting this somewhere around https://coq.inria.fr/refman/language/core/basic.html#lexical-conventions. In particular, "Identifiers" already mentions support for Unicode letters (tho that's orthogonal to the file encoding):

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 16:18):

unicode_letter non-exhaustively includes Latin, Greek, Gothic, Cyrillic, Arabic, Hebrew, Georgian, Hangul, Hiragana and Katakana characters, CJK ideographs, mathematical letter-like symbols and non-breaking space. unicode_id_part non-exhaustively includes symbols for prime letters and subscripts.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 16:19):

(I'd suspect, but I'm not sure, that unicode_letter means "whatever some Unicode version declares as letter", but I'm not sure).

view this post on Zulip Li-yao (Jan 02 2022 at 16:19):

I think string notations are not utf8, they are just the raw bytes between " and ".

view this post on Zulip Gaëtan Gilbert (Jan 02 2022 at 16:24):

however they only get printed as string notations when they're valid utf8

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 16:48):

so it seems that Coq only _supports_ UTF8, even if lots of it might be somewhat encoding-agnostic (with no guarantees?)

view this post on Zulip Pierre Roux (Jan 02 2022 at 16:53):

Looks like a PR adding this precision to the documentation of String Notation would be welcome.

view this post on Zulip Li-yao (Jan 02 2022 at 16:57):

I think it's rather that the feature does the minimum effort to work in the simplest case (ascii strings) and everything beyond that is an unintended side effect.

view this post on Zulip Paolo Giarrusso (Jan 02 2022 at 18:51):

the code that Gaëtan pointed to supports UTF8, not just ASCII.


Last updated: Oct 13 2024 at 01:02 UTC