Unicode in Emacs
Characters
Emacs Lisp has a special read syntax for characters: the ? question mark. Such syntax is necessary to distinguish the character ?à from the symbol à.
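For example, the two read syntaxes produce different kinds of Lisp objects, which type-of makes visible:
(type-of ?à)
integer
(type-of 'à)
symbol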
Characters are Integers
Characters evaluate to integers:
?à
224
We can see that ?à and 224 are equal in (almost?) every way; in fact the reader turns ?à into exactly the integer 224, so every equality predicate agrees:
(equal ?à 224)
t
(eq ?à 224)
t
(char-equal ?à 224)
t
Converting Between Chars and Integers
(string-to-char "à")
224
(char-to-string 224)
à
The argument to char-to-string must be a character, so it can be given either as:
- A character, using ? syntax
- An integer, using any of the ways to express an integer
For example, as an integer using hex notation:
(char-to-string #xe0)
à
Or, as a character using hex notation:
(char-to-string ?\xe0)
à
Why Have Two Representations?
The reader syntax ? simply allows you to express a number using a character. Sometimes that is very helpful and provides clarity (when you are dealing with characters) and sometimes it would be silly (when dealing with numbers).
The Elisp manual notes that "whether an integer is a character or not is determined only by how it is used":
(+ ?à 2) ; Usually not helpful
226
(make-string 5 ?à) ; Helpful
ààààà
Which, by the way, could also be written as:
(make-string 5 224) ; Not helpful
ààààà
In Emacs, all characters are integers, but not all integers are characters. A character's corresponding integer is simply the Unicode number (i.e. the Unicode code point) of the character.
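The predicate characterp checks whether an integer falls inside the valid character range:
(characterp 224)
t
(characterp 999999999)
nil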
Unicode code points are defined from the integers #x0 to #x10FFFF (Emacs itself accepts character codes beyond that; see the Code Points section below).
The ? read syntax for a character also allows the character to be expressed as a hex number or an octal number. As before, the expression evaluates to an integer (printed in decimal) representing the character's code point.
?\xe0
224
?\340
224
The function make-char returns the character (an integer) of a given charset whose position codes are the remaining arguments:
(make-char 'unicode 0 0 224 0)
224
Unicode Escape Sequences
A character can be defined using a Unicode escape sequence. There are two forms for Unicode escape sequences:
- \uXXXX (\u and four hex digits)
- \U00XXXXXX (\U00 and six hex digits)
Evaluating a character with a Unicode escape sequence returns an integer:
?\u00e0
224
?\U0001F638
128568
Render the character using char-to-string:
(char-to-string ?\u00e0)
à
Also, evaluating a string with a Unicode escape sequence returns a string:
"\U0001F638"
😸
Convert Unicode Code Point to Character
The function (char-to-string CHAR) returns a string containing the character at code point CHAR. Unicode code points written in the "U+2388" form are hex, so either use Emacs's hex read syntax or convert them to decimal first.
Examples using Unicode Character "⎈" (U+2388):
(char-to-string ?\u2388)
⎈
(char-to-string ?\x2388)
⎈
"?\u2388"
?⎈
"?\N{HELM SYMBOL}"
?⎈
Convert Unicode name to character
The \N{NAME} escape specifies a Unicode character by its name. In a string (the leading ? below is just a literal question mark):
"?\N{LATIN SMALL LETTER A WITH GRAVE}"
"?à"
Encode a string
Encoding a string means translating a string of Unicode code points (integers) into new integers, according to some encoding scheme (like UTF-8). This is necessary in order to tell where one number ends and the next begins:
1224
Is that the single character 1224 ("ӈ")? Or the two characters 12 and 24? Or something else? UTF-8 encodes strings into a binary form that can be unambiguously reversed (decoded) back to Unicode code points.
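A quick round trip shows the point: encoding "à" yields two raw bytes, and decoding those bytes (with decode-coding-string, covered below) recovers the original character:
(string-to-list (encode-coding-string "à" 'utf-8))
(195 160)
(decode-coding-string (encode-coding-string "à" 'utf-8) 'utf-8)
"à"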
Viewing encoded strings is sometimes difficult, because the binary form of a string is automatically decoded in order to be displayed.
(encode-coding-string "naïve" 'utf-8 t)
"nai\314\210ve"
encode-coding-string returns a unibyte string of raw bytes. When Emacs prints a unibyte string, any byte outside the printable ASCII range is shown as an octal escape (like \314), which is why the encoded results look escaped.
(encode-coding-string "\u0073" 'utf-8)
s
(encode-coding-string "\U0001F638" 'utf-8)
"\360\237\230\270"
toggle-enable-multibyte-characters
Another way to see this is to write multibyte strings to a file, then run M-x toggle-enable-multibyte-characters.
Decode
Decoding a unibyte string of encoded bytes returns the multibyte (character) equivalent of the string.
(decode-coding-string "nai\314\210ve" 'utf-8)
naïve
(decode-coding-string "\360\237\230\270" 'utf-8)
😸
Code Points
Range is 0 to #x10FFFF (hex). Emacs extends this with the range #x110000 to #x3FFFFF, which it uses for raw bytes and for characters not unified with Unicode. A character code point in Emacs is a 22-bit integer.
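The function max-char reports the top of that range, and characterp agrees:
(max-char)
4194303
(characterp #x10FFFF)
t
(characterp (+ 1 (max-char)))
nil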
Convert a Unicode name or number to a character
Unicode name
(char-from-name "LATIN SMALL LETTER A WITH GRAVE")
224
Unicode number
Unicode number as decimal:
(char-to-string 128568)
😸
As hex
(char-to-string ?\x1F638)
😸
As octal
?\340
224
Normalize a string
(ucs-normalize-NFD-string "nai\u0308ve")
"naïve"
Elisp
A character in Emacs Lisp is nothing more than an integer:
Characters in strings and buffers are currently limited to the range of 0 to 4194303—twenty two bits
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Type.html
Unicode table
The variable ucs-names (in mule-cmds.el) caches a hash table of Unicode character names and their code points. The function ucs-names returns the fully populated table; char-from-name looks a single name up in it.
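For example, assuming a recent Emacs where ucs-names returns a hash table (older versions returned an alist), a name can be looked up directly:
(gethash "LATIN SMALL LETTER A WITH GRAVE" (ucs-names))
224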
Resources
- https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Type.html
- https://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-Characters.html
- https://www.gnu.org/software/emacs/manual/html_node/elisp/Strings-and-Characters.html
- https://www.gnu.org/software/emacs/manual/html_node/elisp/Describing-Characters.html#Describing-Characters
- https://www.gnu.org/software/emacs/manual/html_node/elisp/String-Type.html
- https://www.gnu.org/software/emacs/manual/html_node/emacs/International.html
TODO
Does a character evaluate to different numbers under different coding systems?
Can Emacs interpret a byte array as characters? Example: an IPv4 address, which is often represented as 4 bytes.