paulgorman.org/technical

Character Sets, Encoding, ASCII, Unicode, and All That Jazz

(February 2017)

How do we represent natural-language characters in a computer? By mapping each character to a numeric value (i.e. a pattern of bits). Various standards exist for encoding methods and for the sets of included characters.

Terminology

A character is a minimal unit of text that has semantic value. A character set is a collection of characters that might be used by one or more languages. The Latin character set, for example, includes the characters used in English (and several other European languages).

A coded character set is a character set in which each character corresponds to a unique number. A code point is any legal value in a coded character set. A code unit is the bit sequence used to encode characters of a repertoire within a given encoding form; the code unit size is the number of bits that encoding uses per unit. A code unit in US-ASCII consists of 7 bits, for example, and a code unit in UTF-8 consists of 8 bits.
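
The distinction matters because a single code point may occupy several code units. A minimal sketch in Python 3 (where a string is a sequence of code points):

# U+1D11E MUSICAL SYMBOL G CLEF -- one code point, several code units.
clef = "\U0001D11E"
print(len(clef))                       # 1 -- one code point
print(clef.encode("utf-8").hex())      # f09d849e -- four 8-bit code units
print(clef.encode("utf-16-be").hex())  # d834dd1e -- two 16-bit code units (a surrogate pair)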

The character repertoire is an abstract list of characters drawn from a wide variety of scripts including Latin, Cyrillic, Chinese, Korean, Japanese, Hebrew, and Aramaic, with a codespace that leaves room for more than one million characters. The character repertoire includes other symbols as well, like music notation symbols.

ASCII

The ASCII character encoding standard emerged in the 1960s and was last revised in the 1980s. It was based on telegraph code. ASCII uses seven bits to represent 128 different characters (often visualized in an 8-by-16 grid chart). See ascii(7). Note that the first two columns in the 8-by-16 grid representation consist of non-printing control characters (i.e. the first thirty-two characters in the set).

Printable ASCII Characters (column: high hex digit; row: low hex digit)
   2 3 4 5 6 7
-------------
0:   0 @ P ` p
1: ! 1 A Q a q
2: " 2 B R b r
3: # 3 C S c s
4: $ 4 D T d t
5: % 5 E U e u
6: & 6 F V f v
7: ' 7 G W g w
8: ( 8 H X h x
9: ) 9 I Y i y
A: * : J Z j z
B: + ; K [ k {
C: , < L \ l |
D: - = M ] m }
E: . > N ^ n ~
F: / ? O _ o DEL

We might note an ASCII character as binary, octal, decimal, or hexadecimal:

Char  Binary    Oct   Dec   Hex
A   = 1000001 = 101 = 65  = 41
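
To check such values for any character, a quick sketch in Python:

c = "A"
print(ord(c))               # 65  (decimal)
print(format(ord(c), "b"))  # 1000001  (binary)
print(format(ord(c), "o"))  # 101  (octal)
print(format(ord(c), "x"))  # 41  (hexadecimal)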

Various groups used the code points 128-255 left open by the eighth bit unused by ASCII for non-standard additions to the character set. IBM, for example, shipped a character set with some accented characters, math symbols, playing-card suit symbols, and assorted lines, bars, and arrows. But such additions were not standardized between vendors or regions. Eventually, many of these variations were codified as so-called ANSI code pages, with the lower 128 characters matching ASCII and the upper 128 differing by region. Code pages still didn’t account for the many Asian languages with characters that couldn’t fit in eight bits; some characters could be expressed in one 8-bit byte, but others required two bytes.

Unicode

Unicode attempts to cover all written characters. A common misconception is that Unicode is simply a 16-bit character set that can represent 65,536 possible characters; this is not correct. Unlike ASCII, where a letter maps directly to a bit pattern, a Unicode character maps to a code point, an abstract number rather than a storage format. The Unicode Consortium assigns each character a code point, written like “U+0041”. Unicode isn’t limited to 65,536 code points; the Unicode Consortium will assign as many as they wish, and they’ve already surpassed 65,536. (The total Unicode space is 17 planes of 65,536 code points each, totaling 1,114,112.)

What code point the Unicode Consortium has assigned to a character tells us nothing about how that character is stored in memory. Unicode is not itself an encoding scheme.

Hello = U+0048 U+0065 U+006C U+006C U+006F
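
We can recover those code points with, for example, Python’s ord():

print([f"U+{ord(c):04X}" for c in "Hello"])
# ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']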

…but how those code points get represented in memory depends on the encoding scheme and on details like machine architecture. Unicode’s original UCS-2 encoding represented each character with exactly two bytes; UTF-16 likewise uses 16-bit code units, but adds surrogate pairs to reach code points beyond 65,535. Is the machine big- or little-endian? A UTF-16 stream may begin with a Unicode byte order mark (FE FF for big-endian, FF FE for little-endian) to declare its byte order. “Hello” in UTF-16, both byte orders:

big-endian:     00 48 00 65 00 6C 00 6C 00 6F
little-endian:  48 00 65 00 6C 00 6C 00 6F 00
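
A Python sketch of the byte orders and the BOM (which form the generic “utf-16” codec emits depends on the machine; the comments below assume little-endian hardware):

s = "Hello"
print(s.encode("utf-16-be").hex())  # 00480065006c006c006f
print(s.encode("utf-16-le").hex())  # 480065006c006c006f00
print(s.encode("utf-16").hex())     # fffe480065006c006c006f00 -- BOM first, then little-endian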

UTF-32 is large enough to encode the entire Unicode space, but four bytes per code point wastes a lot of memory for most characters.
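
The trade-off is easy to see by comparing encoded sizes (a Python sketch):

for s in ("Hello", "日本語"):
    print(s, len(s.encode("utf-8")),
             len(s.encode("utf-16-be")),
             len(s.encode("utf-32-be")))
# Hello 5 10 20
# 日本語 9 6 12

Note that UTF-16 is actually the most compact of the three for the CJK string, while UTF-32 is the largest in both cases.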

UTF-8 is yet another encoding for Unicode — a variation on how to represent code points in memory. UTF-8 stores code points 0-127 in a single byte; it stores code points 128 and above in multiple bytes as necessary (generally two or three, up to four under the current definition; the original design allowed up to six). This imparts the benefit that the UTF-8 representation of ASCII text is byte-for-byte identical to its ASCII representation.

Hello = U+0048 U+0065 U+006C U+006C U+006F = 48 65 6C 6C 6F
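
One-byte and multi-byte UTF-8 sequences side by side (a Python sketch):

print("Hello".encode("utf-8").hex())  # 48656c6c6f -- identical to the ASCII bytes
print("é".encode("utf-8").hex())      # c3a9 -- U+00E9 needs two bytes
print("♔".encode("utf-8").hex())      # e29994 -- U+2654 needs three bytes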

There are many other ways to encode Unicode besides UCS-2, UTF-16, and UTF-8. Not all encoding schemes completely cover all Unicode code points. Where a decoder encounters bytes it can’t map to a valid code point, we may see the Unicode “replacement character” U+FFFD (�) substituted for them.
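
For example, decoding bytes that aren’t valid UTF-8 (a Python sketch; the byte string is contrived for illustration):

data = b"H\xc3(llo"  # 0xC3 opens a two-byte sequence, but "(" (0x28) is not a valid continuation byte
print(data.decode("utf-8", errors="replace"))  # H�(llo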

Unicode “Code Points”

In Unicode, a “code point” is something of an abstraction: it names a position in the Unicode codespace, which may be assigned to a character, reserved for future assignment, or set aside for special uses like surrogates and noncharacters.

Unicode Chess Symbols

Name                Symbol  Codepoint  HTML
white chess king    ♔       U+2654     &#9812;
white chess queen   ♕       U+2655     &#9813;
white chess rook    ♖       U+2656     &#9814;
white chess bishop  ♗       U+2657     &#9815;
white chess knight  ♘       U+2658     &#9816;
white chess pawn    ♙       U+2659     &#9817;
black chess king    ♚       U+265A     &#9818;
black chess queen   ♛       U+265B     &#9819;
black chess rook    ♜       U+265C     &#9820;
black chess bishop  ♝       U+265D     &#9821;
black chess knight  ♞       U+265E     &#9822;
black chess pawn    ♟       U+265F     &#9823;
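
The decimal HTML entities above are just the code points written in base ten; a short Python sketch regenerates the table:

for cp in range(0x2654, 0x2660):  # U+2654 through U+265F
    print(f"U+{cp:04X}  &#{cp};  {chr(cp)}")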