On Unicode

Unicode, particularly in its UTF-8 encoding, is the most widely used character encoding scheme today. It has solved one of the most annoying compatibility problems of the 1980s. Everyone is using it, yet its details are still often misunderstood.

Back in 2003, Joel Spolsky published an excellent article on his site “Joel on Software”, titled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”. This post is an attempt to summarize some of the highlights of that article even further. Unfortunately, this means that I will not even cover the absolute minimum here…

Joel starts out his article by giving a historical perspective: he writes about the origins of the ASCII code and how it led to various ASCII extensions for use with different languages. These extensions, called code pages, define the upper 128 possible characters in a byte that are left undefined by ASCII. Of course, this led to tremendous complications whenever the code page of a document was not known. Unicode was devised as a solution to the code page problem. It defines a single standardized character set not bound by the 256-character limit, which could grow to eventually include all characters of all known languages.
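To illustrate the problem, here is a small Python sketch (the byte value and the code pages are chosen purely for illustration): the very same byte decodes to entirely different characters depending on which code page a program assumes.

    # One and the same byte means different things under different code pages.
    raw = bytes([0xE9])
    print(raw.decode("cp1252"))  # 'é' under Western European (Windows-1252)
    print(raw.decode("cp1251"))  # 'й' under Cyrillic (Windows-1251)
    print(raw.decode("cp1253"))  # 'ι' under Greek (Windows-1253)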

One of the most common misconceptions about Unicode is that it simply defines a 16-bit code reserving 2 bytes per character, therefore allowing 65,536 different characters instead of 256. Actually, things are a bit more complicated: in Unicode, characters do not map directly to some strictly defined bit pattern to be stored in memory. Instead, each character maps to something called a code point.

Not surprisingly, the basic Latin letters map to code points that are identical to their ASCII codes. Code points are written in a special notation; for example, the code point of the letter ‘A’ is usually written U+0041. The ‘U’ indicates that this is Unicode, and the four digits are hexadecimal. The common convention of using four digits is a hint that Unicode was indeed originally envisioned as a 16-bit coding scheme, but today this is no longer a limit.
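In Python, for instance, the built-in ord() function returns exactly this code point, so the notation is easy to reproduce (the sample characters below are simply ones I picked):

    # A character's code point is just an integer; ord() returns it.
    for ch in "A€𝄞":
        print(f"{ch!r} -> U+{ord(ch):04X}")
    # 'A' -> U+0041   (identical to its ASCII code)
    # '€' -> U+20AC
    # '𝄞' -> U+1D11E  (beyond 16 bits, so five hex digits are needed)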

How are code points represented in memory? Restricting oneself to 16-bit code points, one can simply write out 16 bits per character in sequence. This is called UTF-16 encoding. So “Hello” may become the byte sequence 00 48 00 65 00 6C 00 6C 00 6F. That is, if we interpret 16-bit integers in big-endian byte order! On little-endian machines, the same string might be stored as 48 00 65 00 6C 00 6C 00 6F 00. As a result, even though characters map uniquely to code points and we have restricted ourselves to UTF-16, we still end up with two different actual encodings.
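This is easy to verify with Python's explicit big- and little-endian UTF-16 codecs (the separator argument to hex() needs Python 3.8 or newer):

    text = "Hello"
    # Big-endian UTF-16: the high byte of each 16-bit unit comes first.
    print(text.encode("utf-16-be").hex(" "))  # 00 48 00 65 00 6c 00 6c 00 6f
    # Little-endian UTF-16: the low byte comes first.
    print(text.encode("utf-16-le").hex(" "))  # 48 00 65 00 6c 00 6c 00 6f 00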

As long as one stays on a single machine, whether big-endian or little-endian, the two UTF-16 encodings cause no problem. But if one transfers Unicode text from one machine to another, the receiver must somehow be informed about the byte order used. For this purpose, the special code point U+FEFF was introduced, called the Unicode Byte Order Mark (BOM). This code point does not represent any printable character, but is supposed to be placed at the very beginning of each UTF-16 document for byte order identification. Depending on whether the document starts with the bytes FE FF or FF FE, a program can tell whether it was stored in big-endian or little-endian byte order.
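As a rough sketch of how a reader might use the BOM (Python's plain "utf-16" codec writes one automatically, in the machine's native byte order):

    data = "Hello".encode("utf-16")
    # 'ff fe' on a little-endian machine, 'fe ff' on a big-endian one.
    print(data[:2].hex(" "))

    # A decoder can look at the first two bytes to pick the right byte order.
    bom, payload = data[:2], data[2:]
    if bom == b"\xfe\xff":
        print(payload.decode("utf-16-be"))
    elif bom == b"\xff\xfe":
        print(payload.decode("utf-16-le"))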

The complication with the two possible byte orders dampened the initial enthusiasm for Unicode. Another perceived problem was the abundance of zero bytes, which wasted memory and prevented the zero byte from being used as an end-of-string marker. As a result, years passed with little Unicode adoption. Finally, a new Unicode encoding called UTF-8 was invented. UTF-8 uses only one byte for code points up to 127, and more bytes (up to four in today's standard, though the original design allowed six) for higher code points. UTF-8 is very compact and familiar for English-language users, since English documents in UTF-8 look identical to their ASCII representation. Also, the zero byte never appears in the encoding of any character other than U+0000, so it can again serve as an end-of-string marker, allowing much software written for ASCII to handle Unicode with little or no change.
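A few sample characters (again chosen only for illustration) make the variable length and the ASCII compatibility visible in Python:

    # ASCII characters keep their single-byte encoding; others take 2 to 4 bytes.
    for ch in "Aé€𝄞":
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
    # U+0041  -> 41           (1 byte, identical to ASCII)
    # U+00E9  -> c3 a9        (2 bytes)
    # U+20AC  -> e2 82 ac     (3 bytes)
    # U+1D11E -> f0 9d 84 9e  (4 bytes)

    # No character other than U+0000 ever produces a zero byte,
    # so NUL-terminated C strings keep working.
    print(b"\x00" in "Hello, Unicode".encode("utf-8"))  # False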

Most file formats nowadays use UTF-8. The only disadvantage of UTF-8 is that characters have variable length, which makes random access within strings complicated. Therefore, many programs still use UTF-16 internally, even though they import and export data in UTF-8. Besides UTF-16 and UTF-8, other encodings, such as UTF-7 and UTF-32, have been defined. However, these less common encodings are rarely seen outside of a few niche applications.
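The random access problem boils down to the fact that character indices and byte offsets no longer line up; a quick Python example (with an arbitrarily chosen word):

    s = "naïve"
    # Five characters, but six UTF-8 bytes, because 'ï' needs two bytes.
    print(len(s))                  # 5
    print(len(s.encode("utf-8")))  # 6
    # Finding the n-th character in raw UTF-8 data therefore requires
    # scanning from the start instead of jumping to a fixed offset.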

After covering the fundamentals of Unicode in much more detail than given here, Joel goes on in his article to explain how to apply this knowledge to HTML, specifically the “http-equiv” meta-tag. I am not going to repeat any of this, since I could not put it in better words anyway. If you do not consider yourself a Unicode guru, and have not yet read Joel’s article, do it now! After all, as Joel states in his title: “It’s The Absolute Minimum… (No Excuses!)”
