UTF-8 - the brilliant trick to rule them all
By Tomasz Kuczma
UTF-8 is probably the biggest invention in electronic text communication since the ASCII table was created in 1967, and it remains so to this day. It became the dominant encoding on the World Wide Web in 2009 and is used by almost 95% of websites nowadays. In fact, everybody uses it, but many might not be aware of why this standard was chosen, so let me shed some light on this and show you the brilliant trick.
History lesson first
The ASCII table was invented to unify character encoding across different computers. Before that, every computer company used different numbers to represent different characters of a text. The user was not able to save text on one PC and display it correctly on another PC.
ASCII - the American Standard Code for Information Interchange - solved that problem. The standard provides a mapping of 128 characters to 7-bit numbers. Simple, and it does the job.
The problem with ASCII
The American standard solved the problem perfectly in those days but totally ignored the existence of languages that use non-Latin alphabets. That started to become a problem with the growing international demand for computers. Many companies developed new encoding standards that use 8 bits instead of 7, gaining 128 additional values to represent non-Latin characters. A lot of standards were created, like ISO 8859-1 and ISO 8859-2, and finally nationalities got their characters:
- Germans and Swedes got umlaut letters like Ü, Ä,
- Poles and Lithuanians got diacritic hooks “ogonek” like Ą, Ę,
- Chinese got their own palette of characters like 一世 (using multi-byte-per-character encodings), et cetera.
The problem with many standards
Unfortunately, having many standards caused a new problem. Now we need to support them all (or most of them) and detect which one is currently in use. The solution is a new standard that gathers all the characters used around the world into one standard. Looks like we have come full circle :)
A simple implementation would just use a 16-bit number per character to support 65k characters, or a 24-bit or 32-bit number to support even more characters in one standard.
Right now, the newest Unicode standard 12.1.0 supports [137,929 characters](https://www.unicode.org/versions/Unicode12.1.0/), which means we would have to use at least 3 bytes per character to encode them all. That is wasteful! Encoding every standard Latin alphabet letter using 24 bits wastes 17 bits in comparison to 7-bit ASCII, so standard English-based text becomes 3 times bigger in size.
The solution: UTF-8
The inventors of UTF-8 saw that problem. In many languages, the Latin alphabet is the base and diacritics (special characters) have a minority share. Using more than 8 bits to encode every kind of character would be far too wasteful, as mentioned above. They designed UTF-8 in such a way that every number (the bit representation of a character) carries information about its own length. The encoding can be represented by the following table:
| Number of bytes per character | Number of bits for character encoding | First code point | Last code point | 1st byte | 2nd byte | 3rd byte | 4th byte |
|---|---|---|---|---|---|---|---|
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | - | - | - |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | - | - |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | - |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The goal has been reached:
- We are still fully compatible with ASCII, so we can reuse a lot of already written software and it is going to keep working for English text.
- The most popular characters are still represented by a 1-byte number and most of the diacritics by a 2-byte number.
ASCII characters are represented by 8 bits instead of 7 (efficiency $7/8 \approx 88\%$). Many diacritics have efficiency $11/16 \approx 69\%$. They even covered Chinese using 3-byte numbers, and emojis and math symbols using 4-byte numbers. The standard already has room for over 2 million characters ($2^{21} \approx 2{,}097{,}152$) represented by 4 bytes.
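To make the table above concrete, here is a minimal encoder sketch of my own (the function name encodeUtf8 is not from the article) that maps a Unicode code point to UTF-8 bytes exactly as the table describes:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

//encode a single Unicode code point into UTF-8 bytes, following the table above
std::vector<uint8_t> encodeUtf8(uint32_t codePoint) {
    std::vector<uint8_t> out;
    if (codePoint <= 0x7F) {              //1 byte: 0xxxxxxx
        out.push_back(static_cast<uint8_t>(codePoint));
    } else if (codePoint <= 0x7FF) {      //2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<uint8_t>(0xC0 | (codePoint >> 6)));
        out.push_back(static_cast<uint8_t>(0x80 | (codePoint & 0x3F)));
    } else if (codePoint <= 0xFFFF) {     //3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<uint8_t>(0xE0 | (codePoint >> 12)));
        out.push_back(static_cast<uint8_t>(0x80 | ((codePoint >> 6) & 0x3F)));
        out.push_back(static_cast<uint8_t>(0x80 | (codePoint & 0x3F)));
    } else {                              //4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<uint8_t>(0xF0 | (codePoint >> 18)));
        out.push_back(static_cast<uint8_t>(0x80 | ((codePoint >> 12) & 0x3F)));
        out.push_back(static_cast<uint8_t>(0x80 | ((codePoint >> 6) & 0x3F)));
        out.push_back(static_cast<uint8_t>(0x80 | (codePoint & 0x3F)));
    }
    return out;
}

int main() {
    //U+0105 is "a with ogonek" (ą); it needs 9 bits, so it falls into the 2-byte row
    for (uint8_t b : encodeUtf8(0x0105)) {
        std::printf("%02X ", static_cast<unsigned>(b));  //prints: C4 85
    }
    std::printf("\n");
}
```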
A careful reader will notice something strange
Why is every continuation byte prefixed with 10? Isn’t it wasteful? Nothing could be further from the truth. This trick solves two main problems:
- When you read a random byte somewhere in the middle of a string, you know whether it is a continuation byte, and if it is, you just rewind to the previous bytes until you find the start of the character (see the sketch below). There is no way you misinterpret a character.
- The end-of-string byte cannot appear in the middle of a string. The end-of-string byte, or the “null character”, is the byte with all bits set to 0, and a lot of software stops reading a text stream when this byte appears. UTF-8 keeps compatibility with it (as ASCII does) and, thanks to that prefix, prevents it from ever appearing as the value of a continuation byte.
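A minimal sketch of that first point (the helper names are mine, not from the article): a continuation byte always matches the bit pattern 10xxxxxx, so recovering the start of a character from a random position is a short backwards scan.

```cpp
#include <cstddef>
#include <cstdint>

//a continuation byte always matches the bit pattern 10xxxxxx
inline bool isContinuationByte(uint8_t byte) {
    return (byte & 0xC0) == 0x80;
}

//given an index pointing somewhere inside a UTF-8 byte array,
//rewind to the first byte of the character that contains it
size_t findCharacterStart(const uint8_t* bytes, size_t index) {
    while (index > 0 && isContinuationByte(bytes[index])) {
        --index;
    }
    return index;
}
```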
Implementation
Let me show you a simple implementation which parses a UTF-8 encoded byte stream into an integer representing the character in the Unicode table:
```cpp
#include <cstdint>

inline int countLeadingOnesInByte(int byte) {
    //invert the byte so that its leading ones become leading zeros
    byte = 0xFF & (~byte);
    //use GCC built-in to call the "count leading zeros" CPU instruction
    //(note: 0xFF is not valid UTF-8, and __builtin_clz(0) is undefined)
    return __builtin_clz(byte) - (sizeof(int) - 1) * 8;
}

wchar_t readUtf8Char(ByteStream& byteStream) {
    uint_fast8_t firstByte = byteStream.nextByte();
    int leadingOnes = countLeadingOnesInByte(firstByte);
    //keep only the payload bits of the first byte (see the table above)
    const int mask = (1 << (7 - leadingOnes)) - 1;
    int tmp = firstByte & mask;
    //validation if first byte is not a continuation byte
    //validation if leadingOnes is less than 4 as per the current standard
    while (--leadingOnes > 0) {
        tmp <<= 6;
        //validation if nextByte() is a continuation byte?
        tmp |= byteStream.nextByte() & 0x3F;
    }
    return (wchar_t) tmp;
}
```
This implementation is simplified. I put comments where the potential places for validation are. A final implementation should consider a wider context.
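The ByteStream type is left undefined above, so here is one possible minimal stand-in together with a usage sketch. It assumes the interface is just a single nextByte() method (my assumption), and the struct has to be declared before readUtf8Char for the snippet to compile:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

//minimal stand-in for the ByteStream used by readUtf8Char above
struct ByteStream {
    std::string data;
    size_t pos = 0;
    uint8_t nextByte() { return static_cast<uint8_t>(data[pos++]); }
};

int main() {
    //"zażółć" written out as raw UTF-8 bytes: ASCII letters mixed with
    //2-byte Polish characters (e.g. 0xC5 0xBC is ż, U+017C)
    ByteStream stream{"za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87"};
    while (stream.pos < stream.data.size()) {
        std::printf("U+%04X\n", static_cast<unsigned>(readUtf8Char(stream)));
    }
    //prints U+007A, U+0061, U+017C, U+00F3, U+0142, U+0107 on separate lines
}
```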
The problem with UTF-8
The main problem is calculating the length of a string and accessing the n-th character. There is no easy way to do that in constant time when the string is stored as a simple byte array; you have to scan it byte by byte, as shown below.
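For instance, even counting characters requires scanning the whole byte array; a minimal sketch (the helper name utf8Length is mine):

```cpp
#include <cstddef>
#include <string>

//count Unicode characters in a UTF-8 string: O(n) in the number of bytes,
//because every byte must be inspected to skip continuation bytes (10xxxxxx)
size_t utf8Length(const std::string& text) {
    size_t count = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) {  //not a continuation byte: starts a character
            ++count;
        }
    }
    return count;
}
```

Finding the n-th character works the same way: walk the bytes and count only those that start a character, which makes random access O(n) instead of O(1).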
The second problem might be the size of Chinese and Japanese characters. UTF-8 stores them using 3 bytes while UTF-16 requires only 2. That might be a problem for many people (the population of China is 1.4 billion) as a lot of text has to be transferred to users on the web using the less efficient format (50% more data to transfer for those characters). On the other hand, it really depends on how the text is served. E.g. HTML and other markup languages use mostly ASCII-based characters for formatting, hence the overhead of using more space for Chinese and Japanese characters might be smaller than the saving made by not storing 2 bytes for each ASCII character (in comparison to UTF-16).
Summary
The UTF-8 name might be misleading as it is not a fixed-length encoding. It effectively solves the problem of compatibility with ASCII, handles the null character, and is cost (storage) efficient. The trick used to achieve that is just brilliant and could be reused for other problems, e.g. storing big decimals effectively.
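One way to read that last suggestion is to apply the same length-prefix idea to integers. The sketch below is my own illustration (not a standard format and not from the article): the prefix of the first byte tells how many bytes follow, so small values stay small.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

//my own illustration of reusing the UTF-8 length-prefix trick for integers;
//unlike UTF-8, the following bytes carry raw bits, so this sketch is not self-synchronizing
std::vector<uint8_t> encodeVarUint(uint32_t value) {
    //larger values would continue the same pattern with longer prefixes
    assert(value <= 0x0FFFFFFF);
    std::vector<uint8_t> out;
    if (value <= 0x7F) {                 //0xxxxxxx            (7 bits)
        out.push_back(static_cast<uint8_t>(value));
    } else if (value <= 0x3FFF) {        //10xxxxxx + 1 byte   (14 bits)
        out.push_back(static_cast<uint8_t>(0x80 | (value >> 8)));
        out.push_back(static_cast<uint8_t>(value & 0xFF));
    } else if (value <= 0x1FFFFF) {      //110xxxxx + 2 bytes  (21 bits)
        out.push_back(static_cast<uint8_t>(0xC0 | (value >> 16)));
        out.push_back(static_cast<uint8_t>((value >> 8) & 0xFF));
        out.push_back(static_cast<uint8_t>(value & 0xFF));
    } else {                             //1110xxxx + 3 bytes  (28 bits)
        out.push_back(static_cast<uint8_t>(0xE0 | (value >> 24)));
        out.push_back(static_cast<uint8_t>((value >> 16) & 0xFF));
        out.push_back(static_cast<uint8_t>((value >> 8) & 0xFF));
        out.push_back(static_cast<uint8_t>(value & 0xFF));
    }
    return out;
}
```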
There are other Unicode formats in use, like UTF-32, which is fixed-length, and UTF-16, which encodes characters as one or two 16-bit units. E.g. UTF-16 is used by Java Strings, but that is a topic for a separate article.
Software engineer with a passion. Interested in computer networks and large-scale distributed computing. He loves to optimize and simplify software on various levels of abstraction starting from memory ordering through non-blocking algorithms up to system design and end-user experience. Geek. Linux user.