Encodings, Unicode and broken code

This is another sad tale of character encodings. Consider this LeetCode problem that asks you to check whether two given strings are isomorphic. Isomorphic strings are defined as strings of the same length with a bijective mapping between their characters. For example, “aba” is isomorphic to “ava” with the mapping a ↔ a, b ↔ v, and to “mlm” with the mapping a ↔ m, b ↔ l, but not to “aaa” (no bijection, since both “a” and “b” would be mapped to “a”).

Now consider possible solutions in Java. One obvious solution:
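Something along these lines (a sketch, assuming the usual LeetCode signature, built on two HashMaps, one per direction of the mapping):

```java
import java.util.HashMap;
import java.util.Map;

class Solution {
    public boolean isIsomorphic(String s, String t) {
        if (s.length() != t.length()) {
            return false;
        }
        // Two maps, one for each direction, to enforce a bijection.
        Map<Character, Character> forward = new HashMap<>();
        Map<Character, Character> backward = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            char a = s.charAt(i);
            char b = t.charAt(i);
            Character prevB = forward.put(a, b);
            Character prevA = backward.put(b, a);
            // If either character was already mapped to something else, fail.
            if ((prevB != null && prevB != b) || (prevA != null && prevA != a)) {
                return false;
            }
        }
        return true;
    }
}
```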

Runs in 36 ms, certainly not the fastest submission. One way to “optimize” it:
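Roughly like this (again a sketch): replace the maps with two char[65536] tables indexed by the 16-bit char value, abusing NUL (\u0000) as the “not mapped yet” marker.

```java
class Solution {
    public boolean isIsomorphic(String s, String t) {
        if (s.length() != t.length()) {
            return false;
        }
        // One slot per possible char value; 0 (NUL) means "not mapped yet".
        char[] forward = new char[65536];
        char[] backward = new char[65536];
        for (int i = 0; i < s.length(); i++) {
            char a = s.charAt(i);
            char b = t.charAt(i);
            if (forward[a] == 0 && backward[b] == 0) {
                forward[a] = b;
                backward[b] = a;
            } else if (forward[a] != b || backward[b] != a) {
                return false;
            }
        }
        return true;
    }
}
```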

This runs in 12 ms. Three times faster! Beating 92%! And here is yet another version:
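Something like the following sketch: the classic C-style approach with 128-element tables, which assumes all input is US-ASCII.

```java
import java.util.Arrays;

class Solution {
    public boolean isIsomorphic(String s, String t) {
        if (s.length() != t.length()) {
            return false;
        }
        // 128 slots: enough for US-ASCII and nothing else.
        int[] forward = new int[128];
        int[] backward = new int[128];
        Arrays.fill(forward, -1);
        Arrays.fill(backward, -1);
        for (int i = 0; i < s.length(); i++) {
            int a = s.charAt(i); // anything >= 128 throws ArrayIndexOutOfBoundsException
            int b = t.charAt(i);
            if (forward[a] == -1 && backward[b] == -1) {
                forward[a] = b;
                backward[b] = a;
            } else if (forward[a] != b || backward[b] != a) {
                return false;
            }
        }
        return true;
    }
}
```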

Now, let me ask a question: which of the solutions above is the best one?

The last one is something an English speaker with a C background might come up with. It will obviously break for any characters outside US-ASCII, including Cyrillic, Chinese, Hebrew, or even German or Irish (because of the umlauts and the fada). So obviously it’s not acceptable.

The second one is trickier. One thing is that it might break if one of the strings contains NUL characters, because we abuse NUL as the “character not mapped” special value. Another thing is that initializing the whole array with zeroes takes O(65536) time, which could make it a poor choice for short strings.

So it looks like the first one is the best, right? It scales nicely to any length, and even though it’s slower, it handles NULs properly.

Well, the answer is: all of them are wrong! One test case that none of the solutions above will pass is “ab”, “冬b”. In case you can’t see it, here is a picture:

[image: the kanji character from the test case]

That’s right, that one weird Chinese character is enough to break all of the solutions above. Moreover, it breaks the LeetCode testing system as well (just like Cyrillic or anything else non-ASCII does) and the LeetCode Discuss forums too (unlike Cyrillic and many other non-ASCII symbols). Why? What’s wrong with that particular character? Java stores strings using Unicode, right? That’s why char is two bytes, after all! So it should be able to handle any character without any problems! The dark age of terrible national encodings is over!

In order to understand this, we must look back at the history of encodings and Unicode.

It all started in the 1960s, or even long before that (Morse code came into existence long before the first computer). But it was in the 1960s that all hell broke loose. In 1963 both ASCII and EBCDIC were introduced. While even EBCDIC is apparently still in use today, it’s ASCII that became widespread, and the fact that ASCII was a 7-bit encoding meant that there was one “free” bit and 128 unused codes in the 128–255 range. That, and the lack of any letters except basic Latin, immediately gave birth to a myriad of national encodings. Worse, multiple encodings were sometimes used for the same language. I know of four for Russian, for example: code page 866 (the “MS-DOS” encoding), code page 1251 (the “Windows” or ANSI encoding), KOI8-R (a really weird encoding that arranges letters according to the English alphabet rather than the Russian one; it was really widespread in the early days of the Russian Internet) and the “standard” ISO-8859-5 that was rarely used at all. This is still a major source of trouble: when you run a program in a console window, you have no idea which encoding will be used, and therefore you have about a 50% chance of getting garbage (less in practice, because most programs will use the MS-DOS encoding). And nobody plans to fix it, because it is impossible and because nobody cares about console windows nowadays.

Chinese and Japanese users had it even worse: 128 values are obviously not enough to represent even the roughly 2000 ideographs in everyday Japanese use (and that’s only a subset of Chinese!), so they went ahead and invented two-byte encodings, which made things much worse, because now, given a sequence of bytes, you couldn’t even determine the string length without knowing which encoding was used.

Then Unicode came into being. The first standard was published in 1991, and it introduced a 16-bit encoding intended for universal use, which included all characters deemed reasonable. Unfortunately, the bunch of Old Evil Encodings didn’t disappear at that very moment, so the only thing that really happened that day was that the world now had one more encoding to deal with. No, wait, make it two encodings, because Unicode defined characters as 16-bit units, and those can be represented as bytes using either Little Endian or Big Endian order.

Even worse, Unicode apparently failed to include some important characters, such as rarely used ideographs (like that 冬), even though they are part of personal names and names of places. Imagine not being able to type your own name while trying to use some software! So clearly some extension was needed. That is how Unicode transformed from a single 16-bit encoding into a whole standard of concepts and encodings.

The core concept is the code point. A code point is a 21-bit number corresponding to some character, typically represented as a 32-bit integer in memory and written as something like U+00B0, where 00B0 is the hexadecimal value of the code point (in this case it’s the degree sign: °). The current range of code points is U+0000–U+10FFFF, hence 21 bits (but the range is extendable). So, as you can see, saying that Unicode is a 16-bit encoding is wrong in several ways: Unicode is a standard (defining multiple encodings), not an encoding, and not all Unicode encodings are 16-bit.
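A minimal sketch to illustrate, using the standard java.lang.Character constants (the class name is mine):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        int degree = 0x00B0; // the code point U+00B0, i.e. the degree sign
        // A BMP code point fits into a single char...
        System.out.println(Character.toChars(degree).length);              // 1
        // ...and the whole range currently ends at U+10FFFF.
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
        System.out.println(Character.isValidCodePoint(0x110000));          // false
    }
}
```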

The code points defined in the first Unicode standard now belong to the so-called Basic Multilingual Plane (BMP), which covers the range U+0000–U+FFFF. That includes Latin (and thus English), Arabic, Hebrew, most Chinese and Japanese, and lots of other useful things. However, some Chinese characters fall outside the BMP; they belong to the so-called supplementary planes, and code points U+10000 and above are called supplementary code points (or supplementary characters).
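Java can tell the planes apart; a small sketch (U+20000, the first ideograph of CJK Extension B, is used here purely as an example of a supplementary code point):

```java
public class PlaneDemo {
    public static void main(String[] args) {
        System.out.println(Character.isBmpCodePoint('b'));                // true
        System.out.println(Character.isSupplementaryCodePoint(0x20000));  // true
        // The supplementary range starts right after the BMP.
        System.out.println(Character.MIN_SUPPLEMENTARY_CODE_POINT == 0x10000); // true
    }
}
```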

There are three main encodings in the current Unicode standard. By “main” I mean that they are both part of the standard and widely used. These are (a short demo follows the list):

  1. UTF-8, which is a variable-width character encoding, where a code point is represented by one to four bytes (up to six bytes if we ever need code points beyond U+1FFFFF). A good thing about it is that the NUL byte is only used to represent the NUL code point, so UTF-8 strings can be NUL-terminated. Another good thing is that ASCII characters are represented by single bytes identical to their ASCII codes.
  2. UTF-16, which is also (surprise!) a variable-width character encoding, where a code point is represented by one or two 16-bit code units (which, in turn, can be represented as bytes using either BE or LE byte order, giving us UTF-16LE and UTF-16BE). BMP code points are represented by one code unit; supplementary code points are represented by so-called surrogate pairs, which consist of the first (high) surrogate and the second (low) surrogate. The high/low distinction has nothing to do with byte ordering here: the surrogates encode the higher and lower bits of the code point, and the high surrogate always comes first regardless of the byte order.
  3. UTF-32, which is a fixed-width character encoding where each code point is encoded as a single 32-bit number (which, again, gives us UTF-32LE or UTF-32BE depending on the byte order).

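To make the differences concrete, here is a small sketch of mine that prints how many bytes the same strings take in UTF-8 and UTF-16 (U+20000 again stands in for any supplementary character):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizesDemo {
    public static void main(String[] args) {
        String ascii = "b";
        // One supplementary code point, stored in a Java String as a surrogate pair.
        String supplementary = new String(Character.toChars(0x20000));

        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);            // 1
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length);          // 2
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length);     // 4
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_16BE).length);  // 4 (two 16-bit code units)

        System.out.println(supplementary.length());                                   // 2 chars
        System.out.println(supplementary.codePointCount(0, supplementary.length()));  // 1 code point
    }
}
```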
As you can see, UTF-16 is pretty messed up, and if you treat it as a fixed-width character encoding, you may end up in trouble. In fact, when I finally figured all of this out, I started to think that UTF-16 is outright evil: it doesn’t have the nice properties of UTF-8 (like NUL-termination and ASCII compatibility), and its only advantage over UTF-32 is lower memory consumption, which, with modern amounts of RAM, shouldn’t be a real problem any more. And the fact that it’s a variable-width encoding screws up almost any text processing algorithm you can think of. Here is a correct solution for the LeetCode problem mentioned above, for example.
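A sketch of such a solution: iterate over code points instead of chars, so that a surrogate pair counts as a single character (assuming the usual LeetCode signature again).

```java
import java.util.HashMap;
import java.util.Map;

class Solution {
    public boolean isIsomorphic(String s, String t) {
        // Compare the real lengths, measured in code points, not in chars.
        int[] a = s.codePoints().toArray();
        int[] b = t.codePoints().toArray();
        if (a.length != b.length) {
            return false;
        }
        // The same two-way mapping as before, but keyed by code points.
        Map<Integer, Integer> forward = new HashMap<>();
        Map<Integer, Integer> backward = new HashMap<>();
        for (int i = 0; i < a.length; i++) {
            Integer prevB = forward.put(a[i], b[i]);
            Integer prevA = backward.put(b[i], a[i]);
            if ((prevB != null && prevB != b[i]) || (prevA != null && prevA != a[i])) {
                return false;
            }
        }
        return true;
    }
}
```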

It’s certainly not as efficient as the others, but it’s the one that really works (and no, you can’t say a solution works unless it handles all possible inputs correctly). Some useful String and Character methods include the following (a short demo follows the list):

  • String.codePointCount: returns the number of code points between the specified indexes. This is the true length of the string (not the number returned by String.length).
  • String.offsetByCodePoints: “adds” two indexes together, where one index is a char index and the other is measured in code points, returning the resulting char index. For example, if you have the string “冬b”, then offsetByCodePoints(0, 1) returns 2, because “b” is located at char index 2, not 1. A call to offsetByCodePoints(2, 1) returns 3 (the end of the string), because “b” is only a single code unit. This method is a kind of reverse of the previous one.
  • CharSequence.codePoints: returns an IntStream of code points.
  • Character.codePointAt, Character.codePointCount: the same as the String methods, only for character arrays.
  • Character.highSurrogate, Character.lowSurrogate: return the respective surrogate for a given code point.
  • Character.isHighSurrogate, Character.isLowSurrogate: for a given code unit, check whether it can be part of a surrogate pair. These methods are very important in the many cases where you need to distinguish surrogate pairs from BMP characters. For example, StringBuilder.reverse uses them to properly reverse a string containing surrogates (because the code units of a pair obviously must not be reversed).

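A small sketch tying these methods together; the string is built from U+20000 (an arbitrary supplementary ideograph) plus “b”, so it has 2 code points but 3 chars:

```java
public class CodePointMethodsDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20000)) + "b";

        System.out.println(s.length());                        // 3 (chars, i.e. UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));   // 2 (the true length)
        System.out.println(s.offsetByCodePoints(0, 1));        // 2 ("b" starts at char index 2)
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        // codePoints() gives the real characters: 20000, then 62 ('b').
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```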
On top of that, many methods have two variants: one accepting a char, the other accepting an int code point. The ones accepting chars should really be deprecated, because they actually encourage writing buggy code.
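For example, Character.isLetter has both variants, and only the int one gives the right answer for a supplementary letter (a sketch; U+20000 is again just a stand-in):

```java
public class CharVsCodePointDemo {
    public static void main(String[] args) {
        int ideograph = 0x20000;                            // a supplementary CJK ideograph
        char highSurrogate = Character.toChars(ideograph)[0];

        System.out.println(Character.isLetter(ideograph));     // true: the int variant sees the whole character
        System.out.println(Character.isLetter(highSurrogate)); // false: the char variant sees only a lone surrogate
    }
}
```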

Considering all that, we must conclude that while Unicode has indeed made life much easier than it was in the Dark Ages, it must be handled properly, unless we want to enter another dark age where a person may fail to register an account on some site simply because their name happens to contain a supplementary character. Or wait a minute. We have already entered it. Now we must get out, so we’d all better start writing bug-free code!
