# Encodings, Unicode and broken code

This is another sad tale of character encodings. Consider this LeetCode problem that asks you to check whether two given strings are isomorphic. Isomorphic strings are defined as strings of the same length with a bijective mapping between their characters. For example, “aba” is isomorphic to “ava” with mapping $a \leftrightarrow a, b \leftrightarrow v$ and to “mlm” with mapping $a \leftrightarrow m, b \leftrightarrow l$, but not to “aaa” (no bijection since both “a” and “b” are mapped to “a”).

Now consider possible solutions in Java. One obvious solution:

Runs in 36 ms, certainly not the fastest submission. One way to “optimize” it:

This runs in 12 ms. Three times faster! Beating 92%! And here is yet another version:

Now, let me ask a question: which of the solutions above is the best one?

The last one is something an English speaker with a C background might come up with. It will obviously break for any characters outside US-ASCII, including Cyrillic, Chinese, Hebrew or even German or Irish (because of the umlauts and the fada). So obviously it’s not acceptable.

The second one is trickier. One thing is that it might break if one of the strings contains NUL characters because we abuse NUL as the “character not mapped” special value. Another thing is that initializing the whole array with zeroes takes $\mathcal{O}(65536)$ time which could make it a poor choice for short strings.

So it looks like the first one is the best, right? It scales nicely to any length, and even though it’s slower, it handles NULs properly.

Well, the answer is: all of them are wrong! One test case that none of the solutions above will pass is “ab”, “冬b”. In case you can’t see it, here is a picture:

[Image: the character in question]

That’s right, that one weird Chinese character is enough to break all of the solutions above. Moreover, it breaks the LeetCode testing system as well (just like Cyrillic or anything non-ASCII does) and the LeetCode Discuss forums too (unlike Cyrillic and many other non-ASCII symbols).

Why? What’s wrong with that particular character? Java stores strings using Unicode, right? That’s why char is two bytes, after all! So it should be able to handle any characters without any problems! The dark age of terrible national encodings is over!

In order to understand it, we must look back at the history of encodings and Unicode.

It all started in the 1960s or even long before that (Morse code came into existence long before the first computer). But it was in the 1960s that all hell broke loose. In 1963 both ASCII and EBCDIC were introduced. While even EBCDIC is apparently still in use today, it’s ASCII that became widespread, and the fact that ASCII was a 7-bit encoding meant that there was one “free” bit and 128 unused codes in the 128–255 range. That, and the lack of any letters except basic Latin, immediately gave birth to a myriad of national encodings. Worse, multiple encodings were sometimes used for the same language. I know of four Russian ones, for example: code page 866 (“MS-DOS” encoding), code page 1251 (“Windows” or ANSI encoding), KOI8-R (a really weird encoding that arranges letters according to the English alphabet, not the Russian one, and was really widespread in the early days of the Russian Internet) and the “standard” ISO-8859-5 that was rarely used at all. This is still a major source of trouble: when you run a program in a console window, you have no idea which encoding will be used and therefore you have about a 50% chance of getting garbage (less in practice because most programs will use the MS-DOS encoding). And nobody plans to fix it because it is impossible and because nobody cares about console windows nowadays.

Chinese and Japanese people got it even worse: 128 values are obviously not enough to represent about 2000 ideographs in Japanese (and that’s only a subset of Chinese!), so they went ahead and invented two-byte encodings, which made things much worse because now, having some bytes, you couldn’t even determine the string length if you had no idea which encoding is used.

Then Unicode came into being. The first standard was published in 1991 and it introduced a 16-bit encoding intended for universal use, which included all characters deemed reasonable. Unfortunately, the bunch of Old Evil Encodings didn’t disappear at the very same moment, so the only thing that really happened that day is that the world now had one more encoding to deal with. No, wait, make it two encodings because Unicode defined characters as 16-bit units, but those can be represented with bytes using either Little Endian or Big Endian order.

Even worse, Unicode apparently failed to consider some important characters like rarely used ideographs (like that 冬), even though they are a part of personal names and names of places. Imagine you can’t type your own name as you’re trying to use some software! So apparently some extension was needed. That is how Unicode transformed from a single 16-bit encoding into a whole standard of concepts and encodings.

The core concept is the code point. A code point is a 21-bit number corresponding to some character, typically represented as a 32-bit integer in memory and as something like U+00B0 in writing, where 00B0 is the hexadecimal value of the code point (in this case it’s the degree sign: °). The current range for the code points is U+0000–U+10FFFF, hence 21 bits (but it’s extendable). So, you see, to say that Unicode is a 16-bit encoding is wrong in several ways: Unicode is a standard (defining multiple encodings), not an encoding, and not all Unicode encodings are 16-bit.

The code points defined in the first Unicode standard now belong to the so-called Basic Multilingual Plane (BMP), and that includes code points in the range U+0000–U+FFFF. That is Latin, English, Arabic, Hebrew, most Chinese and Japanese and lots of other useful things. However, there are some Chinese symbols outside the BMP, which belong to the so-called supplementary planes, and the code points U+10000 and above are called supplementary code points (or characters).

There are three main encodings in the current Unicode standard. By “main” I mean that they are both part of the standard and are widely used. These are:

1. UTF-8, which is a variable width character encoding, where a code point can be represented by one to four bytes (up to six bytes if we ever need code points above U+200000). A good thing about it is that the NUL byte is only used to represent the NUL code point, so UTF-8 strings can be NUL-terminated. Another good thing is that ASCII characters are represented by single bytes identical to their ASCII representation.
2. UTF-16, which is also (surprise!) a variable width character encoding, where a code point can be represented by one or two 16-bit code units (which, in turn, can be represented by two bytes using either BE or LE byte order, that makes UTF-16LE and UTF-16BE). BMP code points are represented by one code unit, supplementary code points are represented by the so-called surrogate pairs, which consist of the first (high) surrogate and the second (low) surrogate. The high/low concept doesn’t really have anything to do with byte ordering here, they encode higher and lower bits of the code point, and the high surrogate always comes first regardless of the byte order.
3. UTF-32 is a fixed width character encoding where each code point is encoded as a single 32-bit number (which, again, makes it UTF-32LE or UTF-32BE depending on the byte order).

As you can see, UTF-16 is pretty messed up, and if you consider it a fixed-width character encoding, you may end up in trouble. In fact, when I finally figured all this out, I started to think that UTF-16 is outright evil: it doesn’t have the nice properties of UTF-8 (like NUL-termination and ASCII compatibility) and its only advantage over UTF-32 is lower memory consumption, but with modern amounts of RAM it shouldn’t be a real problem any more. And the fact that it’s a variable width encoding screws up almost any text processing algorithm you can think of. Here is a correct solution for the mentioned LeetCode problem, for example.
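A minimal sketch of what a code-point-aware check could look like. The class and method names (`IsomorphicStrings`, `isIsomorphic`) are my own for illustration, and the test character U+20000 (written as the surrogate pair `\uD840\uDC00`) is just an assumed example of a supplementary ideograph. The idea is simply to compare code points instead of chars and to keep the mapping in both directions:

```java
import java.util.HashMap;
import java.util.Map;

final class IsomorphicStrings {
    // Compares the strings code point by code point, so a supplementary
    // character (two UTF-16 code units) counts as one symbol.
    static boolean isIsomorphic(String s, String t) {
        int[] a = s.codePoints().toArray();
        int[] b = t.codePoints().toArray();
        if (a.length != b.length) {
            return false;
        }
        Map<Integer, Integer> forward = new HashMap<>();   // s -> t
        Map<Integer, Integer> backward = new HashMap<>();  // t -> s
        for (int i = 0; i < a.length; i++) {
            Integer f = forward.putIfAbsent(a[i], b[i]);
            Integer r = backward.putIfAbsent(b[i], a[i]);
            // A previously stored mapping must agree in both directions.
            if ((f != null && f != b[i]) || (r != null && r != a[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+20000 is an assumed example of a supplementary ideograph.
        System.out.println(isIsomorphic("ab", "\uD840\uDC00b"));
        System.out.println(isIsomorphic("aba", "aaa"));
    }
}
```

A char-based version would already fail on the length comparison here, since the second string is three chars long but only two code points long.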

It’s certainly not as efficient as the others, but it’s the one that really works (and no, you can’t say it works unless it handles all possible inputs correctly). Some useful String and Character methods include:

• String.codePointCount: returns the number of code points between the specified indexes. This is the true length of the string (not the number returned by String.length).
• String.offsetByCodePoints: “adds” two indexes together, when one index is a char index and another one is measured in code points, returning the resulting char index. For example, if you have the string “冬b”, then offsetByCodePoints(0, 1) would return 2 because “b” is located at index 2, not 1. A call to offsetByCodePoints(2, 1) would return 3 (the end of the string) because “b” is only a single code unit. This method is a kind of reverse of the previous one.
• CharSequence.codePoints: returns an IntStream of code points.
• Character.codePointAt, Character.codePointCount: same as the String methods, only for character arrays.
• Character.highSurrogate, Character.lowSurrogate: return the respective surrogate for a given code point.
• Character.isHighSurrogate, Character.isLowSurrogate: for a given code unit, check whether it’s a part of a possible surrogate pair. This is a very important method for the many cases when you need to distinguish surrogate pairs from BMP characters. For example, StringBuilder.reverse uses it to properly reverse a string that contains surrogates (because the two halves of a pair obviously must not be swapped).
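To see these methods in action, here is a small sketch. The class name `CodePointDemo` is hypothetical, and the test string is an assumption of mine: U+20000 (a supplementary ideograph, spelled out as the surrogate pair `\uD840\uDC00`) followed by “b”:

```java
final class CodePointDemo {
    public static void main(String[] args) {
        // Assumed example: U+20000 (one code point, two chars) followed by 'b'.
        String s = "\uD840\uDC00b";

        System.out.println(s.length());                      // chars (code units)
        System.out.println(s.codePointCount(0, s.length())); // code points
        System.out.println(s.offsetByCodePoints(0, 1));      // char index of 'b'
        System.out.println(s.offsetByCodePoints(2, 1));      // end of the string

        // Surrogates for a given code point, and the checks on code units.
        System.out.println(Character.highSurrogate(0x20000) == s.charAt(0));
        System.out.println(Character.lowSurrogate(0x20000) == s.charAt(1));
        System.out.println(Character.isHighSurrogate(s.charAt(0)));
        System.out.println(Character.isLowSurrogate(s.charAt(1)));

        // reverse() keeps the surrogate pair in order, so the code point survives.
        String reversed = new StringBuilder(s).reverse().toString();
        System.out.println(reversed.codePointAt(1) == 0x20000);
    }
}
```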

On top of that, many methods have two variants: one accepting a char, the other accepting a code point. Those accepting chars should really be deprecated because they actually encourage writing buggy code.

Considering all that, we must conclude that while Unicode indeed made life much easier than it was in the Dark Ages, it must be handled properly unless we want to enter another dark age where a person may fail to register an account on some site simply because he happened to have a supplementary character. Or wait a minute. We have already entered it. Now we must get out, so we all better start writing bug-free code!

# Best Time to Buy and Sell Stock IV

Another awesome problem on LeetCode deals with buying and selling stock. You are given a list of prices, are allowed to only open long positions and you must close one before opening a new one. This would be trivial (buy on low, sell on high), but you’re limited to a total of k buy/sell transaction pairs. You need to return the maximum profit.

At first I thought it was a dynamic programming problem. A quick peek at the tags confirmed that suspicion. I quickly realized that the subproblems require determining the maximum profit for d days, starting with 2 (can’t make a profit on one day because only one price per day is given). Moreover, it is quite obvious that we have to determine maximum profits for limited numbers of transactions in the range 0..k. This requires only k memory because you only need the profits for the previous day to compute tomorrow’s profits. You need to consider, though, that you may have left your position open, so some potential profit may exist. In this case you also have to record the opening price.

In the end it looked like this:

Not very elegant, but pretty straightforward. profitClosed[j] is the maximum profit that may be made by today with j buys/sells, while leaving no open position. profitOpen[j] is the same thing, but with an open position, and openPrice[j] denotes the best buying price so far, so we can instantly gain profitOpen[j] + current price − openPrice[j] by selling today. And that’s what we do, but only if the resulting profit is more than we could gain by performing fewer transactions. When we do it, the number of elements in the profitClosed array becomes greater than the number of elements in profitOpen, so we immediately open a new position on the next loop.

Then we update our profits in the inner loop. We re-close an open position if we can get a better profit and we reopen if we can get a better profit plus potential profit! That is important because that’s what an open position is about: if we can instantly get that much profit, it would be a waste to reopen at the current price even if we can get a better profit today, it will still bite us in the future.

This solution runs in 4 to 6 ms depending on the weather on Mars and the number of holes in the cheese. For example, replacing > profitClosed[lClosed] with > 0 in the pre-loop close-another-position condition speeds it up (branch prediction?) even though we open not very profitable positions.

There are much better solutions using the same idea. This one is particularly awesome, as it does both loops in just four lines.
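For reference, here is a minimal sketch of the compact form this DP idea usually takes. It is not necessarily the solution referred to above; the class name `StockDp` and the array names `buy` and `sell` are my own. `buy[j]` is the best balance after the j-th purchase and `sell[j]` is the best balance after the j-th sale:

```java
import java.util.Arrays;

final class StockDp {
    static int maxProfit(int k, int[] prices) {
        // More than n/2 transactions can never help, so cap k to bound memory.
        k = Math.min(k, prices.length / 2);
        int[] buy = new int[k + 1];
        int[] sell = new int[k + 1];   // sell[0] stays 0: no transactions, no profit
        Arrays.fill(buy, Integer.MIN_VALUE / 2);
        for (int price : prices) {
            for (int j = 1; j <= k; j++) {
                // Either keep the position we have or open the j-th one today.
                buy[j] = Math.max(buy[j], sell[j - 1] - price);
                // Either stay closed or close the j-th position today.
                sell[j] = Math.max(sell[j], buy[j] + price);
            }
        }
        return sell[k];
    }

    public static void main(String[] args) {
        int[] prices = {3, 2, 6, 5, 0, 3};
        System.out.println(maxProfit(2, prices));
    }
}
```

Updating `buy[j]` and `sell[j]` in place is safe here because buying and selling on the same day contributes exactly zero.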

However, I didn’t particularly like the DP idea here. It felt like this problem should have a better solution, so I kept on digging. After seeing this one, I decided to do the same thing in Java:

Not terribly concise, but pretty straightforward. The idea is that we calculate the profit that we’ll lose if we do nothing on a particular day, thus reducing the number of transactions. Then we keep on throwing away those days starting with those that give the least profit. Obviously throwing away days on monotonous intervals won’t affect the profit at all, so we don’t even add those to begin with. We are only interested in “peaks” and “valleys” (plus possibly the first and the last days).

Maybe replacing the index arrays with a linked list was a bad idea. It’s definitely worth trying arrays instead. As it is, it runs in 20 ms. Not so terribly efficient. But I don’t think switching to arrays will improve it that much because the main slowdown here is the tree. Even though I only add peak/valley days and do it only if transaction reduction is needed, it is still a pretty slow thing. And it still didn’t feel like quite the right thing.

Then I saw this solution based on another one. And I must admit, it really took me a while to figure these out. Especially the loop invariants. So I hereby present my own implementation of the same ideas that runs in 3 ms and in $\mathcal{O}(n)$ time (if we assume quickselect is linear, which is a fair assumption since with a randomized pivot its expected running time is linear).

And here comes the explanation. Consider this price chart. Let’s consider only closing prices (80 for the first day, 70 for the 2nd, 75 for the 3rd and so on). For a “bullish” day (green) the closing price is the top of the candle, for a “bearish” one it’s the bottom. If we were to aim for the maximum profit, we’d perform the following transactions:

1. Days 2–6, prices 70–100, profit 30.
2. Days 8–11, prices 50–80, profit 30.
3. Days 13–17, prices 20–60, profit 40.
4. Days 19–21, prices 40–50, profit 10.
5. Days 23–27, prices 30–70, profit 40.
6. Days 29–31, prices 10–100, profit 90.

The total profit is 30+30+40+10+40+90=240.

Now, we are limited to some number of transactions. If that’s only one, then the best we can get is the last transaction, that is, 90. If two, then the best is to buy on the 13th day, sell on the 27th (profit 50), buy on the 29th and finally sell on the 31st for the total profit of 140. Note that the first transaction is not even among the list of transactions we’d perform if we weren’t limited. That is because the third, fourth and fifth transactions (the green runs on days 13–27 of the chart) are best united into a single one. So we need to figure out which transactions to unite.

Consider the following problem. For given intervals [v1, p1], [v2, p2], v1 < p1, v2 < p2, where “v” stands for a valley and “p” for a peak, what are the possible relationships between them? There are six:

1. They don’t overlap, and p1 ≤ v2. For example, days 13–15 and 15–17 with prices [20,40], [40,60].
2. They don’t overlap, and v1 ≥ p2. For example, days 2–6 and 13–17 with prices [70,100], [20,60].
3. They do overlap, and v2 < p1. For example, days 13–17 and 23–27 with prices [20,60], [30,70].
4. They overlap, and v1 < p2. For example, days 2–6 and 8–11 with prices [70,100], [50,80].
5. The second interval is fully included in the first one. For example, days 13–17 and 19–21 with prices [20,60], [40, 50].
6. The first interval is fully included in the second one. For example, days 23–27 and 29–31 with prices [30,70], [10,100].

I’m not totally rigorous here about the strictly-less/less-than-or-equal thing. It doesn’t matter, though, because corner cases when something is equal to something else may be handled as either—they are kind of “at the border” between the two and belong to both sets. For example, if two transactions have exactly the same price range, then it doesn’t really matter whether you think of the second one included in the first one or vice versa. Or you may even want to consider them to be overlapping.

Now we need to ask ourselves a question: if we have two transactions and are allowed to make only one, how to get the maximum profit? Let’s consider all six cases.

[Figure: Case 1: transactions (1) and (2) are combined into (3)]

The first case is not really possible for adjacent transactions if we only consider “peak-valley” transactions to begin with (the end of the first transaction is not really a peak, and the beginning of the second one is not a valley). Indeed, if two intervals form a monotonic non-decreasing sequence, then why bother splitting them into two intervals at all? However, for transactions far apart, this case is still possible (although not in our example). Anyway, in this case the answer is quite obvious: just combine the two transactions into a single one, buying at v1 and selling at p2.

[Figure: Case 2: between transactions (1) and (2) we pick (2) because it’s more profitable]

The second case is also obvious: just pick the most profitable one. We can’t unite them because the starting price of the first one is greater than the selling price of the second one. If you buy on the 2nd day and sell on the 17th, you’ll have a loss of 10 instead of any profit.

[Figure: Case 3: two overlapping transactions (1) and (2) are transformed into one long transaction (3) and an imaginary short transaction (4)]

The third case is the trickiest one! Since the lowest price is v1 and the highest price is p2, the best is to combine them into a single transaction. This sounds like it increases the transaction count tremendously (as we have to consider all possible combinations), but in fact it doesn’t. We do a really amazing trick here: instead of considering them as two separate transactions to begin with, we think of them as one “long” transaction (buying at v1, selling at p2) and one “short” transaction (selling at p1 and buying later at v2), even though short transactions aren’t technically allowed by this problem! This works because we pick transactions starting with the ones giving us the most profit. In this case, the long one is guaranteed to give us more profit than the short one, so we’ll either pick the long one (without violating anything) or pick both. And since both transactions together give us exactly the same profit as the two long transactions we had to begin with (p1 − v1 + p2 − v2 = p2 − v1 + p1 − v2), it appears in the net result as if we performed two separate long transactions.

[Figure: Case 4: between transactions (1) and (2) we choose the most profitable (in this case any of them) because the combined transaction (3) is the least profitable]

The fourth case is trivial: since the combined transaction is the least profitable, we just pick the most profitable one of the two. In our example they are equally profitable, though.

[Figure: Case 5: the second transaction is included in the first, so the first one gives the best profit we can get]

The fifth case is even more boring: the first transaction has both the lowest price and the highest price, so we just pick the first one.

[Figure: Case 6: the second transaction gives the most profit]

The sixth case is the mirror image of the fifth one. Just pick the second transaction.

So we have one case when we should combine transactions unconditionally, two cases when we should choose the most profitable of the two, two cases when the most profitable one is obvious and one tricky case when we transform two long transactions into one long and one imaginary short one. Now to the algorithm.

The algorithm preserves the following invariant. When an outer loop iteration finishes, the valleys/peaks stack contains transactions that are related according to case 5 above, with the latest transaction on the top. The invariant is obviously true at the beginning since the stack is empty, and any proposition is true for the elements of an empty set (all humans that have visited other galaxies have green ears—there are none at the moment of writing, so I’m not wrong in saying that “all” (zero) of them have green ears).

The invariant is preserved by these two loops:

The first loop just pops transactions with v1 > v2, which corresponds to cases 2, 4 and 6. In all these cases we don’t combine transactions, so it’s fine to just pop them and consider separately. The termination condition guarantees that v1 ≤ v2 at the end of that loop, assuming there are any elements left in the stack.

The second loop handles the tricky case 3. We pop a transaction, then generate a “short” transaction and put it into the profits array. Then we set the current valley to the one popped from the stack, so the current transaction now corresponds to the long one. Then we continue the loop because it may or may not be possible to combine it with the previous one and so on. Note that the termination condition guarantees that p1 > p2 at the end of the loop, assuming there are any elements left. Assuming the invariant held true at the start of the outer loop, popping some elements could not have increased the valley at the top of the stack because it may only decrease as we go deeper.

Note that we don’t consider case 1 separately. Instead, we treat it as a special subcase of case 3 where the “short” transaction gives us negative profit. Since we’re going to pick only the highest profits anyway, this is fine. One may think that it may have negative impact on the total profit in the case where k is high enough to allow all transactions to complete, but in fact it may not. This is because two adjacent transactions can’t form case 1 anyway, remember? So when we have transactions like this, it means that they are separated by some transactions in the middle. And if we’re going to pick every profitable transaction, then we can’t really combine those two because we aren’t allowed to engage in multiple transactions. So this negative profit corresponds to the profit loss caused by the fact that we need to sell first in order to buy again.

Lastly, since case 5 corresponds to the items in the stack, we pick top k profits using randomized quickselect and sum them up.

One last note before we get to the example: the loop may generate one false peak/valley pair at the end of the input array if the prices list ends with a valley. This corresponds to a zero-length transaction giving zero profit, so instead of checking for this corner case we can allocate one more element in the profits array to hold this zero. It won’t affect the result in any way.
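The two loops and the final selection can be sketched roughly as follows. This is my reconstruction from the description, not a definitive listing: the names (`ValleyPeakStock`, `maxProfit`) are hypothetical, and for brevity I sort the profits instead of using randomized quickselect, which makes this sketch $\mathcal{O}(n \log n)$ rather than linear. The price array in `main` is only the valley/peak skeleton of the example, not the full chart:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

final class ValleyPeakStock {
    static int maxProfit(int k, int[] prices) {
        Deque<int[]> stack = new ArrayDeque<>();   // {valley, peak} price pairs
        List<Integer> profits = new ArrayList<>();
        int i = 0;
        while (i < prices.length) {
            // Walk down to the next valley, then up to the next peak.
            while (i + 1 < prices.length && prices[i + 1] <= prices[i]) i++;
            int v = prices[i];
            while (i + 1 < prices.length && prices[i + 1] >= prices[i]) i++;
            int p = prices[i];
            i++;
            // First loop (cases 2, 4, 6): stacked valley is higher, never combine.
            while (!stack.isEmpty() && stack.peek()[0] > v) {
                int[] t = stack.pop();
                profits.add(t[1] - t[0]);
            }
            // Second loop (case 3): emit the imaginary short transaction
            // and extend the current transaction down to the stacked valley.
            while (!stack.isEmpty() && stack.peek()[1] <= p) {
                int[] t = stack.pop();
                profits.add(t[1] - v);
                v = t[0];
            }
            stack.push(new int[] {v, p});
        }
        while (!stack.isEmpty()) {
            int[] t = stack.pop();
            profits.add(t[1] - t[0]);
        }
        // Sorting stands in for quickselect here (an assumption for brevity).
        profits.sort(Collections.reverseOrder());
        int total = 0;
        for (int j = 0; j < Math.min(k, profits.size()); j++) {
            if (profits.get(j) <= 0) break;   // defensive: skip non-profits
            total += profits.get(j);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] closes = {80, 70, 100, 50, 80, 20, 60, 40, 50, 30, 70, 10, 100};
        for (int k = 1; k <= 6; k++) {
            System.out.println(k + " -> " + maxProfit(k, closes));
        }
    }
}
```

A trailing valley produces a pair with v equal to p, which simply adds a zero to the profits list, as noted above.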

Now let’s see how the algorithm works on our example. Our valleys/peaks are: 70–100, 50–80, 20–60, 40–50, 30–70, 10–100.

Step 1 (70–100). The stack is empty, so the inner loops don’t execute. The transaction is pushed, so the stack becomes (bottom-to-top): [70,100].

Step 2 (50–80). The first loop pops the transaction because 70 > 50 (case 4). The stack becomes empty, therefore the second loop doesn’t execute. The next transaction is pushed into the stack.

Step 3 (20–60). The first loop pops because 50 > 20 (case 4). The rest is just like the previous step.

Step 4 (40–50). The first loop doesn’t work because 20 < 40. The second loop doesn’t work either because 60 > 50. Case 5.

Step 5 (30–70). The first loop pops only [40,50] because 40 > 30 (case 6), but 20 < 30. The second loop then transforms [20,60] and [30,70] into [60,30] (added to the profits) and [20,70] (becomes the current transaction). This is case 3.

Step 6 (10–100). The first loop pops because 20 > 10 (case 6). The second loop doesn’t work because the stack is now empty. The last transaction is pushed into the stack.

Lastly, we pop the stack and we now have the following profits, in the order they were produced: 30, 30, 10, 30, 50, 90.

A total of six transactions! Now for the different values of k we have:

1. Just pick up the last transaction.
2. Pick up 50 and 90, where 50 corresponds to 20–70 produced on the 5th step (buy on the 13th, sell on the 27th).
3. Pick up 50, 90 and any of the 30s, where 30s can be one of the first two transactions or the imaginary transaction that splits our 50 into 20–60 (days 13–17) and 30–70 (days 23-27), which gives 40+40 = 50+30.
4. Pick up 50, 90 and any two of the 30s.
5. Pick up 50, 90 and all of the 30s (meaning 50+30 now definitely means two real transactions 40+40).
6. Pick all of them (the trivial unrestricted case).

Isn’t this awesome?

# Closest Binary Search Tree Value II

Continuing on to the next interesting problem, which took me a while to solve because I had never actually done iterative tree traversals before, although I was aware that there is such a thing and knew the general idea of how to do it.

The problem is to find k values in a BST that are closest to the given target. The target is a double, and the nodes are integers so the tree may or may not contain the target itself.

The linear solution is so trivial that I didn’t even try to do it. Indeed, just perform an in-order traversal, keep the last k values in some sort of ring buffer array and terminate when the next value is worse than the worst so far.

What was interesting is how to do that faster. Or not really faster because the test cases seem to be tailored for the linear solution, but still, how to get better time complexity?

The idea is pretty obvious. We need to find the target or some value close to the target (previous or next, doesn’t matter) and then look in the neighborhood for k closest values. But how do we look in the neighborhood? If we do a recursive binary search, then we’ll lose all information about where we found it once the recursion returns. The best we can get is a reference to the found node, but no way to get back up. So it looks like we need an iterative binary search.

The iterative binary search itself is very, very easy. But we also need some way to keep track of where we are, so we need a stack. The first part would look somewhat like this then:

This locates either the target or the next value before or after the target. So now what? And here is where I got lost. A typical iterative in-order traversal would look like this, if starting from the root:

However, the stack here and the stack I got from the binary search above is not the same stack! Here we save on stack space by only pushing previous elements if we know we’ll have to get back to them eventually. And we know that it’s only when we go left, because when we go right, we won’t have to return to the already processed elements. So when we hit a dead end when going right and pop the next element from the stack, it magically takes us to the parent of the current subtree, not the parent of the current element.
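For reference, the classic space-saving iterative in-order traversal can be sketched like this. The `TreeNode` shape and the names `InOrder` and `inOrder` are assumptions for illustration. Note that nodes are pushed only in the first branch, that is, only when going left:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class InOrder {
    static final class TreeNode {
        final int val;
        final TreeNode left, right;
        TreeNode(int val, TreeNode left, TreeNode right) {
            this.val = val;
            this.left = left;
            this.right = right;
        }
    }

    static List<Integer> inOrder(TreeNode root) {
        List<Integer> out = new ArrayList<>();
        Deque<TreeNode> stack = new ArrayDeque<>();
        TreeNode cur = root;
        while (cur != null || !stack.isEmpty()) {
            if (cur != null) {
                // First branch: remember the node, we'll come back after its left subtree.
                stack.push(cur);
                cur = cur.left;
            } else {
                // Second branch: process the node, then go right without pushing it again.
                cur = stack.pop();
                out.add(cur.val);
                cur = cur.right;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(0,
                new TreeNode(-2, new TreeNode(-3, null, null), new TreeNode(-1, null, null)),
                new TreeNode(2, null, null));
        System.out.println(inOrder(root));
    }
}
```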

For example, consider the following tree. If we traverse this tree using the algorithm above, we first push left children into the stack until we hit null. That gives us the following pictures: And then we start popping elements and processing them (the second branch). The first element is a leaf, so it doesn’t have a right child, and therefore on the next iteration we pop another one, thus processing -1, -2:

The current element (-2) now has a right child, so we go there instead and push it onto stack before going left. However, since there is nowhere to go, we immediately pop it back and process it:

Now look at this! The current element is -1, but the stack only contains the root! So the next thing we do is pop it and jump all the way back up, just right to the next element in the sequence.

However, when performing a binary search and pushing elements to the stack, we never get a stack that looks like this. In fact, the stack always contains every element in the sequence leading up to the root, node-by-node. So at first glance, the stack I got by performing binary search, was not very suitable for this in-order traversal algorithm. Indeed, if I were looking for, say, -1.5, then I’d end up with something like this: Here the green element is the current element at which the search stops. In fact, I now have the two closest elements to the target: one is at the top of the stack, another one is the leaf I’m at. However, if I was to perform an in-order traversal using the algorithm above, I’d quickly end up in trouble. What would the algorithm do? It would first push -1 to the stack. Then it’d pop and process it. Then it’d try to go right, but there is nowhere to go. So it’d pop another element from the stack. But that’s -2, which doesn’t come in the right order!

Now at this point, what I was supposed to do, and what most people at LeetCode did, is to create two stacks instead of one when performing the binary search. Then the whole thing would look like this:

Now why does this work? It works because the two stacks follow the same rule as the stacks used during the classic iterative traversal: when we do in-order traversal, we only push elements to the stack if we are going left, and that’s exactly what stackGT follows here, so it can be later used to perform an in-order traversal starting at the point where we stopped. The same goes for stackLE, but for reverse in-order.
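A rough sketch of the two-stack approach, under my own assumptions about the `TreeNode` shape and the helper names `prev` and `next` (only `stackLE` and `stackGT` come from the discussion above). Each stack is filled during the binary search and then drives its own traversal:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ClosestK {
    static final class TreeNode {
        final int val;
        final TreeNode left, right;
        TreeNode(int val, TreeNode left, TreeNode right) {
            this.val = val;
            this.left = left;
            this.right = right;
        }
    }

    static List<Integer> closestKValues(TreeNode root, double target, int k) {
        Deque<TreeNode> stackLE = new ArrayDeque<>(); // pushed when going right
        Deque<TreeNode> stackGT = new ArrayDeque<>(); // pushed when going left
        for (TreeNode n = root; n != null; ) {
            if (n.val <= target) { stackLE.push(n); n = n.right; }
            else { stackGT.push(n); n = n.left; }
        }
        List<Integer> result = new ArrayList<>();
        while (result.size() < k && !(stackLE.isEmpty() && stackGT.isEmpty())) {
            // Take from whichever side is currently closer to the target.
            boolean takeLE = stackGT.isEmpty() || (!stackLE.isEmpty()
                    && target - stackLE.peek().val <= stackGT.peek().val - target);
            result.add(takeLE ? prev(stackLE) : next(stackGT));
        }
        return result;
    }

    // One step of reverse in-order: pop, then descend left once and right all the way.
    private static int prev(Deque<TreeNode> stack) {
        TreeNode n = stack.pop();
        for (TreeNode c = n.left; c != null; c = c.right) stack.push(c);
        return n.val;
    }

    // One step of in-order: pop, then descend right once and left all the way.
    private static int next(Deque<TreeNode> stack) {
        TreeNode n = stack.pop();
        for (TreeNode c = n.right; c != null; c = c.left) stack.push(c);
        return n.val;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(0,
                new TreeNode(-2, new TreeNode(-3, null, null), new TreeNode(-1, null, null)),
                new TreeNode(2, null, null));
        System.out.println(closestKValues(root, -1.5, 2));
    }
}
```

Assuming a reasonably balanced tree, the search costs O(log n) and each of the k steps is amortized cheap, which is where the improvement over the linear scan comes from.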

But as I’ve said, I didn’t quite get it at the time, so I invented a rather unorthodox approach. I noticed that even though the stack I got was unsuitable for the typical space-saving concise iterative algorithm, it was just the type of the stack used for recursive solutions! Indeed, recursion always unavoidably stores everything in the stack simply because there’s no way around that. Consider a typical recursive algorithm:

How does it work for the tree above? It pushes 0 first. Then it goes left. There it pushes -2. Goes left again. Pushes -3. There is no left, so it processes -3 and tries to go right. But there is no right, so it pops -3 and goes up. So far it’s no different from the iterative approach.

But then things change. When it returns to -2, it processes it without actually popping it and then pushes -1. So at that point we end up with exactly the same stack as in the binary search algorithm. Why is that? How is that it works and the iterative algorithm doesn’t?

The answer is that the iterative algorithm lacks one thing: the return address. Remember that the computer doesn’t only push function arguments to the stack. It also pushes the return address, so when it pops the stack, it instantly knows what to do next: process the current value (if returning from the left subtree) or exit (if returning from the right one).

So it looks like I could add some imaginary return address to the stack in order to fully emulate the recursive algorithm. But I could do even better than that. When I pop the stack, I can compare the left and right references of the popped node to the current node. If the left reference equals the current one, that means I am returning from the left subtree, otherwise I am returning from the right one.
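Here is a minimal sketch of that trick for a single in-order step, assuming the stack holds the full path from the root down to the current node (current node on top). The names `PathSuccessor` and `successor` are hypothetical; the reverse direction would be the mirror image:

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PathSuccessor {
    static final class TreeNode {
        final int val;
        final TreeNode left, right;
        TreeNode(int val, TreeNode left, TreeNode right) {
            this.val = val;
            this.left = left;
            this.right = right;
        }
    }

    // Moves the path to the in-order successor and returns it (null at the end).
    static TreeNode successor(Deque<TreeNode> path) {
        TreeNode cur = path.peek();
        if (cur.right != null) {
            // Like a recursive call: go right once, then left all the way.
            for (TreeNode c = cur.right; c != null; c = c.left) path.push(c);
            return path.peek();
        }
        // Like returning: climb until we come back from a left subtree.
        TreeNode child = path.pop();
        while (!path.isEmpty() && path.peek().left != child) {
            child = path.pop();
        }
        return path.peek();
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(0,
                new TreeNode(-2, new TreeNode(-3, null, null), new TreeNode(-1, null, null)),
                new TreeNode(2, null, null));
        Deque<TreeNode> path = new ArrayDeque<>();
        for (TreeNode n = root; n != null; n = n.left) path.push(n); // path to -3
        for (TreeNode n = path.peek(); n != null; n = successor(path)) {
            System.out.println(n.val);
        }
    }
}
```

The comparison `path.peek().left != child` plays the role of the return address: it tells whether we are coming back from the left or from the right subtree.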

Note that I still need to perform two traversals: in-order and reverse in-order to locate the closest elements. So even if I have just one stack during the binary search, I need to make a copy of it to proceed further. The resulting solution was this:

It isn’t as beautiful as the “right” one, but it has the same time complexity and is a fun one. It runs in the same time (6 ms). I have no idea why most solutions only run for 5 ms, but then again it’s probably within the margin of error.

Either way, now I’m more than familiar with both recursive and iterative traversal algorithms and can even emulate one through the other, so that’s one step further towards becoming a better programmer.

# The Longest Substring with at Most Two Distinct Characters

Here’s another LeetCode problem. This one is also tagged “hard” for whatever reason, although it looks more like medium to me. I wouldn’t even be posting about it if it wasn’t for the way I arrived at the solution.

The problem is to find, for the given string, the length of the longest substring that contains at most two distinct characters. That is, for a string like “abcdeef” that would be 3, and the possible substrings are “dee” or “eef”.

At first I thought about how the string can be represented as a graph of possible substrings. For the string “eceba”, the graph would look like this:

While this looks like a possible answer because one only has to traverse the graph in the direction that leads to a decrease in the number of distinct characters, there are some serious problems with this approach. First of all, it is not clear what the traversal algorithm would be. Which path to choose if none of them decreases the number of distinct characters? Another problem is how to count them. A frequency table can be computed in linear time, but quickly updating it when moving through the graph is tricky if one were to follow multiple paths at the same time (breadth-first search). So even achieving quadratic time is difficult with this approach, and I had a feeling that this problem can be solved in linear time.

Then I thought about what exactly gave me that feeling. And I realized that the problem resembles regular expression matching a little bit. In fact, for the string “eceba” the problem could be stated as finding the longest possible match of the regular expression “[ec]*|[eb]*|[ea]*|[cb]*|[ca]*|[ba]*”. Regular expression matching can be done in linear time, using non-deterministic finite automata (NFA), for example, but on the other hand even creating such a regular expression would be tedious and pointless.

Then I thought, OK, do I even need to create all the states? No, I don’t. I can create them dynamically as I encounter characters. I start with a single character class, and if I see another character, I add another class for that character and then mutate the previous state into a two-character class state. If I encounter a third character for any state, that state stops matching and I can remove it (checking whether it has the maximum match length so far). This gave me the following solution:
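A minimal sketch of this dynamic-state idea, assuming a hypothetical `State` class that remembers where its match started and which characters it accepts (this is a reconstruction for illustration, not the original submission):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class DynamicStates {
    // One live state: a match that started at `start` and accepts `first`
    // and, once it has been mutated, `second` (-1 means "not set yet").
    private static final class State {
        final int start;
        final char first;
        int second = -1;
        State(int start, char first) {
            this.start = start;
            this.first = first;
        }
    }

    static int longest(String s) {
        List<State> states = new ArrayList<>();
        int best = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Iterator<State> it = states.iterator();
            while (it.hasNext()) {
                State st = it.next();
                if (c == st.first || c == st.second) {
                    continue; // the state keeps matching
                }
                if (st.second < 0) {
                    st.second = c; // one-character class becomes a two-character class
                } else {
                    best = Math.max(best, i - st.start); // third character: the state dies
                    it.remove();
                }
            }
            states.add(new State(i, c)); // a new one-character state starts at every position
        }
        for (State st : states) {
            best = Math.max(best, s.length() - st.start); // states still alive at the end
        }
        return best;
    }
}
```

Every position starts its own state here, which keeps the sketch simple but is exactly the source of the state explosion on long runs of a single character.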

This executed in 23 ms, beating 56% of other solutions. Not bad, but then I realized that I create too many states. What helps is that most of them end up being removed pretty quickly, but it is still possible to create a lot of them. In fact, for a string consisting of a single repeating character, that would blow up into numerous states matching the same character, but starting at different positions. That is, of course, pointless.

But first I decided to get rid of the allocation overhead. Since the maximum number of states is limited to the size of the string, I could use simple arrays to store the states. And get rid of a separate class too. That led to this:

This gave me 11 ms and beat 78%. Then I thought about reducing the number of states. Two-character states are kind of hard to locate and compare. How would I know that I already have a [ca]* state before adding an [ac]* state? But for one-character states that would be easy. Just don’t create a new state when you see the same character again:

But when I executed this I got 41 ms (32%)! Why?! The reason, I suspect, is branch prediction. Adding an additional unpredictable branch to the loop probably made tests containing a lot of short same-character substrings execute much slower. Unfortunately, LeetCode doesn’t publish test cases, so there was no way to profile it. It did, however, execute much faster on a long string consisting of only one repeating character.

Then I decided to look up other solutions. One that I liked in particular is based on having just two indices. Constant memory and linear time. Not bad, eh?
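One common way to write such a two-index solution is sketched below. It is not necessarily the exact submission referred to above, and the names `i` (start of the current window) and `j` (last position of the other character in the window) are assumptions:

```java
class TwoIndices {
    static int longest(String s) {
        int i = 0;   // start of the current window
        int j = -1;  // last index of the character that differs from the current run
        int best = 0;
        for (int k = 1; k < s.length(); k++) {
            if (s.charAt(k) == s.charAt(k - 1)) {
                continue; // still inside the same run
            }
            if (j >= 0 && s.charAt(j) != s.charAt(k)) {
                // A third distinct character: close the window and restart it
                // right after the last occurrence of the dropped character.
                best = Math.max(best, k - i);
                i = j + 1;
            }
            j = k - 1;
        }
        return Math.max(best, s.length() - i);
    }
}
```

For “eceba” the window “ece” is closed when “b” arrives, which gives 3.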

That made me think about optimizing my solution further. I still had a feeling that the number of states may be greatly reduced, but how? And then I thought, how many states can my NFA be in at any given moment? Hey, looks like it’s only two! That’s right, at any given moment my NFA matches one single character class (that is the last character I saw) and possibly one two-character class (the previous two distinct characters I saw). Looks like I don’t even need an array of states! And it also looks like I don’t have to check for maximum each time I inspect a state. I can only check when a state is about to be removed (no longer matches) or when I reach the end of the string.

This is the final solution (5 ms, 96%). And when I actually implemented it, I realized that it is very similar to that two-pointer solution. In fact, it is the same algorithm written in a slightly different way. But the way I arrived at it! Looking at this code, one would never see any NFAs in it, and yet the code is based on the same idea.
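A sketch of what such a two-state version might look like, with hypothetical variable names: `runStart` is where the one-character state began matching, `pairStart` is where the two-character state began. The timings above refer to the original code, not to this reconstruction.

```java
class TwoStates {
    static int longest(String s) {
        if (s.isEmpty()) {
            return 0;
        }
        int best = 0;
        int pairStart = 0;       // start of the two-character class match
        int runStart = 0;        // start of the one-character class match
        char last = s.charAt(0); // character of the one-character state
        int other = -1;          // second character of the two-character state, -1 if none yet
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == last) {
                continue; // both states keep matching
            }
            if (c != other) {
                if (other >= 0) {
                    // The two-character state stops matching: check the maximum only here.
                    best = Math.max(best, i - pairStart);
                }
                // The one-character state mutates into the new two-character state.
                pairStart = runStart;
            }
            other = last;
            last = c;
            runStart = i; // a fresh one-character state starts at the new character
        }
        return Math.max(best, s.length() - pairStart);
    }
}
```

The maximum is updated only when the two-character state is discarded and once more at the end of the string, as described above.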

It can be made even faster by first converting the string into a character array, like that 100% guy did. But I don’t like this approach much because it essentially trades an infinitesimal performance boost for double the memory consumption. Another approach is to use Reflection to access the “value” field of the string, but that is beyond good and evil: it won’t work for any Java implementation that happens to have the field named differently. That kind of dirty hack should be reserved for desperate situations only.

# The Median of Two Sorted Arrays

It’s been a while since I created this site. I’ve been thinking about pulling some bits out of my older sites and blogs to put here before I actually continue on with this thing, but that proved to be rather tedious work. So after almost two years I figured that I’d better start a new life, so to speak, and be done with it.

So here is my first post in this new life. I have recently discovered the LeetCode site which is an awesome tool for getting ready for a software engineer interview. It’s not like I’m going to do one any time soon (although who knows?), but I felt that it’s a great chance to polish up my coding skills, so I went on.

At first I looked up solutions if I was unable to find one myself in a matter of a few hours. After doing that several times, though, I found it disappointing. While some of those problems required knowledge of arcane algorithms that people like Knuth spent decades developing, some others only required hard thinking, and I felt like I was giving up too early. So I started looking harder for solutions and stopped looking them up. In the worst case, when I felt like I needed a very specific algorithm to solve a part of the problem, I’d look up that very algorithm and then start to think how I would go about using it in this problem.

One of the problems I had to solve recently was this. Find the median of two sorted arrays of integers. The median is defined as the middle number of the array that would be the result of merging the two arrays together into one sorted array. And if this array would have an even number of elements, then the median is the average of the two middle ones. The required time complexity was $O(\log (n+m))$, where n and m are the array sizes (it would be trivial to solve in linear time).

And this problem baffled me completely. It was pretty obvious that it doesn’t require any fancy algorithms. No knowledge of graphs and trees and whatnot would help me in this problem. The logarithmic complexity requirement obviously meant some sort of a binary search, but how would one go about binary searching two arrays?

My first thought was to take medians of the two arrays and compare them. Never mind the odd/even size cases for now. If the medians are equal, then it is probably the answer. The idea is that if I was to merge the arrays, then half of the elements from the second array would go to the left part and half to the right part, so the median will remain the median. But what if they are not equal? Suppose the median of the second array is larger. That means that the second array has more elements that are larger than the median of the first one. It follows that if I was to merge the second array into the first one, then the median of the first one will shift left, so the real median is greater than or at least equal to it. By the same logic, if I was to merge the first array into the second one, it becomes clear that the median must also be less than or at least equal to the median of the second array.

So far so good. That gets rid of the left part of the first array and the right part of the second array. If I am able to continue this process, then I’ll get exactly $O(\log(n+m))$ complexity. But continuing this process proved more complicated than it seemed because it makes no sense any more to pick medians of the remaining arrays. I’d get a wrong answer because once I got rid of some elements, it’s not the median any more that I’m looking for. At least not if the arrays are of different sizes. Now I understand that if I continued with that approach, I’d get the right answer if I was able to figure out what I was looking for exactly. And that is the kth smallest element. For the median it’s roughly the $\frac{n+m}{2}$th element. And by throwing away some elements, I reduce the problem to the problem of finding the $\left(\frac{n+m}{2}-l\right)$th element, where l is the number of elements I have thrown away. But at that time I wasn’t able to get a good grasp of the exact nature of the problem, so I got stuck.
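For reference, the kth-smallest reduction can be sketched as follows. This is the textbook variant that discards up to k/2 of the smallest remaining candidates on each step instead of comparing medians, so it is an assumption about how the idea could be finished rather than the approach I was attempting. The class and method names are made up.

```java
class KthOfTwo {
    // Returns the k-th smallest element (1-based) of the union of two sorted arrays.
    static int kthSmallest(int[] a, int[] b, int k) {
        if (k < 1 || k > a.length + b.length) {
            throw new IllegalArgumentException("k is out of range");
        }
        int i = 0; // elements of a discarded so far
        int j = 0; // elements of b discarded so far
        while (true) {
            if (i == a.length) {
                return b[j + k - 1];
            }
            if (j == b.length) {
                return a[i + k - 1];
            }
            if (k == 1) {
                return Math.min(a[i], b[j]);
            }
            int half = k / 2;
            int ni = Math.min(i + half, a.length) - 1; // candidate index in a
            int nj = Math.min(j + half, b.length) - 1; // candidate index in b
            if (a[ni] <= b[nj]) {
                // a[i..ni] cannot contain the k-th element: throw them away.
                k -= ni - i + 1;
                i = ni + 1;
            } else {
                k -= nj - j + 1;
                j = nj + 1;
            }
        }
    }
}
```

Each step subtracts the number of discarded elements from k, which is the reduction described above with l being the total thrown away so far.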

The next idea was to use a binary search on the first array. Take the middle element, then check if it’s the median or not. How do I check? Well, the median needs to have exactly the same number of elements (plus or minus one if the total length is even) less than or equal to it on the left as greater than or equal to it on the right. For any given element of the first array, I know how many elements there are to the left and to the right of it in that array, but now I need to find out how many of them are in the second array. I can determine that by performing a binary search on the second array. If the would-be median is not there, then I find its insertion point and I know exactly how many elements less than and greater than it are there. If it is there, though, then I need to look to the left and to the right of it because there may be many repeating elements that are equal to it. By repeating this process, I’d find the median if it is in the first array. If it’s not, then I need to repeat the whole thing for the second one. And for an even sum of lengths I need to do the same thing for the second median element. Since I am performing a binary search, and on each step there is another binary search, the resulting complexity is $O(\log n \log m)$, which is not exactly what we want either.

And then I saw the light. When doing the binary search, I need not calculate how many elements less/greater than or equal to the would-be median there are in the second array! I already know how many there are in the first one, so I know exactly how many I am lacking. What I need to check is whether there is exactly the right number in the second array. I can determine that instantly by looking at the elements i and i-1 in the second array, where i is the number of the lacking elements to the left of the would-be median. If the (i-1)th element is less than or equal to the would-be median and the ith element is greater than or equal to it, then it is the median. Note that these two conditions cannot both be false because the array is sorted. If the first condition is false, it means that there are not enough elements in the second array, so the median must be somewhere to the right in the first array (if it’s there in the first place). If the second condition is false, it means that there are too many elements and the median is to the left.

If the process finds the median, then all is good. If it doesn’t, then I now have an insertion point, that is, the place where it would be if it was there. But that means I know exactly how many elements less than the median there are in the first array! That gives me the location of the median in the second array instantly! Bingo!

Moreover, since I am performing a binary search on only one array, I might as well just pick the smallest one. That gives $O(\log \min (m, n))$ complexity! Although that is only a modest gain over the required one if the sizes are of the same order, since $\log (n+m)$ then exceeds $\log \min (m, n)$ by only about one step.

And to make things even better, I don’t need to repeat the process twice for even sizes. If I find the first median, then the next one must be right after it in one of the arrays. And since I know the position of the first median in one array and the insertion point in another, I can just check both, and pick the minimum one.

The full code for this solution is below. Somehow it beats less than 5% of other solutions on LeetCode, but that doesn’t mean much for two reasons. The first is that people tend to take the best solution published by other submitters and submit it as their own simply to check whether that solution is really that fast (there is no way to figure that out without submitting). That leads to large peaks at the best solutions. The other reason is that my solution is 8 ms, while the best ones are 5 ms. But 3 ms is too small a difference to measure precisely, and I don’t know exactly what the margin of error is.
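What follows is not the original submission but a minimal sketch of the same idea in its common “partition” form: binary search over how many elements `i` the left half takes from the shorter array, with the count `j` taken from the longer one following from it. The names and the use of `Integer.MIN_VALUE`/`Integer.MAX_VALUE` as sentinels are assumptions.

```java
class MedianOfTwo {
    static double median(int[] a, int[] b) {
        if (a.length > b.length) {
            return median(b, a); // always binary search the shorter array
        }
        int n = a.length, m = b.length;
        if (m == 0) {
            throw new IllegalArgumentException("both arrays are empty");
        }
        int half = (n + m + 1) / 2; // size of the left half of the merged array
        int lo = 0, hi = n;
        while (lo <= hi) {
            int i = (lo + hi) >>> 1; // elements taken from a
            int j = half - i;        // elements that must then come from b
            int aLeft = i == 0 ? Integer.MIN_VALUE : a[i - 1];
            int aRight = i == n ? Integer.MAX_VALUE : a[i];
            int bLeft = j == 0 ? Integer.MIN_VALUE : b[j - 1];
            int bRight = j == m ? Integer.MAX_VALUE : b[j];
            if (aLeft > bRight) {
                hi = i - 1; // too many elements taken from a
            } else if (bLeft > aRight) {
                lo = i + 1; // too few elements taken from a
            } else {
                int leftMax = Math.max(aLeft, bLeft);
                if ((n + m) % 2 == 1) {
                    return leftMax;
                }
                // The second middle element is right after the first in one of the arrays.
                int rightMin = Math.min(aRight, bRight);
                return (leftMax + (double) rightMin) / 2.0;
            }
        }
        throw new IllegalStateException("input arrays must be sorted");
    }
}
```

For example, `MedianOfTwo.median(new int[]{1, 2}, new int[]{3, 4})` averages the two middle elements 2 and 3.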