# The Longest Substring with at Most Two Distinct Characters

Here’s another LeetCode problem. This one is also tagged “hard” for whatever reason, although it looks more like medium to me. I wouldn’t even be posting about it if it weren’t for the way I arrived at the solution.

The problem is: given a string, find the length of the longest substring that contains at most two distinct characters. For a string like “abcdeef”, the answer would be 3, and the possible substrings are “dee” and “eef”.

At first I thought about how the string can be represented as a graph of possible substrings. For the string “eceba”, the graph would look like this:

While this looks like a possible approach, because one only has to traverse the graph in the direction that leads to a decrease in the number of distinct characters, there are some serious problems with it. First of all, it is not clear what the traversal algorithm would be. Which path to choose if none of them decreases the number of distinct characters? Another problem is how to count the distinct characters. A frequency table can be computed in linear time, but updating it quickly when moving through the graph is tricky if one were to follow multiple paths at the same time (breadth-first search). So even achieving quadratic time is difficult with this approach, and I had a feeling that this problem could be solved in linear time.

Then I thought about what exactly gave me that feeling. And I realized that the problem resembles regular expression matching a little bit. In fact, for the string “eceba” the problem could be stated as finding the longest possible match of the regular expression “[ec]*|[eb]*|[ea]*|[cb]*|[ca]*”. Regular expression matching can be done in linear time, using a non-deterministic finite automaton (NFA), for example, but on the other hand even creating such a regular expression would be tedious and pointless.

Then I thought, OK, do I even need to create all the states? No, I don’t. I can create them dynamically as I encounter characters. I start with a single character class, and if I see another character, I add another class for that character and then mutate the previous state into a two-character class state. If I encounter a third character for any state, that state stops matching and I can remove it (checking whether it has the maximum match length so far). This gave me the following solution:
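In outline, the state-based approach can be sketched like this (a reconstruction of the idea described above rather than the exact submitted code; the method name follows LeetCode’s convention):

```java
import java.util.ArrayList;
import java.util.List;

class Solution {
    // One NFA state: a character class of one or two characters,
    // plus the position where its match started.
    private static class State {
        char a, b;   // b == 0 means a single-character class [a]*
        int start;
        State(char a, int start) { this.a = a; this.start = start; }
    }

    static int lengthOfLongestSubstringTwoDistinct(String s) {
        int max = 0;
        List<State> states = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            for (int j = states.size() - 1; j >= 0; j--) {
                State st = states.get(j);
                if (c == st.a || c == st.b) continue;   // state still matches
                if (st.b == 0) { st.b = c; continue; }  // grow [x]* into [xc]*
                max = Math.max(max, i - st.start);      // third distinct char: state dies
                states.remove(j);
            }
            states.add(new State(c, i));                // new [c]* state starting here
        }
        for (State st : states)                         // states that survived to the end
            max = Math.max(max, s.length() - st.start);
        return max;
    }
}
```

Every position spawns a fresh single-character state, which is exactly the source of the state blowup discussed below.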

This executed in 23 ms, beating 56% of other solutions. Not bad, but then I realized that I create too many states. What helps is that most of them end up being removed pretty quickly, but it is still possible to create a lot of them. In fact, for a string consisting of a single repeating character, that would blow up into numerous states matching the same character, but starting at different positions. That is, of course, pointless.

But first I decided to get rid of the allocation overhead. Since the maximum number of states is limited by the size of the string, I could use simple arrays to store the states, and get rid of the separate class too. That led to this:
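A sketch of the array-based version (again a reconstruction, not the original code; it assumes the input contains no NUL characters, since 0 marks an empty second slot):

```java
class Solution {
    static int lengthOfLongestSubstringTwoDistinct(String s) {
        int n = s.length();
        // Parallel arrays replace the State objects: first/second hold the
        // character class (second == 0 means single-character), start the match start.
        char[] first = new char[n];
        char[] second = new char[n];
        int[] start = new int[n];
        int count = 0, max = 0;
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            for (int j = count - 1; j >= 0; j--) {
                if (c == first[j] || c == second[j]) continue;  // still matches
                if (second[j] == 0) { second[j] = c; continue; } // [x]* -> [xc]*
                max = Math.max(max, i - start[j]);               // state dies
                count--;                                         // swap-remove slot j
                first[j] = first[count];
                second[j] = second[count];
                start[j] = start[count];
            }
            first[count] = c; second[count] = 0; start[count] = i; count++;
        }
        for (int j = 0; j < count; j++)
            max = Math.max(max, n - start[j]);
        return max;
    }
}
```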

This gave me 11 ms and beat 78%. Then I thought about reducing the number of states. Two-character states are kind of hard to locate and compare. How would I know that I already have a [ca]* state before adding an [ac]* state? But for one-character states it is easy. Just don’t create a new state when you see the same character again:
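The change amounts to a single guard before the state-creation step; a sketch of the guarded version (a reconstruction along the lines described, not the original code):

```java
class Solution {
    static int lengthOfLongestSubstringTwoDistinct(String s) {
        int n = s.length();
        char[] first = new char[n];
        char[] second = new char[n];  // 0 means single-character class
        int[] start = new int[n];
        int count = 0, max = 0;
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            for (int j = count - 1; j >= 0; j--) {
                if (c == first[j] || c == second[j]) continue;
                if (second[j] == 0) { second[j] = c; continue; }
                max = Math.max(max, i - start[j]);
                count--;
                first[j] = first[count];
                second[j] = second[count];
                start[j] = start[count];
            }
            // The new, unpredictable branch: skip the duplicate [c]* state
            // when the previous character is the same.
            if (i == 0 || s.charAt(i - 1) != c) {
                first[count] = c; second[count] = 0; start[count] = i; count++;
            }
        }
        for (int j = 0; j < count; j++)
            max = Math.max(max, n - start[j]);
        return max;
    }
}
```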

But when I ran this I got 41 ms (32%)! Why?! The reason, I suspect, is branch prediction. Adding an additional unpredictable branch to the loop probably made tests containing a lot of short same-character substrings execute much slower. Unfortunately, LeetCode doesn’t publish its test cases, so there was no way to profile it. It did, however, execute much faster on a long string consisting of a single repeating character.

Then I decided to look up other solutions. One that I liked in particular is based on having just two indices. Constant memory and linear time. Not bad, eh?

That made me think about optimizing my solution further. I still had a feeling that the number of states could be greatly reduced, but how? And then I thought, how many states can my NFA be in at any given moment? Hey, looks like it’s only two! That’s right, at any given moment my NFA matches one single-character class (for the last character I saw) and possibly one two-character class (for the previous two distinct characters I saw). Looks like I don’t even need an array of states! And it also looks like I don’t have to check for the maximum each time I inspect a state. I only need to check when a state is about to be removed (no longer matches) or when I reach the end of the string.

This is the final solution (5 ms, 96%). And when I actually implemented it, I realized that it is very similar to that two-pointer solution. In fact, it is the same algorithm written in a slightly different way. But the way I arrived at it! Looking at this code, one would never see any NFAs in it, and yet the code is based on the same idea.
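A sketch of that final two-state solution (a reconstruction of the approach, not the exact submitted code): `last`/`oneStart` track the single-character state, `other`/`twoStart` the two-character state.

```java
class Solution {
    static int lengthOfLongestSubstringTwoDistinct(String s) {
        int n = s.length();
        if (n == 0) return 0;
        int max = 0;
        int twoStart = 0;          // where the current [xy]* match started
        int oneStart = 0;          // where the current run of `last` started
        char last = s.charAt(0);   // the single-character class
        char other = 0;            // the other char of the two-char class, 0 if none yet
        for (int i = 1; i < n; i++) {
            char c = s.charAt(i);
            if (c == last) continue;           // both states keep matching
            if (other != 0 && c != other) {
                // Third distinct character: the two-character state dies.
                max = Math.max(max, i - twoStart);
                twoStart = oneStart;           // new two-char match starts at the last run
            }
            other = last;                      // the class shifts to the two newest chars
            last = c;
            oneStart = i;
        }
        return Math.max(max, n - twoStart);
    }
}
```

Written this way, `twoStart` plays exactly the role of the left pointer in the two-pointer solution, which is why the two turn out to be the same algorithm.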

It can be made even faster by first converting the string into a character array, like that 100% guy did. But I don’t like this approach much because it essentially trades an infinitesimal performance boost for double the memory consumption. Another approach is to use reflection to access the “value” field of the string, but that is beyond good and evil—it won’t work on any Java implementation that happens to name the field differently. That kind of dirty hack should be reserved for desperate situations only.

# The Median of Two Sorted Arrays

It’s been a while since I created this site. I’ve been thinking about pulling some bits out of my older sites and blogs to put here before I actually continue on with this thing, but that proved to be rather tedious work. So after almost two years I figured that I’d better start a new life, so to speak, and be done with it.

So here is my first post in this new life. I have recently discovered the LeetCode site which is an awesome tool for getting ready for a software engineer interview. It’s not like I’m going to do one any time soon (although who knows?), but I felt that it’s a great chance to polish up my coding skills, so I went on.

At first I looked up solutions whenever I was unable to find one myself within a few hours. After doing that several times, though, I found it disappointing. While some of those problems require knowledge of arcane algorithms that people like Knuth spent decades developing, others only require hard thinking, and I felt like I was giving up too early. So I started thinking harder and stopped looking up solutions. In the worst case, when I felt that I needed a very specific algorithm to solve a part of the problem, I’d look up that algorithm alone and then think about how to apply it to the problem.

One of the problems I had to solve recently was this: find the median of two sorted arrays of integers. The median is defined as the middle element of the sorted array that would result from merging the two arrays into one. If that array has an even number of elements, then the median is the average of the two middle ones. The required time complexity was $O(\log (n+m))$, where n and m are the array sizes (it would be trivial to solve in linear time).

And this problem baffled me completely. It was pretty obvious that it doesn’t require any fancy algorithms. No knowledge of graphs and trees and whatnot would help me here. The logarithmic complexity requirement obviously meant some sort of binary search, but how would one go about binary searching two arrays?

My first thought was to take the medians of the two arrays and compare them. Never mind the odd/even size cases for now. If the medians are equal, then that is probably the answer. The idea is that if I were to merge the arrays, then half of the elements from the second array would go to the left part and half to the right part, so the median would remain the median. But what if they are not equal? Suppose the median of the second array is larger. That means the second array has more elements that are larger than the median of the first one. It follows that if I were to merge the second array into the first one, the median of the first one would shift left of the middle, so the real median is greater than or equal to it. By the same logic, if I were to merge the first array into the second one, it becomes clear that the real median must also be less than or equal to the median of the second array.

So far so good. That gets rid of the left part of the first array and the right part of the second array. If I am able to continue this process, then I’ll get exactly $O(\log(n+m))$ complexity. But continuing this process proved more complicated than it seemed, because it no longer makes sense to pick the medians of the remaining arrays. I’d get a wrong answer because once I have thrown away some elements, it is no longer the median that I’m looking for. At least not if the arrays are of different sizes. Now I understand that if I had continued with that approach, I would have gotten the right answer if I had been able to figure out what exactly I was looking for. And that is the kth smallest element. For the median, it’s roughly the $\frac{n+m}{2}$th element. And by throwing away some elements, I reduce the problem to finding the $(\frac{n+m}{2}-l)$th, where l is the number of elements I have thrown away. But at the time I wasn’t able to get a good grasp of the exact nature of the problem, so I got stuck.

The next idea was to use a binary search on the first array. Take the middle element, then check whether it is the median or not. How do I check? Well, the median needs to have the same number of elements (give or take one if the total length is even) less than or equal to it on its left as there are greater than or equal to it on its right. For any given element of the first array, I know how many elements are to its left and right within that array, but I need to find out how many there are in the second array. I can determine that by performing a binary search on the second array. If the would-be median is not there, then I find its insertion point, and I know exactly how many elements are less than it and how many are greater. If it is there, though, then I need to look to the left and to the right of it, because there may be many repeating elements equal to it. By repeating this process, I’d find the median if it is in the first array. If it’s not, then I need to repeat the whole thing for the second array. And for an even total length I need to do the same thing for the second median element. Since I am performing a binary search, and each step involves another binary search, the resulting complexity is $O(\log n \log m)$, which is not exactly what we want either.

And then I saw the light. When doing the binary search, I need not calculate how many elements less/greater than or equal to the would-be median there are in the second array! I already know how many there are in the first one, so I know exactly how many I am lacking. What I need to check is whether there is exactly the right amount in the second array. I can determine that instantly by looking at elements i and i-1 of the second array, where i is the number of lacking elements to the left of the would-be median. If the (i-1)th element is less than or equal to the would-be median and the ith element is greater than or equal to it, then it is the median. Note that these two conditions cannot both be false, because the array is sorted. If the first condition is false, it means that there are not enough elements in the second array, so the median must be somewhere to the right in the first array (if it’s there at all). If the second condition is false, it means that there are too many elements, and the median is to the left.

If the process finds the median, then all is good. If it doesn’t, then I now have an insertion point, that is, the place where the median would be if it were in the first array. But that means I know exactly how many elements less than the median there are in the first array! That gives me the location of the median in the second array instantly! Bingo!

Moreover, since I am performing a binary search on only one array, I might as well pick the smaller one. That gives $O(\log \min (m, n))$ complexity! Although it’s only about twice as fast as the required one if the sizes are of the same order of magnitude.

And to make things even better, I don’t need to repeat the process for even total lengths. If I find the first median element, then the second one must come right after it in one of the arrays. And since I know the position of the first median element in one array and the insertion point in the other, I can just check both and pick the minimum.

The full code for this solution is below. Somehow it beats less than 5% of other solutions on LeetCode, but that doesn’t mean much, for two reasons. The first is that people tend to take the best solutions published by other submitters and resubmit them as their own, simply to check whether those solutions are really that fast (there is no way to figure that out without submitting). That leads to large peaks at the best solutions. The other reason is that my solution runs in 8 ms, while the best ones run in 5 ms. A 3 ms difference is too small to measure precisely, and I don’t know what the margin of error is.
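Here is a sketch of such a solution, written in the standard partition formulation of the idea above (a reconstruction; the variable names are mine, not the original code). The check against `b[j-1]` and `b[j]` is exactly the two-condition test described earlier, and the search runs over the smaller array only:

```java
class Solution {
    static double findMedianSortedArrays(int[] a, int[] b) {
        if (a.length > b.length) { int[] t = a; a = b; b = t; }  // search the smaller array
        int n = a.length, m = b.length;
        int half = (n + m + 1) / 2;          // how many elements belong left of the median
        int lo = 0, hi = n;
        while (lo <= hi) {
            int i = (lo + hi) / 2;           // elements taken from a for the left part
            int j = half - i;                // elements still lacking, taken from b
            if (i < n && j > 0 && b[j - 1] > a[i]) {
                lo = i + 1;                  // not enough of a on the left: go right
            } else if (i > 0 && j < m && a[i - 1] > b[j]) {
                hi = i - 1;                  // too many of a on the left: go left
            } else {
                // Correct partition found: the median sits at the boundary.
                int left = (i == 0) ? b[j - 1]
                         : (j == 0) ? a[i - 1]
                         : Math.max(a[i - 1], b[j - 1]);
                if ((n + m) % 2 == 1) return left;
                int right = (i == n) ? b[j]
                          : (j == m) ? a[i]
                          : Math.min(a[i], b[j]);
                return (left + right) / 2.0; // even total: average the two middle elements
            }
        }
        throw new IllegalStateException();   // unreachable for valid sorted input
    }
}
```

Note how the even case falls out of the same partition: the second middle element is the smaller of the two candidates just right of the boundary, matching the “check both and pick the minimum” observation above.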