The Longest Substring with at Most Two Distinct Characters

Here’s another LeetCode problem. This one is also tagged “hard” for whatever reason, although it looks more like medium to me. I wouldn’t even be posting about it if it wasn’t for the way I arrived at the solution.

The problem is for the given string to find the length of the longest substring that contains at most two distinct characters. That is, for string like “abcdeef” that would be 3 and the possible substrings are “dee” or “eef”.

At first I thought about how the string can be represented as a graph of possible substrings. For the string “eceba”, the graph would look like this:

strings

While this looks like a possible answer because one only has to traverse the graph in the direction that leads to decrease in the number of distinct characters, there are some serious problems with this approach. First of all, it is not clear what the traversal algorithm would be. Which path to choose if none of them decrease the number of distinct characters? Another problem is to how to count them? A frequency table can be computed in linear time, but quickly updating it when moving through the graph is tricky if one was to follow multiple paths at the same time (breadth-first search). So even achieving quadratic time is difficult with this approach, and I had a feeling that this problem can be solved in linear time.

Then I thought about what exactly gave me that feeling. And I realized that the problem resembles regular expression matching a little bit. In fact, for the string “eceba” the problem could be stated as to find the longest possible match of the regular expression “[ec]*|[eb]*|[ea]*|[cb]*|[ca]*”. Regular expression matching can be done in linear time, using non-deterministic finite automata (NFA), for example, but on the other hand even creating such a regular expression would be tedious and pointless.

Then I thought, OK, do I even need to create all the states? No, I don’t. I can create them dynamically as I encounter characters. I start with a single character class, and if I see another character, I add another class for that character and then mutate the previous state into a two-character class state. If I encounter a third character for any state, that state stops matching and I can remove it (checking whether it has the maximum match length so far). This gave me the following solution:

This executed in 23 ms, beating 56% of other solutions. Not bad, but then I realized that I create too many states. What helps is that most of them end up being removed pretty quickly, but it is still possible to create a lot of them. In fact, for a string consisting of a single repeating character, that would blow up into numerous states matching the same character, but starting at different positions. That is, of course, pointless.

But first I decided to get rid of the allocation overhead. Since the maximum number of states is limited to the size of the string, I could use simple arrays to store the states. And get rid of a separate class too. That lead to this:

This gave me 11 ms and beat 78%. Then I thought about reducing the number of states. Two character states are kind of hard to locate and compare. How would I know that I already have a [ca]* state before adding an [ac]* state? But for one-character states that would be easy. Just don’t create a new state when you see the same character again:

But when I executed this I got 41 ms (32%)! Why?! The reason, I suspect, is the branch prediction. Adding additional unpredictable branch into the loop probably made tests containing a lot of short same-character substrings execute much slower. Unfortunately, LeetCode doesn’t publish test cases, so there was no way to profile it. It did, however, executed much faster on a long string consisting of only one repeating character.

Then I decided to look up other solutions. One that I liked in particular is based on having just two indices. Constant memory and linear time. Not bad, eh?

That made me think about optimizing my solution further. I still had a feeling that the number of states may be greatly reduced, but how? And then I thought, how many states can my NFA be in at any given moment? Hey, looks like it’s only two! That’s right, at any given moment my NFA matches one single character class (that is the last character I saw) and possibly one two-character class (the previous two distinct characters I saw). Looks like I don’t even need an array of states! And it also looks like I don’t have to check for maximum each time I inspect a state. I can only check when a state is about to be removed (no longer matches) or when I reach the end of the string.

This is the final solution (5 ms, 96%). And when I actually implemented it, I realized that it is very similar to that two-pointer solution. In fact, it is the same algorithm written in a slightly different way. But the way I arrived at it! Looking at this code, one would never see any NFAs in it, and yet the code is based on the same idea.

It can be made even faster by first converting the string into a character array, like that 100% guy did. But I don’t like this approach much because it essentially trades off infinitesimal performance boost for double memory consumption. Another approach is to use Reflection to access the “value” field of the string, but that is beyond good and evil—it won’t work for any Java implementation that happens to have the field named differently. That kind of dirty hack should be reserved for desperate situations only.

Leave a Reply