The Median of Two Sorted Arrays

It’s been a while since I created this site. I’ve been thinking about pulling some bits out of my older sites and blogs to put here before I actually continue on with this thing, but that proved to be rather tedious work. So after almost two years I figured that I’d better start a new life, so to speak, and be done with it.

So here is my first post in this new life. I have recently discovered the LeetCode site which is an awesome tool for getting ready for a software engineer interview. It’s not like I’m going to do one any time soon (although who knows?), but I felt that it’s a great chance to polish up my coding skills, so I went on.

At first I looked up solutions if I was unable to find one myself in a matter of a few hours. After doing that several times, though, I found it disappointing. While some of those problems required knowledge of arcane algorithms that people like Knuth spent tens of years to develop, some others only required hard thinking, and I felt like I was giving up too early. So I started looking harder for solutions and stopped looking them up. In the worst case, when I felt like I need a very specific algorithm to solve a part of the problem, I’d look up that very algorithm and then start to think how I would go about using it in this problem.

One of the problems I had to solve recently was this. Find the median of two sorted arrays of integers. The median is defined as the middle number of the array that would be the result of merging the two arrays together into one sorted array. And if this array would have even number of elements, then the median is the average of the two middle ones. The required time complexity was O(\log (n+m)), where n and are array sizes (it would be trivial to solve in linear time).

And this problem baffled me completely. It was pretty obvious that it doesn’t require any fancy algorithms. No knowledge of graphs and trees and whatnot would help me in this problem. The logarithmic complexity requirement obviously meant some sort of a binary search, but how would one go with binary searching two arrays?

My first though was to take medians of the two arrays and compare them. Never mind the odd/even size cases for now. If the medians are equal, then it is probably the answer. The idea is that if I was to merge the arrays, then half of the elements from the second array would go to the left part and half to the right part, so the median will remain the median. But what if they are not equal? Suppose the median of the second array is larger. That means that the second array has more elements that are larger than the median of the first one. It follows that if I was to merge the second array into the first one, then the median of the first one will shift left, so the real median is greater than or at least equal to it. By the same logic, if I was to merge the first array into the second one, it becomes clear that the median must also be less than or at least equal to the median of the second array.

So far so good. That gets rid of the left part of the first array and the right part of the second array. If I am able to continue this process, then I’ll get exactly log(n+m) complexity. But continuing this process proved more complicated than it seemed because it makes no sense any more to pick medians of the remaining arrays. I’d get a wrong answer because once I got rid of some elements, it’s not the median any more that I’m looking for. At least not if the arrays are of different sizes. Now I understand that if I continued with that approach, I’d get the right answer if I was able to figure out what I was looking for exactly. And that’s is the kth smallest element. For the median it’s roughly \frac{n+m}{2}th element. And by throwing away some elements, I reduce the problem to the problem of finding \frac{n+m}{2}-lth, where l is the number of elements I have thrown away. But at that time I wasn’t able to get a good grasp of the exact nature of the problem, so I got stuck.

The next idea was to use a binary search on the first array. Take the middle element, then check if it’s the median or not. How do I check? Well, the median needs to have exactly the same (plus-minus one if the length is even) amount of elements less than or equal to it to the left and greater than or equal to the right. For any given element of the second array, I know how many elements are to the left and to the right of it are there, but now I need to find out how many of them are in the second array. I can determine that by performing a binary search on the second array. If the would-be median is not there, then I find its insertion point and I know exactly how many elements less than and greater than are there. If it is there, though, then I need to look to the left and to the right of it because there may be many repeating elements that are equal to it. By repeating this process, I’d find the median if it is in the first array. If it’s not, then I need to repeat the whole thing for the second one. And for even sum of lengths I need to do the same thing for the second median element. Since I am performing a binary search, and on each step there is another binary search, the resulting complexity is O(\log n \log m), which is not exactly what we want either.

And then I saw the light. When doing the binary search, I need not calculate how many elements less/greater than or equal to the would-be median there are in the second array! I already know how many are there in the first one, so I know exactly how many I am lacking. What I need to check is whether there is exactly the right amount in the second array. I can determine that instantly by looking at the elements i and i-1 in the second array, where i is the number of the lacking elements to the left of the would-be median. If the i-1th element is less than or equal to the would-be median and the ith element is greater than or equal to it, then it is the median. Note that both of these conditions cannot be false because the array is sorted. If the first condition is false, it means that there are not enough elements in the second array, so the median must be somewhere to the right in the first array (if it’s there in the first place). If the second condition is false, it means that there are too many elements and the median is to the left.

If the process finds the median, then all is good. If it doesn’t then I now have an insertion point, that is, the place where it would be if it was there. But that means I know exactly how many elements less than median are there in the first array! That gives me the location of the median in the second array instantly! Bingo!

Moreover, since I am performing a binary search on only one array, I might as well just pick the smallest one. That gives O(\log \min (m, n)) complexity! Although it’s only about twice as faster than the required one if the sizes are of the same order.

And to make things even better, I don’t need to repeat the process twice for even sizes. If I find the first median, then the next one must be right after it in one of the arrays. And since I know the position of the first median in one array and the insertion point in another, I can just check both, and pick the minimum one.

The full code for this solution is below. Somehow it beats less than 5% of other solutions on LeetCode, but that doesn’t mean much for two reasons. First is that people tend to take the best solution published by other submitters and submit them as their own simply to check whether that solution is really that fast (there is no way to figure that out without submitting). That leads to large peaks at the best solutions. Another reason is that my solution is 8 ms, while the best ones are 5 ms. But 3 ms is very little difference to measure precisely and I don’t know exactly what the margin of error is.


Leave a Reply