Reputation: 555
I have been practicing algorithm questions for interviews and have stumbled upon various questions that relate to sorting data that comes from an infinite stream, and to designing a data structure to search billions of records:
Describe how to sort integers read one at a time from an infinite stream.
Search through a tremendous number of elements in a search space. E.g., you're asked to design a storage structure and search algorithm to search through 100 billion data records. You can have multiple servers and multiple threads.
Here is what I think; please correct me if I am wrong or if there is a better solution.
For sorting integers read one at a time from an infinite stream, could we use insertion sort? The worst case of insertion sort is O(n²) on an unsorted list, but here the list is kept sorted at all times, so we could find the insertion point for each new element with a binary search in O(log n). However, we would still need to shift all the items after that point to the right by one, which makes each insertion O(n) overall. I am still not sure if this is right though.
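A minimal sketch of that idea using Python's `bisect` module, assuming the stream is kept in a plain list: the search for the insertion point is O(log n), but the insert itself still shifts elements, so each step is O(n).

```python
import bisect

# Keep a list sorted as stream items arrive, one at a time.
# bisect_left finds the insertion point in O(log n), but list.insert
# must shift every later element, so each insertion is O(n) overall.
sorted_items = []
for value in [5, 1, 9, 3, 7]:
    pos = bisect.bisect_left(sorted_items, value)  # O(log n) search
    sorted_items.insert(pos, value)                # O(n) shift

print(sorted_items)  # [1, 3, 5, 7, 9]
```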
For the second question, we could use a balanced BST, which has a worst case of O(log N) for both insertion and search, or we could just use a HashMap, which would ideally perform lookup in O(1) and insertion in O(1). However, as we are dealing with 100 billion records, the worst-case lookup for a HashMap would be O(N) if collisions are chained with a linked list.
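A small sketch contrasting the two lookup strategies (using a sorted array with binary search as a stand-in for a balanced BST, since both give O(log N) search):

```python
import bisect

records = [17, 3, 42, 8, 25]

# Hash map: average O(1) lookup, but worst case O(N) if many keys
# collide into one bucket chained as a linked list.
index = {r: True for r in records}
in_hash = 42 in index

# Sorted array + binary search: guaranteed O(log N) lookup, the same
# bound a balanced BST gives for search and insert.
ordered = sorted(records)
pos = bisect.bisect_left(ordered, 42)
in_sorted = pos < len(ordered) and ordered[pos] == 42

print(in_hash, in_sorted)  # True True
```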
I still don't have a clear answer for these questions. If someone could provide some more insight, it would be great!
thanks!
Upvotes: 1
Views: 3774
Reputation: 133975
For sorting a very large amount of data, you typically do it in two steps: first, buffer the incoming data into fixed-size blocks and sort each block as it fills, writing the sorted blocks (often called "runs") to disk; then merge the sorted runs to produce the final order.
The buffering and sorting can be done in parallel if you have enough horsepower. As each block is received, you spin up a thread to sort it while the main thread continues to receive data in a new block. This isn't infinitely scalable, of course, because it's going to take much longer to sort a large buffer than it does to receive one. So you might have to write each block to disk as it's received, and have a fixed number of background threads that sort those blocks. The basic algorithm is the same, though ... just a little time delayed.
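The two-step scheme can be sketched like this, assuming a toy block size and Python's `heapq.merge` for the k-way merge of the sorted runs:

```python
import heapq
import tempfile

# Toy external sort: sort fixed-size blocks as they arrive, spill each
# sorted block ("run") to disk, then k-way merge the runs.
# BLOCK_SIZE is tiny for illustration; real systems use millions.
BLOCK_SIZE = 4
stream = [9, 4, 7, 1, 8, 2, 6, 3, 5]

run_files = []
for i in range(0, len(stream), BLOCK_SIZE):
    block = sorted(stream[i:i + BLOCK_SIZE])   # sort one block in memory
    f = tempfile.TemporaryFile("w+")
    f.write("\n".join(map(str, block)))
    f.seek(0)
    run_files.append(f)

# heapq.merge combines the already-sorted runs into one sorted sequence.
runs = [[int(line) for line in f.read().splitlines()] for f in run_files]
merged = list(heapq.merge(*runs))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

for f in run_files:
    f.close()
```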
If you can use multiple machines for searching, you'd typically spread the data among the many machines. So if you have 4 machines, each machine gets 1/4 of the data. When you want to do a search, you have each machine search its data set for matching records, and communicate those results to some central place, which sorts and removes duplicates.
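A scatter/gather sketch of that idea, with in-process lists standing in for machines and `shard_for` as a hypothetical partitioning function (real systems would issue parallel RPCs instead of a loop):

```python
# Spread records across shards; each shard plays the role of one machine.
NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]

def shard_for(key):
    # Hypothetical partitioner: hash the record to pick its shard.
    return hash(key) % NUM_SHARDS

for record in ["alice", "bob", "carol", "alice", "dave"]:
    shards[shard_for(record)].append(record)

def search(key):
    # Each "machine" scans only its own slice of the data; the
    # coordinator gathers the partial results, sorts, and dedupes.
    hits = []
    for shard in shards:
        hits.extend(r for r in shard if r == key)
    return sorted(set(hits))

print(search("alice"))  # ['alice']
```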
Now, if you want to maintain a sorted data structure from a potentially infinite stream (i.e. be able to search at any time while the data is being received), then you need something more dynamic. One simple way is to have your main sorted structure, and also your "not yet sorted" buffer. So, for example, say you have already received a billion items that you've already sorted and stored, and you have a buffer size of 1 million items. As data is received, you store a million items in memory before you merge them with the primary data structure.
When you receive a search query, you search the main structure, which would be O(log N) if you used a binary search, and then you search your receive buffer sequentially. Granted, the sequential search is a little slow, but all of that data is in memory, so you don't have to pay the cost of I/O.
When your buffer fills, you use an efficient algorithm to merge it with the stored structure.
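The main-structure-plus-buffer scheme described above can be sketched like this, assuming a sorted list as the main structure and a tiny buffer limit for illustration:

```python
import bisect

# Main sorted structure plus a small unsorted buffer of recent arrivals.
# Searches use binary search on the main list and a linear scan of the
# buffer; when the buffer fills, it is sorted and merged in, in O(N).
BUFFER_LIMIT = 3

main = []    # large, always sorted
buffer = []  # small, unsorted recent arrivals

def insert(value):
    buffer.append(value)
    if len(buffer) >= BUFFER_LIMIT:
        merged = []
        a, b = main, sorted(buffer)
        i = j = 0
        while i < len(a) and j < len(b):      # standard two-way merge
            if a[i] <= b[j]:
                merged.append(a[i]); i += 1
            else:
                merged.append(b[j]); j += 1
        merged.extend(a[i:]); merged.extend(b[j:])
        main[:] = merged
        buffer.clear()

def contains(value):
    pos = bisect.bisect_left(main, value)     # O(log N) in the main list
    if pos < len(main) and main[pos] == value:
        return True
    return value in buffer                    # O(buffer size) scan

for v in [6, 2, 9, 4, 1]:
    insert(v)
print(main, buffer)           # [2, 6, 9] [4, 1]
print(contains(4), contains(7))  # True False
```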
That's the basic idea. There are many ways to improve efficiency with multiple levels of merging, or by using a better data structure than a binary tree or similar.
Upvotes: 5