Kira
Kira

Reputation: 569

Sorting a 20GB file with one string per line

In question 11.5 of Gayle Laakman's book, Cracking the Technical Interview,

"Imagine you have a 20GB file with one string per line. Explain how you would sort the file"

My initial reaction was exactly the solution that she proposed - splitting the file into smaller chunks (megabytes) by reading in X mb's of data, sorting it, and then writing it to disk. And at the very end, merge the files.

I decided not to pursue this approach because the final merge would involve holding on to all the data in main memory - and we're assuming that's not possible. If that's the case, how exactly does this solution hold?

My other approach is based on the assumption that we have near unlimited disk space, or at least enough to hold 2X the data we already have. We can read in X mb's of data and then generate hash keys for them - each key corresponding to a line in a file. We'll continue doing this until all values have been hashed. Then we just have to write the values of that file into the original file.

Let me know what you think.

Upvotes: 7

Views: 3342

Answers (1)

Sahir Contractor
Sahir Contractor

Reputation: 115

http://en.wikipedia.org/wiki/External_sorting gives a more detailed explanation on how external sorting works. It addresses your concern of eventually having to bring the entire 20gB into memory by explaining how you perform the final merge of the N sorted chunks by reading in chunks of the sorted chunks as opposed to reading in all the sorted chunks at the same time.

Upvotes: 4

Related Questions