Sophie
Sophie

Reputation: 63

Sorting algorithm: Big text file with variable-length lines (comma-separated values)

What's a good algorithm for sorting text files that are larger than available memory (many 10s of gigabytes) and contain variable-length records? All the algorithms I've seen assume 1) data fits in memory, or 2) records are fixed-length. But imagine a big CSV file that I wanted to sort by the "BirthDate" field (the 4th field):

Id,UserId,Name,BirthDate
1,psmith,"Peter Smith","1984/01/01"
2,dmehta,"Divya Mehta","1985/11/23"
3,scohen,"Saul Cohen","1984/08/19"
...
99999999,swright,"Shaun Wright","1986/04/12"
100000000,amarkov,"Anya Markov","1984/10/31"

I know that:

  1. This would run on one machine (not distributed).
  2. The machine that I'd be running this on would have several processors.
  3. The files I'd be sorting could be larger than the physical memory of the machine.
  4. A file contains variable-length lines. Each line would consist of a fixed number of columns (delimiter-separated values). A file would be sorted by a specific field (ie. the 4th field in the file).
  5. An ideal solution would probably be "use this existing sort utility", but I'm looking for the best algorithm.
  6. I don't expect a fully-coded, working answer; something more along the lines of "check this out, here's kind of how it works, or here's why it works well for this problem." I just don't know where to look...
  7. This isn't homework!

Thanks! ♥

Upvotes: 6

Views: 1799

Answers (4)

A.D. Fundum
A.D. Fundum

Reputation: 1

No need to sort. Read the file ALL.CSV and append each read line to a file per day, like 19841231.CSV. For each existing day with data, in numerical order, read that CSV file and append those lines to a new file. Optimizations are possible by, for example, processing the original file more than once or by recording days actually occuring in the file ALL.CSV.

So a line containing "1985/02/28" should be added to the file 19850228.CSV. The file 19850228.CSV should be appended to NEW.CSV after the file 19850227.CSV was appended to NEW.CSV. The numerical order avoids the use of all sort algorithms, albeit it could torture the file system.

In reality the file ALL.CSV could be split in a file per, for example, year. 1984.CSV, 1985.CSV, and so on.

Upvotes: 0

Chris Walton
Chris Walton

Reputation: 2533

Suggest the following resources:

Merge Sort: http://en.wikipedia.org/wiki/Merge_sort

Seminumerical Algorithms, vol 2 of The Art of Computer Programming: Knuth: Addison Wesley:ISBN 0-201-03822-6(v.2)

Upvotes: 1

Maxim Razin
Maxim Razin

Reputation: 9466

A standard merge sort approach will work. The common schema is

  1. Split the file into N parts of roughly equal size
  2. Sort each part (in memory if it's small enough, otherwise recursively apply the same algorithm)
  3. Merge the sorted parts

Upvotes: 0

NPE
NPE

Reputation: 500713

This class of algorithms is called external sorting. I would start by checking out the Wikipedia entry. It contains some discussion and pointers.

Upvotes: 3

Related Questions