Ian Fiddes
Ian Fiddes

Reputation: 3011

Sorting .csv file by column title

Is there a way to sort a csv file by column header name (sort vertically) without loading the whole thing into memory? I tagged this as python because it is the language I am most familiar with, but any other way would be fine also. I am limited to doing this via commandline on a remote machine due to data protection rules.

Upvotes: 0

Views: 634

Answers (1)

abarnert
abarnert

Reputation: 366163

Any on-disk sorting algorithm is going to require more disk operations than just reading and writing once, and that I/O is likely to be your bottleneck. And it's going to more complicated as well. So, unless you really can't fit the file into memory, it will be a lot faster to do so, and a whole lot simpler.

But if you have to do this…

The standard on-disk sorting algorithm is a merge sort, similar to the familiar in-memory merge sort. It works like this:

Split the file into chunks that are big enough to fit into memory. You can do this iteratively/lazily, and easily: just read, say, 100MB at a time. Just make sure to rfind the last newline and hold everything after it over for the next chunk.

For each chunk, sort it in memory, and write the result to a temporary file. You can use the csv module, and the sort function with key=itemgetter(colnum).

If you have, say, 10 or fewer chunks, just open all of the temporary files and merge them. Again, you can use the csv module, and min with the same key or heapq.merge with equivalent decorate-sort-undecorate.

If you have 10-100 chunks, merge groups of 10 into larger temp files, then merge the larger ones in exactly the same way. With 100-1000, or 1000-10000, etc., just keep doing the same thing recursively.


If you have a simple CSV file with no quoting/escaping, and you have either ASCII data, ASCII-superset data that you want to sort asciibetically, or ASCII-superset data that you want to sort according to LC_COLLATE, the POSIX sort command does exactly what you're looking for, in the same way you'd probably build it yourself. Something like this:

sort -t, -k ${colnum},${colnum} -i infile.csv -o outfile.csv

If your data don't meet those requirements, you might be able to do a "decorate-sort-undecorate" three-pass solution. But at that point, it might be easier to switch to Python. Trying to figure out how to sed an arbitrary Excel CSV into something sort can handle and that can be reversed sounds like you'd waste more time debugging edge cases than you would writing the Python.

Upvotes: 3

Related Questions