Reputation: 4730
I have a large (2.2GB) delimited text file that holds chemical paths, which I search when I want to go from chemical A to chemical B. I'm wondering if anyone knows of a way (preferably in Python) to sort the file by the number of columns in each row.
Example:
CSV:
A B C D
E F G
H I
J K L M N
Should sort to:
H I
E F G
A B C D
J K L M N
I've been thinking of building a hashtable that maps row lengths to rows, but as the CSV files get larger (we're running longest path on a chemical network, and the 2.2GB file, roughly 30 million paths, only has rows of length <= 10), I anticipate this approach may not be the fastest.
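For what it's worth, a minimal in-memory sketch of that idea looks like the following (the filenames are hypothetical); it is only viable while the whole file fits in RAM, which a 2.2GB input may not:
with open('paths.txt') as fin:
    lines = fin.readlines()
# Sort by the number of whitespace-delimited columns in each row.
lines.sort(key=lambda line: len(line.split()))
with open('paths_sorted.txt', 'w') as fout:
    fout.writelines(lines)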
Upvotes: 1
Views: 343
Reputation: 142166
I'd go for splitting the rows into separate files based on row length, then joining them back together afterwards. Something like:
from tempfile import TemporaryFile
from itertools import chain
Keep a dict mapping row length -> output file. Where a file for that length is already open, write to it; otherwise create a new temporary file.
output = {}
with open('input') as fin:
    for line in fin:
        length = len(line.split())
        # 'w+' gives a text-mode temporary file that can be written now and read back later
        output.setdefault(length, TemporaryFile('w+')).write(line)
As Steven Rumbalski has pointed out, this can also be done with a defaultdict:
from collections import defaultdict
output = defaultdict(lambda: TemporaryFile('w+'))
...
output[length].write(line)
After writing, the temporary files will all be positioned at their ends. Reset them to the beginning so that reading through them yields the data again...
for fh in output.values():
    fh.seek(0)
Take the rows from each file in increasing order of length... and write them all to the final output file.
with open('output', 'w') as fout:
    fout.writelines(chain.from_iterable(v for k, v in sorted(output.items())))
Python should then clean up the temporary files upon program exit...
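For completeness, here is a sketch of the pieces above assembled into one Python 3 script (using the same placeholder filenames 'input' and 'output'):
from collections import defaultdict
from itertools import chain
from tempfile import TemporaryFile

# One text-mode temporary file per row length.
output = defaultdict(lambda: TemporaryFile('w+'))

with open('input') as fin:
    for line in fin:
        output[len(line.split())].write(line)

# Rewind every temporary file before reading it back.
for fh in output.values():
    fh.seek(0)

# Concatenate the groups in increasing order of row length.
with open('output', 'w') as fout:
    fout.writelines(chain.from_iterable(fh for _, fh in sorted(output.items())))

# Closing the temporaries deletes them immediately rather than at interpreter exit.
for fh in output.values():
    fh.close()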
Upvotes: 5