Reputation: 261
I have the following CSV file (each line has a dynamic number of characters, but the columns are fixed... hope I am making sense):
**001** Math **02/20/2013** A
**001** Literature **03/02/2013** B
**002** Biology **01/01/2013** A
**003** Biology **04/08/2013** A
**001** Biology **05/01/2013** B
**002** Math **03/10/2013** C
I am trying to get the results into another CSV file in the following format, where they are grouped by student ID and ordered by date in ascending order:
001,#Math;A;02/20/2013#Biology;B;05/01/2013#Literature;B;03/02/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
003,#Biology;A;04/08/2013
There is one constraint, though: the input file is huge, around 200 million rows. I tried using C#, storing the data in a DB, and writing a SQL query. It is very slow and not acceptable. After googling, I hear Python is very powerful for these operations. I am new to Python and have started playing with the code. I would really appreciate the Python gurus helping me get the results mentioned above.
Upvotes: 1
Views: 581
Reputation: 33827
content='''
**001** Math **02/20/2013** A
**001** Literature **03/02/2013** B
**002** Biology **01/01/2013** A
**003** Biology **04/08/2013** A
**001** Biology **05/01/2013** B
**002** Math **03/10/2013** C
'''
from collections import defaultdict

lines = content.split("\n")
# Generator: lazily split each non-empty line into its four fields
items_iter = (line.split() for line in lines if line.strip())

aggregated = defaultdict(list)
for items in items_iter:
    # Strip the surrounding '*' markers from every field
    stud, class_, date, grade = (t.strip('*') for t in items)
    aggregated[stud].append((class_, grade, date))

for stud, data in aggregated.items():
    full_grades = [';'.join(items) for items in data]
    print('{},#{}'.format(stud, '#'.join(full_grades)))
Output:
003,#Biology;A;04/08/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
001,#Math;A;02/20/2013#Literature;B;03/02/2013#Biology;B;05/01/2013
Of course, this is ugly, hackish code just to show how it can be done in Python. When working with large streams of data, use generators and iterators, and don't use file.readlines(); just iterate over the file object. Iterators do not read all the data at once; they read chunk by chunk as you iterate over them, and not earlier.
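As a minimal sketch of that idea, here is the same aggregation reading from a file instead of the hard-coded string, iterating over the file object line by line. The file names input.txt and output.csv are placeholders, and the per-student sort by date is the ascending order the question asks for, assuming the MM/DD/YYYY format shown:

from collections import defaultdict
from datetime import datetime

aggregated = defaultdict(list)

# Iterating over the file object reads it line by line, never all at once
with open('input.txt') as infile:
    for line in infile:
        if not line.strip():
            continue
        stud, class_, date, grade = (t.strip('*') for t in line.split())
        aggregated[stud].append((class_, grade, date))

with open('output.csv', 'w') as outfile:
    for stud in sorted(aggregated):
        # Sort each student's records by date, ascending (MM/DD/YYYY assumed)
        records = sorted(aggregated[stud],
                         key=lambda r: datetime.strptime(r[2], '%m/%d/%Y'))
        full_grades = ['#' + ';'.join(r) for r in records]
        outfile.write('{},{}\n'.format(stud, ''.join(full_grades)))

Note that the aggregated dict still keeps every student's records in memory at once; the bucketing suggestion below is how to avoid that for 200 million rows.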
If you are concerned whether 200M records will fit in memory, then do the following:
- sort the records into separate "buckets" (like in bucket sort) by student ID:
  cat all_records.txt | grep 001 > stud_001.txt  # do it for other students also
- do the processing per bucket
- merge
grep is just an example. Make a fancier script (awk, or also Python) that filters by student ID, for example all records with ID < 1000, then 1000 < ID < 2000, and so on (see the sketch below). You can do this safely because the records per student are disjoint.
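A minimal sketch of that bucketing step, assuming the student ID is the first field of every line and splitting into ranges of 1000 IDs per bucket; the file names all_records.txt and bucket_N.txt are placeholders:

# Partition the huge input into bucket files by student ID range,
# so each bucket is small enough to aggregate in memory on its own.
BUCKET_SIZE = 1000  # IDs 0-999 go to bucket 0, 1000-1999 to bucket 1, ...

bucket_files = {}
with open('all_records.txt') as infile:
    for line in infile:
        if not line.strip():
            continue
        stud = line.split()[0].strip('*')   # first field is the student ID
        bucket = int(stud) // BUCKET_SIZE
        if bucket not in bucket_files:
            bucket_files[bucket] = open('bucket_{}.txt'.format(bucket), 'w')
        bucket_files[bucket].write(line)

for f in bucket_files.values():
    f.close()

# Now run the aggregation code above on each bucket_*.txt separately
# and concatenate the per-bucket outputs to get the final file.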
Upvotes: 2