Think

Reputation: 261

How to Aggregate records in Python?

I have the following CSV file (each line has a dynamic number of characters, but the columns are fixed... hope I am making sense)

   **001**     Math        **02/20/2013**  A
   **001**     Literature  **03/02/2013**  B
   **002**     Biology     **01/01/2013**  A
   **003**     Biology     **04/08/2013**  A
   **001**     Biology     **05/01/2013**  B
   **002**     Math        **03/10/2013**  C

I am trying to get the results into another CSV file in the following format, grouped by student ID and ordered by date in ascending order.

   001,#Math;A;02/20/2013#Biology;B;05/01/2013#Literature;B;03/02/2013
   002,#Biology;A;01/01/2013#Math;C;03/10/2013
   003,#Biology;A;04/08/2013

There is one constraint, though. The input file is huge, around 200 million rows. I tried using C#, storing it in a DB, and writing a SQL query. It is very slow and not acceptable. After googling, I hear Python is very powerful for these operations. I am new to Python and have started playing with the code. I would really appreciate help from the Python gurus to get the results I mentioned above.

Upvotes: 1

Views: 581

Answers (1)

Jakub M.

Reputation: 33827

content = '''
   **001**     Math        **02/20/2013**  A
   **001**     Literature  **03/02/2013**  B
   **002**     Biology     **01/01/2013**  A
   **003**     Biology     **04/08/2013**  A
   **001**     Biology     **05/01/2013**  B
   **002**     Math        **03/10/2013**  C
'''

from collections import defaultdict

lines = content.split("\n")
# Generator: split each non-empty line into whitespace-separated fields
items_iter = (line.split() for line in lines if line.strip())

aggregated = defaultdict(list)

for items in items_iter:
    # Strip the literal '*' markers around the student ID and date
    stud, class_, date, grade = (t.strip('*') for t in items)
    aggregated[stud].append((class_, grade, date))

for stud, data in aggregated.items():
    full_grades = [';'.join(items) for items in data]
    print('{},#{}'.format(stud, '#'.join(full_grades)))

Output:

001,#Math;A;02/20/2013#Literature;B;03/02/2013#Biology;B;05/01/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
003,#Biology;A;04/08/2013

Of course, this is ugly, hackish code, just to show you how it can be done in Python. When working with large streams of data, use generators and iterators; don't use file.readlines(), just iterate. Iterators will not read all the data at once but read chunk by chunk as you iterate over them, and not earlier.
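Here is a minimal streaming sketch of the same idea, reading from a file instead of an in-memory string. The file names are made up for illustration, and the date-sorting step is my addition (assuming MM/DD/YYYY dates, as in the sample) to match the ascending-date order the question asks for. Note that the aggregated dict still holds all students in memory, which is what the bucketing below addresses:

from collections import defaultdict
from datetime import datetime

aggregated = defaultdict(list)

# Iterating over the file object reads line by line, never the whole file
with open('all_records.txt') as f:  # hypothetical input file name
    for line in f:
        if not line.strip():
            continue
        stud, class_, date, grade = (t.strip('*') for t in line.split())
        aggregated[stud].append((class_, grade, date))

with open('aggregated.csv', 'w') as out:  # hypothetical output file name
    for stud in sorted(aggregated):
        # Sort each student's records by parsed date, ascending
        records = sorted(aggregated[stud],
                         key=lambda r: datetime.strptime(r[2], '%m/%d/%Y'))
        full_grades = [';'.join(r) for r in records]
        out.write('{},#{}\n'.format(stud, '#'.join(full_grades)))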

If you are concerned whether 200M records fit in memory, then do the following:

  1. sort the records into separate "buckets" (as in bucket sort) by student ID

    cat all_records.txt | grep 001 > stud_001.txt # do it for the other students too

  2. do the processing per bucket

  3. merge

grep is just an example. Make a fancier script (awk, or Python again) that filters by student ID: for example, first all records with ID < 1000, then 1000 <= ID < 2000, and so on. You can do this safely because the records per student are disjoint.
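A rough sketch of such a bucketing pass, assuming the IDs are numeric; the bucket width and file names are made up for illustration. Each input row lands in exactly one bucket file, each bucket is then small enough to aggregate independently with the code above, and the per-bucket outputs can simply be concatenated for the merge step:

BUCKET_SIZE = 1000  # assumed bucket width; tune so one bucket fits in memory

buckets = {}  # one output file handle per bucket, opened lazily

with open('all_records.txt') as f:  # hypothetical input file name
    for line in f:
        if not line.strip():
            continue
        stud = line.split()[0].strip('*')  # e.g. '001'
        bucket_id = int(stud) // BUCKET_SIZE
        if bucket_id not in buckets:
            buckets[bucket_id] = open('bucket_{:04d}.txt'.format(bucket_id), 'w')
        buckets[bucket_id].write(line)

for handle in buckets.values():
    handle.close()

This works because the records per student are disjoint: no student ID ever spans two buckets.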

Upvotes: 2
