A. Sarid

Reputation: 3996

How to groupby and preserve groups order on sorted file

I have a large CSV file that is sorted by a few of its columns; let's call these columns sorted_columns.
I want to perform a groupby on these sorted_columns and apply some logic to each of these groups.

The file does not fit completely into memory, so I want to read it in chunks and perform a groupby on each chunk.

The thing I have noticed is that the order of the groups is not preserved even though the file is already sorted by these columns.

Eventually, this is what I am trying to do:

import pandas as pd

def run_logic(key, group):
    # some logic
    pass

last_group = pd.DataFrame()
last_key = None

for chunk_df in df:  # df is an iterator of chunks, e.g. from pd.read_csv(..., chunksize=...)
    grouped_by_df = chunk_df.groupby(sorted_columns, sort=True)

    for key, group in grouped_by_df:
        if last_key is None or last_key == key:
            last_key = key
            last_group = pd.concat([last_group, group])
        else:  # last_key != key
            run_logic(last_key, last_group)
            last_key = key
            last_group = group.copy()
run_logic(last_key, last_group)

But this does not work, because groupby does not guarantee that the order of the groups is preserved. If the same key exists in two consecutive chunks, there is no guarantee that it will be the last group of the first chunk and the first group of the next one. I tried changing the groupby to use sort=False, and I also tried changing the order of the columns, but it didn't help.

Does anyone have any idea of how to preserve the order of the groups if the keys are already sorted in the original file?

Is there any other way to read a complete group at once from the file?

Upvotes: 2

Views: 688

Answers (2)

Kristian

Reputation: 492

itertools.groupby

itertools.groupby will return the key and an iterator over all the values grouped by that key. If your file is already sorted by your desired key, you are good to go; the groupby function will handle almost everything for you.

From the documentation:

The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.
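
For example, a quick sketch of that behavior on toy data (not taken from the question's file):

from itertools import groupby

# groupby starts a new group every time the key changes, so the same key
# can appear more than once if the input is not sorted by that key
data = ["a", "a", "b", "a"]
print([(k, list(g)) for k, g in groupby(data)])
# [('a', ['a', 'a']), ('b', ['b']), ('a', ['a'])]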

run_logic is whatever business logic you want to apply to the group of records. This example simply counts the number of observations in the iterator.

data_iter emits one row of the CSV at a time. As long as your file is sorted by the desired fields, you do not need to read the entire file into memory.

chunks uses groupby to group the input iterator by the first 3 fields of each row. It yields the key and the corresponding iterator of values associated with that key.

#!/usr/bin/env python3

import csv
from itertools import groupby

def run_logic(key, group):
    cntr = 0
    for rec in group:
        cntr = cntr + 1
    return (key, cntr)


def data_iter(filename):
    with open(filename, "r") as fin:
        csvin = csv.reader(fin)
        for row in csvin:
            yield row


def chunks(diter):
    for chunk, iter_ in groupby(diter, key=lambda x: x[0:3]):
        yield (chunk, iter_)


if __name__ == "__main__":
    csviter = data_iter("test.csv")
    chunk_iter = chunks(csviter)
    for chunk, iter_ in chunk_iter:
        print(run_logic(chunk, iter_))

Input data

['1', '1', '1', 'a', 'a', 'a', 'a']  
['1', '1', '1', 'b', 'b', 'b', 'b']  
['1', '1', '1', 'c', 'c', 'c', 'c']  
['1', '1', '1', 'd', 'd', 'd', 'd']  
['1', '1', '1', 'e', 'e', 'e', 'e']  
['2', '1', '1', 'a', 'a', 'a', 'a']  
['2', '1', '1', 'd', 'd', 'd', 'd']  
['2', '1', '1', 'e', 'e', 'e', 'e']  
['2', '1', '1', 'b', 'b', 'b', 'b']  
['2', '1', '1', 'c', 'c', 'c', 'c']  
['3', '1', '1', 'e', 'e', 'e', 'e']  
['3', '1', '1', 'b', 'b', 'b', 'b']  
['3', '1', '1', 'c', 'c', 'c', 'c']  
['3', '1', '1', 'a', 'a', 'a', 'a']  
['3', '1', '1', 'd', 'd', 'd', 'd']

groupby data

Group: ['1', '1', '1']

['1', '1', '1', 'a', 'a', 'a', 'a']
['1', '1', '1', 'b', 'b', 'b', 'b']
['1', '1', '1', 'c', 'c', 'c', 'c']
['1', '1', '1', 'd', 'd', 'd', 'd']
['1', '1', '1', 'e', 'e', 'e', 'e']

Group: ['2', '1', '1']

['2', '1', '1', 'a', 'a', 'a', 'a']
['2', '1', '1', 'd', 'd', 'd', 'd']
['2', '1', '1', 'e', 'e', 'e', 'e']
['2', '1', '1', 'b', 'b', 'b', 'b']
['2', '1', '1', 'c', 'c', 'c', 'c']

Group: ['3', '1', '1']

['3', '1', '1', 'e', 'e', 'e', 'e']
['3', '1', '1', 'b', 'b', 'b', 'b']
['3', '1', '1', 'c', 'c', 'c', 'c']
['3', '1', '1', 'a', 'a', 'a', 'a']
['3', '1', '1', 'd', 'd', 'd', 'd']

Apply business logic

Group: ['1', '1', '1']

(['1', '1', '1'], 5)

Group: ['2', '1', '1']

(['2', '1', '1'], 5)

Group: ['3', '1', '1']

(['3', '1', '1'], 5)
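
If you still want to use pandas inside run_logic, a hedged sketch (assuming the same header-less test.csv and that any single group fits in memory) is to wrap each yielded group in a DataFrame:

import csv
from itertools import groupby

import pandas as pd

def run_logic(key, group_df):
    # whatever pandas logic you need on the per-group DataFrame
    print(key, len(group_df))

with open("test.csv", "r") as fin:
    # the file is sorted by the first 3 fields, so each group arrives contiguously
    for key, rows in groupby(csv.reader(fin), key=lambda row: row[0:3]):
        run_logic(key, pd.DataFrame(list(rows)))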

Upvotes: 4

Bernardo stearns reisen

Reputation: 2657

I believe that the essence of your problem is that you're trying to aggregate each group in a single pass over the dataframe. There's a tradeoff between how many groups you can fit in memory and how many times you need to read the dataframe.

NOTE: I am purposefully showing verbose code to convey the idea that it is necessary to iterate over the df many times. Both solutions got relatively complex, but they still achieve what is expected. There are many aspects of the code that can be improved; any help refactoring the code is appreciated.

I will use this dummy "data.csv" file to illustrate my solutions. If you save data.csv in the same directory as the script, you can just copy and paste the solutions and run them.

sorted1,sorted2,sorted3,othe1,other2,other3,other4 
1, 1, 1, 'a', 'a', 'a', 'a'  
1, 1, 1, 'a', 'a', 'a', 'a'
1, 1, 1, 'a', 'a', 'a', 'a'
1, 1, 1, 'a', 'a', 'a', 'a'  
2, 1, 1, 'a', 'a', 'a', 'a'  
2, 1, 1, 'd', 'd', 'd', 'd'
2, 1, 1, 'd', 'd', 'd', 'a'   
3, 1, 1, 'e', 'e', 'e', 'e'  
3, 1, 1, 'b', 'b', 'b', 'b'  

An initial solution for a scenario where we can store all the group keys:

Accumulate all the rows of a group first and process them afterwards.

Essentially, this is what I would do: in each pass over the df (in chunks), pick one group (or several, if memory allows). Check that it has not been processed yet by looking it up in a dictionary of processed group keys, then accumulate the selected group's rows chunk by chunk. When the iteration over all chunks is finished, process the group's data.
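
Boiled down, a compact sketch of that idea (same dummy data.csv and sorted_columns, without the logging; the full verbose version follows):

import pandas as pd

def run_logic(key, group):
    # some logic
    pass

def process_one_group(sorted_columns, already_processed):
    # one full pass over the file, accumulating the first key not processed yet
    accumulated = pd.DataFrame()
    chosen_key = None
    for chunk in pd.read_csv("data.csv", chunksize=3):
        for key, group in chunk.groupby(sorted_columns, sort=True):
            if key in already_processed:
                continue
            if chosen_key is None:
                chosen_key = key  # smallest unprocessed key, since the file is sorted
            if key == chosen_key:
                accumulated = pd.concat([accumulated, group])
    return chosen_key, accumulated

sorted_columns = ["sorted1", "sorted2", "sorted3"]
already_processed = set()
for _ in range(3):  # the dummy data.csv has 3 distinct group keys
    key, group = process_one_group(sorted_columns, already_processed)
    run_logic(key, group)
    already_processed.add(key)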

import pandas as pd

def run_logic(key, group):
    # some logic
    pass

def accumulate_nextGroup(alreadyProcessed_groups):
    past_accumulated_group = pd.DataFrame()
    pastChunk_groupKey = None
    for chunk_index, chunk_df in enumerate(pd.read_csv("data.csv", iterator=True, chunksize=3)):
        groupby_data = chunk_df.groupby(sorted_columns, sort=True)
        for currentChunk_groupKey, currentChunk_group in groupby_data:
            if (pastChunk_groupKey is None or pastChunk_groupKey == currentChunk_groupKey)\
                    and currentChunk_groupKey not in alreadyProcessed_groups.keys():
                pastChunk_groupKey = currentChunk_groupKey
                past_accumulated_group = pd.concat(
                        [past_accumulated_group, currentChunk_group]
                )
                print(f'I am the chosen group({currentChunk_groupKey}) of the moment in the chunk {chunk_index+1}')
            else:
                if currentChunk_groupKey in alreadyProcessed_groups:
                    print(f'group({currentChunk_groupKey}) is not the chosen group because it was already processed')
                else:
                    print(f'group({currentChunk_groupKey}) is not the chosen group({pastChunk_groupKey}) yet :(')
    return pastChunk_groupKey, past_accumulated_group

alreadyProcessed_groups = {}
sorted_columns = ["sorted1", "sorted2", "sorted3"]
number_of_unique_groups = 3  # distinct group keys in the dummy data.csv
for iteration_in_df in range(number_of_unique_groups):
    groupKey, groupData = accumulate_nextGroup(alreadyProcessed_groups)
    run_logic(groupKey, groupData)
    alreadyProcessed_groups[groupKey] = "Already Processed"
    print(alreadyProcessed_groups)
    print(f"end of {iteration_in_df+1} iterations in df")
    print("*"*50)

OUTPUT SOLUTION 1:

I am the chosen group((1, 1, 1)) of the moment in the chunk 1
I am the chosen group((1, 1, 1)) of the moment in the chunk 2
group((2, 1, 1)) is not the chosen group((1, 1, 1)) yet :(
group((2, 1, 1)) is not the chosen group((1, 1, 1)) yet :(
group((3, 1, 1)) is not the chosen group((1, 1, 1)) yet :(
{(1, 1, 1): 'Already Processed'}
end of 1 iterations in df
**************************************************
group((1, 1, 1)) is not the chosen group because it was already processed
group((1, 1, 1)) is not the chosen group because it was already processed
I am the chosen group((2, 1, 1)) of the moment in the chunk 2
I am the chosen group((2, 1, 1)) of the moment in the chunk 3
group((3, 1, 1)) is not the chosen group((2, 1, 1)) yet :(
{(1, 1, 1): 'Already Processed', (2, 1, 1): 'Already Processed'}
end of 2 iterations in df
**************************************************
group((1, 1, 1)) is not the chosen group because it was already processed
group((1, 1, 1)) is not the chosen group because it was already processed
group((2, 1, 1)) is not the chosen group because it was already processed
group((2, 1, 1)) is not the chosen group because it was already processed
I am the chosen group((3, 1, 1)) of the moment in the chunk 3
{(1, 1, 1): 'Already Processed', (2, 1, 1): 'Already Processed', (3, 1, 1): 'Already Processed'}
end of 3 iterations in df
**************************************************

UPDATE SOLUTION 2: for a scenario where we can't store all the group keys in a dictionary:

In the case where we can't store all the group keys in a dictionary, we need to use each group's relative index within each chunk to build a global reference index for each group. (Note that this solution is much more dense than the previous one.)

The main point of this solution is that we don't need the group key values to identify the groups. You can imagine each chunk as a node in a reversed linked list, where the first chunk points to null, the second chunk points to the first chunk, and so on... One iteration over the dataframe corresponds to one traversal of this linked list. For each step (processing a chunk), the only information you need to keep is the previous chunk's head, tail and size, and with only this information you can assign a unique index identifier to the group keys in any chunk.

Another important point is that, because the file is sorted, the reference index of the first group of a chunk will be either the index of the last group of the previous chunk, or that index + 1. This makes it possible to infer each group's global reference index chunk by chunk.
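
Boiled down, the rule from the previous paragraph could look like this tiny helper (hypothetical name, shown only to distill the idea; the full implementation follows):

def first_group_global_index(prev_last_index, prev_last_key, current_first_key):
    # global index of the first group of the current chunk
    if current_first_key == prev_last_key:
        return prev_last_index  # the group spills over the chunk boundary
    return prev_last_index + 1  # a brand new group starts in this chunk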

import pandas as pd
def run_logic(key, group):
    # some logic
    pass

def generate_currentChunkGroups_globalReferenceIdx(groupby_data,
        currentChunk_index, previousChunk_link):
    if currentChunk_index == 0:
        groupsIn_firstChunk=len(groupby_data.groups.keys())
        currentGroups_globalReferenceIdx = [(i,groupKey) 
                for i,(groupKey,_) in enumerate(groupby_data)]
    else:
        lastChunk_firstGroup, lastChunk_lastGroup, lastChunk_nGroups \
                = previousChunk_link 
        currentChunk_firstGroupKey = list(groupby_data.groups.keys())[0] 
        currentChunk_nGroups = len(groupby_data.groups.keys())

        lastChunk_lastGroupGlobalIdx, lastChunk_lastGroupKey \
                = lastChunk_lastGroup
        if currentChunk_firstGroupKey == lastChunk_lastGroupKey:
            currentChunk_firstGroupGlobalReferenceIdx =  lastChunk_lastGroupGlobalIdx
        else:
            currentChunk_firstGroupGlobalReferenceIdx =  lastChunk_lastGroupGlobalIdx + 1

        currentGroups_globalReferenceIdx = [
                (currentChunk_firstGroupGlobalReferenceIdx+i, groupKey)
                    for (i,groupKey) in enumerate(groupby_data.groups.keys())
                    ]

    next_previousChunk_link = (currentGroups_globalReferenceIdx[0],
            currentGroups_globalReferenceIdx[-1],
            len(currentGroups_globalReferenceIdx)
    )
    return currentGroups_globalReferenceIdx, next_previousChunk_link   

def accumulate_nextGroup(countOf_alreadyProcessedGroups, lastChunk_index, dataframe_accumulator):
    previousChunk_link = None
    currentIdx_beingProcessed = countOf_alreadyProcessedGroups
    for chunk_index, chunk_df in enumerate(pd.read_csv("data.csv",iterator=True, chunksize=3)):
        print(f'ITER:{iteration_in_df} CHUNK:{chunk_index} InfoPrevChunk:{previousChunk_link} lastProcessed_chunk:{lastChunk_index}')
        if (lastChunk_index !=  None) and (chunk_index < lastChunk_index):
            groupby_data = chunk_df.groupby(sorted_columns, sort=True) 
            currentChunkGroups_globalReferenceIdx, next_previousChunk_link \
                    = generate_currentChunkGroups_globalReferenceIdx(
                            groupby_data, chunk_index, previousChunk_link
                            )
        elif((lastChunk_index == None) or (chunk_index >= lastChunk_index)):
            if (chunk_index == lastChunk_index):
                groupby_data = chunk_df.groupby(sorted_columns, sort=True) 
                currentChunkGroups_globalReferenceIdx, next_previousChunk_link \
                        = generate_currentChunkGroups_globalReferenceIdx(
                                groupby_data, chunk_index, previousChunk_link
                                )
                currentChunkGroupGlobalIndexes = [GlobalIndex \
                        for (GlobalIndex,_) in currentChunkGroups_globalReferenceIdx]
                if((lastChunk_index is None) or (lastChunk_index <= chunk_index)):
                    lastChunk_index = chunk_index
                if currentIdx_beingProcessed in currentChunkGroupGlobalIndexes:
                    currentGroupKey_beingProcessed = [tup 
                            for tup in currentChunkGroups_globalReferenceIdx
                            if tup[0] == currentIdx_beingProcessed][0][1]
                    currentChunk_group = groupby_data.get_group(currentGroupKey_beingProcessed)
                    dataframe_accumulator = pd.concat(
                            [dataframe_accumulator, currentChunk_group]
                                                     )
            else: 
                groupby_data = chunk_df.groupby(sorted_columns, sort=True) 
                currentChunkGroups_globalReferenceIdx, next_previousChunk_link \
                        = generate_currentChunkGroups_globalReferenceIdx(
                                groupby_data, chunk_index, previousChunk_link
                                )
                currentChunkGroupGlobalIndexes = [GlobalIndex \
                        for (GlobalIndex,_) in currentChunkGroups_globalReferenceIdx]
                if((lastChunk_index is None) or (lastChunk_index <= chunk_index)):
                    lastChunk_index = chunk_index
                if currentIdx_beingProcessed in currentChunkGroupGlobalIndexes:
                    currentGroupKey_beingProcessed = [tup 
                            for tup in currentChunkGroups_globalReferenceIdx
                            if tup[0] == currentIdx_beingProcessed][0][1]
                    currentChunk_group = groupby_data.get_group(currentGroupKey_beingProcessed)
                    dataframe_accumulator = pd.concat(
                            [dataframe_accumulator, currentChunk_group]
                                                     )
                else:
                    countOf_alreadyProcessedGroups+=1
                    lastChunk_index = chunk_index-1
                    break
        previousChunk_link = next_previousChunk_link
    print(f'Done with chunks for group of global index:{currentIdx_beingProcessed} corresponding to groupKey:{currentGroupKey_beingProcessed}')
    return countOf_alreadyProcessedGroups, lastChunk_index, dataframe_accumulator, currentGroupKey_beingProcessed

sorted_columns = ["sorted1","sorted2","sorted3"]
number_of_unique_groups = 3  # distinct group keys in the dummy data.csv
lastChunk_index = None 
for iteration_in_df in range(number_of_unique_groups):  
    dataframe_accumulator = pd.DataFrame()
    countOf_alreadyProcessedGroups,lastChunk_index, group_data, currentGroupKey_Processed=\
            accumulate_nextGroup(
                    iteration_in_df, lastChunk_index, dataframe_accumulator
                                )
    run_logic(currentGroupKey_Processed, group_data)  # pass the accumulated group, not the (still empty) outer accumulator
    print(f"end of iteration number {iteration_in_df+1} in the df and processed {currentGroupKey_Processed}")
    print(group_data)
    print("*"*50)

OUTPUT SOLUTION 2:

ITER:0 CHUNK:0 InfoPrevChunk:None lastProcessed_chunk:None
ITER:0 CHUNK:1 InfoPrevChunk:((0, (1, 1, 1)), (0, (1, 1, 1)), 1) lastProcessed_chunk:0
ITER:0 CHUNK:2 InfoPrevChunk:((0, (1, 1, 1)), (1, (2, 1, 1)), 2) lastProcessed_chunk:1
Done with chunks for group of global index:0 corresponding to groupKey:(1, 1, 1)
end of iteration number 1 in the df and processed (1, 1, 1)
   sorted1  sorted2  sorted3 othe1 other2 other3 other4 
0        1        1        1   'a'    'a'    'a'   'a'  
1        1        1        1   'a'    'a'    'a'     'a'
2        1        1        1   'a'    'a'    'a'     'a'
3        1        1        1   'a'    'a'    'a'   'a'  
**************************************************
ITER:1 CHUNK:0 InfoPrevChunk:None lastProcessed_chunk:1
ITER:1 CHUNK:1 InfoPrevChunk:((0, (1, 1, 1)), (0, (1, 1, 1)), 1) lastProcessed_chunk:1
ITER:1 CHUNK:2 InfoPrevChunk:((0, (1, 1, 1)), (1, (2, 1, 1)), 2) lastProcessed_chunk:1
Done with chunks for group of global index:1 corresponding to groupKey:(2, 1, 1)
end of iteration number 2 in the df and processed (2, 1, 1)
   sorted1  sorted2  sorted3 othe1 other2 other3  other4 
4        2        1        1   'a'    'a'    'a'    'a'  
5        2        1        1   'd'    'd'    'd'   'd'   
6        2        1        1   'd'    'd'    'd'   'a'   
**************************************************
ITER:2 CHUNK:0 InfoPrevChunk:None lastProcessed_chunk:2
ITER:2 CHUNK:1 InfoPrevChunk:((0, (1, 1, 1)), (0, (1, 1, 1)), 1) lastProcessed_chunk:2
ITER:2 CHUNK:2 InfoPrevChunk:((0, (1, 1, 1)), (1, (2, 1, 1)), 2) lastProcessed_chunk:2
Done with chunks for group of global index:2 corresponding to groupKey:(3, 1, 1)
end of iteration number 3 in the df and processed (3, 1, 1)
   sorted1  sorted2  sorted3 othe1 other2 other3 other4 
7        3        1        1   'e'    'e'    'e'   'e'  
8        3        1        1   'b'    'b'    'b'    'b' 
**************************************************

Upvotes: 7
