anurag

Reputation: 1932

Python - Pandas: perform column value based data grouping across separate dataframe chunks

I was handling a large CSV file and came across this problem: I am reading the file in chunks and want to extract sub-dataframes based on the values of a particular column.

To explain the problem, here is a minimal version:

The CSV (save it as test1.csv, for example)

1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24

Now, as you can see, if I read the CSV in chunks of 5 rows, the first column's values will be split across chunks. What I want is to load into memory only the rows for a particular value.
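To illustrate the goal: if the whole file fit in memory, this would just be a plain groupby on the id column (shown here on the sample data, with io.StringIO standing in for the file; the point of the question is to get the same grouping without loading everything):

```python
import io
import pandas as pd

# Stand-in for test1.csv (same sample data as above)
csv_text = """1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24"""

df = pd.read_csv(io.StringIO(csv_text), names=['id', 'val'])

# groupby yields one sub-dataframe per unique id
group_sizes = {id_value: len(sub_df) for id_value, sub_df in df.groupby('id')}
print(group_sizes)  # {1: 3, 2: 4, 3: 4, 4: 4}
```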

I achieved it using the following:

import pandas as pd

list_of_ids = dict()  # this will contain all "id"s and the start and end row index for each id

# read the csv in chunks of 5 rows
for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id','val'], iterator=True):
    #print(df_chunk)

    # In each chunk, get the unique id values and add to the list
    for i in df_chunk['id'].unique().tolist():
        if i not in list_of_ids:
            list_of_ids[i] = []  # initially new values do not have the start and end row index

    for i in list_of_ids.keys():        # ---------MARKER 1-----------
        idx = df_chunk[df_chunk['id'] == i].index    # get row index for particular value of id
        
        if len(idx) != 0:     # if id is in this chunk
            if len(list_of_ids[i]) == 0:      # if the id is new in the final dictionary
                list_of_ids[i].append(idx.tolist()[0])     # start
                list_of_ids[i].append(idx.tolist()[-1])    # end
            else:                             # if the id was there in previous chunk
                list_of_ids[i] = [list_of_ids[i][0], idx.tolist()[-1]]    # keep old start, add new end
            
            #print(df_chunk.iloc[idx, :])
            #print(df_chunk.iloc[list_of_ids[i][0]:list_of_ids[i][-1], :])

print(list_of_ids)

skip = None
rows = None

# Now from the file, I will read only particular id group using following
#      I can again use chunksize argument to read the particular group in pieces
for id, se in list_of_ids.items():
    print('Data for id: {}'.format(id))
    skip, rows = se[0], (se[-1] - se[0]+1)
    for df_chunk in pd.read_csv('test1.csv', chunksize=2, nrows=rows, skiprows=skip, names=['id','val'], iterator=True):
        print(df_chunk)

Truncated output from my code:

{1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
Data for id: 1
   id  val
0   1   10
1   1   11
   id  val
2   1   12
Data for id: 2
   id  val
0   2   13
1   2   14
   id  val
2   2   15
3   2   16
Data for id: 3
   id  val
0   3   17
1   3   18

What I want to ask is: is there a better way of doing this? Consider MARKER 1 in the code; it is bound to become inefficient as the number of ids grows, since every chunk is checked against every id seen so far. I did save on memory usage, but time still remains a problem. Is there an existing method for this?
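For what it's worth, a sketch of one way to avoid the MARKER 1 scan over all known ids: within each chunk, pandas' own groupby yields only the ids actually present, and read_csv preserves the global row index across chunks (sketch on the sample data, with io.StringIO standing in for the file):

```python
import io
import pandas as pd

csv_text = """1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24"""

list_of_ids = {}
for df_chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5, names=['id', 'val']):
    # pandas keeps the global row index across chunks, so g.index is file-wide
    for i, g in df_chunk.groupby('id'):
        start, end = g.index[0], g.index[-1]
        if i in list_of_ids:
            list_of_ids[i][1] = end  # id continues from an earlier chunk: extend the end
        else:
            list_of_ids[i] = [start, end]

print(list_of_ids)  # {1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
```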

(I am looking for complete code in the answer.)

Upvotes: 0

Views: 646

Answers (1)

Dani Mesejo

Reputation: 61910

I suggest you use itertools for this, as follows:

import pandas as pd
import csv
import io

from itertools import groupby, islice
from operator import itemgetter


def chunker(n, iterable):
    """
    From answer: https://stackoverflow.com/a/31185097/4001592
    >>> list(chunker(3, 'ABCDEFG'))
    [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    """
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])


chunk_size = 5
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    for _, group in groupby(reader, itemgetter(0)):
        for chunk in chunker(chunk_size, group):
            g = [','.join(e) for e in chunk]
            df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
            print(df)
            print('---')

Output (partial)

   0   1
0  1  10
1  1  11
2  1  12
---
   0   1
0  2  13
1  2  14
2  2  15
3  2  16
---
   0   1
0  3  17
1  3  18
2  3  19
3  3  20
---
...

This approach first reads the rows in groups keyed on the first column:

for _, group in groupby(reader, itemgetter(0)):
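One caveat worth noting (not stated in the original answer): itertools.groupby only groups consecutive rows, so this relies on the file already being sorted by the id column, as the sample data is. A tiny illustration:

```python
from itertools import groupby
from operator import itemgetter

# rows NOT fully sorted by the first column
rows = [['1', '10'], ['1', '11'], ['2', '13'], ['1', '12']]

# groupby splits on consecutive runs only: the trailing '1' row forms its own group
keys = [k for k, _ in groupby(rows, itemgetter(0))]
print(keys)  # ['1', '2', '1']
```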

and each group is then read in chunks of 5 rows (this can be changed via chunk_size):

for chunk in chunker(chunk_size, group):

The last part:

g = [','.join(e) for e in chunk]
df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
print(df)
print('---')

creates a suitable string to be passed to pandas.
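As a possible variation (a sketch, not part of the original answer): since chunk is already a list of parsed rows from csv.reader, the string round-trip through io.StringIO can be skipped by building the DataFrame directly:

```python
import pandas as pd

# rows as csv.reader would yield them (lists of strings)
chunk = [['1', '10'], ['1', '11'], ['1', '12']]

# build the frame directly and convert the string cells to integers
df = pd.DataFrame(chunk, columns=['id', 'val']).astype(int)
print(df)
```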

Upvotes: 1
