hz2021

Reputation: 41

Count duplicate rows with pandas in a very big CSV file

I'm trying to count the duplicate rows in a CSV file. An example looks like the following:

head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1

I want the duplicate rows (on the head and tail columns) to be grouped and their counts added together.

The result should look like the following:

head;tail;count
134;135;3
134;136;2
134;137;2

Another problem is that the CSV file is very big (60 GB) and I only have 64 GB of RAM. If I set chunksize to some number and iterate like this:

for df in pd.read_csv("*.csv", sep=";", chunksize=100000):
    # do the duplicate count

then the duplicate count is only done within that chunk, not globally.

So what I actually want is to do the count over the whole file, but the file is too big to load at once.
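
Roughly what I have in mind is the sketch below (the file name and the chunksize are just placeholders): aggregate each chunk first, keep the small partial results, and combine them at the end. I'm not sure this is the right approach for a 60 GB file.

import pandas as pd

partial = []
for df in pd.read_csv("big_file.csv", sep=";", chunksize=100000):
    # aggregate within the chunk so only small partial results stay in memory
    partial.append(df.groupby(["head", "tail"], as_index=False)["count"].sum())

# combine the per-chunk aggregates into one global result
result = pd.concat(partial).groupby(["head", "tail"], as_index=False)["count"].sum()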

Thanks

hz

Upvotes: 0

Views: 673

Answers (2)

Corralien

Reputation: 120489

Use Counter from the collections module:

Input data:

>>> %cat data.csv
head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1

from collections import Counter
import pandas as pd

c = Counter()
for df in pd.read_csv('data.csv', sep=';', chunksize=2):
    # aggregate each chunk on (head, tail), then merge the partial sums into the global counter
    c.update(df.groupby(['head', 'tail'])['count'].sum().to_dict())

Output result:

>>> c
Counter({(134, 135): 3, (134, 136): 2, (134, 137): 2})

Convert the Counter to a DataFrame:

df = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
mi = pd.MultiIndex.from_tuples(df.index, names=['head', 'tail'])
df = df.set_index(mi).reset_index()
>>> df
   head  tail  count
0   134   135      3
1   134   136      2
2   134   137      2
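
For the real 60 GB file, the same pattern should scale with a much larger chunksize. A rough sketch (the file names and the chunksize are placeholders to adapt to your setup), writing the final result back to disk:

import pandas as pd
from collections import Counter

c = Counter()
# use a chunksize that comfortably fits in your 64 GB of RAM
for df in pd.read_csv('big_file.csv', sep=';', chunksize=10_000_000):
    c.update(df.groupby(['head', 'tail'])['count'].sum().to_dict())

out = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
out = out.set_index(pd.MultiIndex.from_tuples(out.index, names=['head', 'tail'])).reset_index()
out.to_csv('counted.csv', sep=';', index=False)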

Upvotes: 2

Pedro Holanda

Reputation: 301

One possibility would be to use DuckDB to perform the grouped count directly on the CSV file and then export the result to a pandas DataFrame.

DuckDB is a vectorized, state-of-the-art DBMS for analytics and can run queries directly on the CSV file. It is also tightly integrated with pandas, so you can easily import/export data to DataFrames.

To install DuckDB, you can simply run pip install duckdb.

The following code should work for your purposes:

import duckdb

# temp_file_name is the path to your (large) CSV file; header and column types are detected automatically
rel = duckdb.from_csv_auto(temp_file_name)

# group on (head, tail) and add the counts together
count_query = '''SELECT head, tail, SUM("count") AS "count"
FROM my_name_for_rel
GROUP BY head, tail'''

res = rel.query('my_name_for_rel', count_query)
data_frame = res.df()
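
As an aside on the pandas integration: DuckDB can also run SQL directly against a DataFrame that is already in memory, referring to it by its Python variable name. A minimal sketch with made-up sample data:

import duckdb
import pandas as pd

# small in-memory sample; DuckDB resolves the local variable `sample` by name
sample = pd.DataFrame({'head': [134, 134, 134], 'tail': [135, 135, 136], 'count': [1, 2, 1]})
result = duckdb.query('SELECT head, tail, SUM("count") AS "count" FROM sample GROUP BY head, tail').df()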

Upvotes: 1
