hz2021

Reputation: 41

Count duplicate rows with pandas in a very big CSV file

I'm trying to count the duplicate rows in a CSV file. An example looks like the following:

head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1

I want the duplicate rows (on the head and tail columns) to be grouped and their counts added together.

The result should look like the following:

head;tail;count
134;135;3
134;136;2
134;137;2

Another problem is that the CSV file is very big (60 GB) and I only have 64 GB of RAM. If I set chunksize to some number and iterate like this:

for df in pd.read_csv("*.csv", sep=";", chunksize=100000):
    # do the duplicate count

then the duplicate count is only done within that chunk, not globally.

So what I actually want is to do the count over the whole file, but the file is too big to load at once.
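
Roughly what I have in mind is the sketch below (the file name and the chunksize are just placeholders): aggregate each chunk first, keep the small partial results, and combine them at the end. I'm not sure this is the right approach for a 60 GB file.

import pandas as pd

partial = []
for df in pd.read_csv("big_file.csv", sep=";", chunksize=100000):
    # aggregate within the chunk so only small partial results stay in memory
    partial.append(df.groupby(["head", "tail"], as_index=False)["count"].sum())

# combine the per-chunk aggregates into one global result
result = pd.concat(partial).groupby(["head", "tail"], as_index=False)["count"].sum()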

Thanks

hz

Upvotes: 0

Views: 673

Answers (2)

Corralien

Reputation: 120489

Use Counter from the collections module:

Input data:

>>> %cat data.csv
head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1

from collections import Counter
import pandas as pd

c = Counter()
for df in pd.read_csv('data.csv', sep=';', chunksize=2):
    # aggregate each chunk on (head, tail), then merge the partial sums into the global counter
    c.update(df.groupby(['head', 'tail'])['count'].sum().to_dict())

Output result:

>>> c
Counter({(134, 135): 3, (134, 136): 2, (134, 137): 2})

Convert the Counter to a DataFrame:

df = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
mi = pd.MultiIndex.from_tuples(df.index, names=['head', 'tail'])
df = df.set_index(mi).reset_index()
>>> df
   head  tail  count
0   134   135      3
1   134   136      2
2   134   137      2
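
For the real 60 GB file, the same pattern should scale with a much larger chunksize. A rough sketch (the file names and the chunksize are placeholders to adapt to your setup), writing the final result back to disk:

import pandas as pd
from collections import Counter

c = Counter()
# use a chunksize that comfortably fits in your 64 GB of RAM
for df in pd.read_csv('big_file.csv', sep=';', chunksize=10_000_000):
    c.update(df.groupby(['head', 'tail'])['count'].sum().to_dict())

out = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
out = out.set_index(pd.MultiIndex.from_tuples(out.index, names=['head', 'tail'])).reset_index()
out.to_csv('counted.csv', sep=';', index=False)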

Upvotes: 2

Pedro Holanda

Reputation: 301

One possibility would be to use DuckDB to perform the grouped count directly on the CSV file and then export the result to a pandas DataFrame.

DuckDB is a vectorized, state-of-the-art DBMS for analytics and can run queries directly on the CSV file. It is also tightly integrated with pandas, so you can easily import/export data to DataFrames.

To install DuckDB, you can simply run pip install duckdb.

The following code should work for your purposes:

import duckdb

# temp_file_name is the path to your (large) CSV file; header and column types are detected automatically
rel = duckdb.from_csv_auto(temp_file_name)

# group on (head, tail) and add the counts together
count_query = '''SELECT head, tail, SUM("count") AS "count"
FROM my_name_for_rel
GROUP BY head, tail'''

res = rel.query('my_name_for_rel', count_query)
data_frame = res.df()
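
As an aside on the pandas integration: DuckDB can also run SQL directly against a DataFrame that is already in memory, referring to it by its Python variable name. A minimal sketch with made-up sample data:

import duckdb
import pandas as pd

# small in-memory sample; DuckDB resolves the local variable `sample` by name
sample = pd.DataFrame({'head': [134, 134, 134], 'tail': [135, 135, 136], 'count': [1, 2, 1]})
result = duckdb.query('SELECT head, tail, SUM("count") AS "count" FROM sample GROUP BY head, tail').df()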

Upvotes: 1
