Reputation: 41
I'm tyring to count the duplciate rows in a csv file. An example look like the following
head tail count
134; 135; 1
134; 136; 1
134; 137; 2
134; 135; 2
134; 136; 1
and want the duplicate rows (on head and tail columns) to be count and add the count together.
result looks like the following
head tail count
134; 135; 3
134; 136; 2
134; 137; 2
Another problem is that the csv file is super big (60GB), RAM is 64G btw, if set the chunksize to some number and do the iteration like:
for df in pd.read("*.csv", sep = ";",chunksize = 100000):
do the duplicate count
the count process will only be done in that part of df and not globally.
So what we want is actually to do the count in the whole file, but the file is too big.
Thanks
hz
Upvotes: 0
Views: 673
Reputation: 120489
Use Counter
from collections
module:
Input data:
>>> %cat data.csv
head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1
from collections import Counter
for df in pd.read_csv(io.StringIO(text), sep=';', chunksize=2):
c.update(df.groupby(['head', 'tail'])['count'].sum().to_dict())
Output result:
>>> c
Counter({(134, 135): 3, (134, 136): 2, (134, 137): 2})
Convert the Counter to a DataFrame:
df = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
mi = pd.MultiIndex.from_tuples(df.index, names=['head', 'tail'])
df = df.set_index(mi).reset_index()
>>> df
head tail count
0 134 135 3
1 134 136 2
2 134 137 2
Upvotes: 2
Reputation: 301
One possibility would be to use DuckDB to perform the distinct count and then export the result to a pandas dataframe.
Duckdb is a vectorized state-of-the-art DBMS for analytics and can run queries directly on the CSV file. It is also tightly integrated with Pandas so you can easily import/export data to dataframes.
To install DuckDB you can simply run a pip install duckdb
The following code should work for your purposes:
import duckdb
rel = duckdb.from_csv_auto(temp_file_name)
count_query = ''' SELECT column0, COUNT(column0)
FROM my_name_for_rel
GROUP BY column0
HAVING COUNT(column0)>1'''
res = rel.query('my_name_for_rel', count_query)
data_frame = res.df()
Upvotes: 1