Donbeo

Reputation: 17617

pandas find number of unique elements in each column of huge csv

I have a huge csv file, around 10 GB, and I get an error if I try to load it into memory.

I need to compute the number of unique elements for each column of the dataframe. How can I do that?

Upvotes: 1

Views: 1362

Answers (1)

EdChum

Reputation: 394209

You could load each column in turn and then call .nunique:

In [227]:

import io
import pandas as pd

t="""a,b,c
0,1,1
0,2,1
1,3,1
2,4,1
3,5,6"""
# read just the header row to get the column names
cols = pd.read_csv(io.StringIO(t), nrows=1).columns

d = {}
for col in cols:
    # usecols expects a list-like, so wrap the single column name
    df = pd.read_csv(io.StringIO(t), usecols=[col])
    d[col] = df[col].nunique()
d
Out[227]:
{'a': 4, 'b': 5, 'c': 2}

This should then generate a dict of the number of unique values for each column.

This assumes that you can handle loading a single column at a time from your 10 GB file.

Upvotes: 1
