Reputation: 17617
I have a huge CSV file, around 10 GB, and I get an error if I try to load it into memory.
I need to compute the number of unique elements for each column of the dataframe. How can I do that?
Upvotes: 1
Views: 1362
Reputation: 394209
You could load each column in turn and then call .nunique:
In [227]:
import io
import pandas as pd

t = """a,b,c
0,1,1
0,2,1
1,3,1
2,4,1
3,5,6"""
# read just the header row to get the column names
cols = pd.read_csv(io.StringIO(t), nrows=1).columns
d = {}
for col in cols:
    # load a single column at a time and count its unique values
    df = pd.read_csv(io.StringIO(t), usecols=[col])
    d[col] = df[col].nunique()
d
Out[227]:
{'a': 4, 'b': 5, 'c': 2}
This should generate a dict of the number of unique values for each column.
This assumes you can handle loading a single column at a time from your 10 GB file; if even that is too much, you can stream each column in chunks, as sketched below.
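A chunked variant (a sketch; the file path and chunk size are placeholders for your setup) accumulates each column's distinct values in a set so that only one chunk of one column is in memory at any time:

import pandas as pd

path = "data.csv"  # placeholder path to your 10 GB file

cols = pd.read_csv(path, nrows=1).columns
d = {}
for col in cols:
    uniques = set()
    # stream the single column in manageable pieces
    for chunk in pd.read_csv(path, usecols=[col], chunksize=10**6):
        uniques.update(chunk[col].unique())
    d[col] = len(uniques)

Note that this still keeps all distinct values of one column in memory at once, which is fine unless a column is almost entirely unique.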
Upvotes: 1