sweeeeeet

Reputation: 1819

extract basic count per values using the chunksize parameter in pandas

I have a CSV file with four columns, item1,item2,item3,item4, whose values are each one of the following: 0,1,2,3,4. I would like to count, for each column, how many rows there are for each value. My code is the following, df being the corresponding DataFrame:

import pandas

outputDf = pandas.DataFrame()
cat_list = list(df.columns.values)
for col in cat_list:
    s = df.groupby(col).size()
    outputDf[col] = s

I would like to do exactly the same using the chunksize parameter of read_csv, because my CSV is very big. My problem is: I can't find a way to build cat_list, nor to assemble outputDf across the chunks.

Can someone give me a hint?

Upvotes: 2

Views: 564

Answers (1)

DSM

Reputation: 353329

I'd apply value_counts columnwise rather than doing groupby:

>>> df = pd.read_csv("basic.csv", usecols=["item1", "item2", "item3", "item4"])
>>> df.apply(pd.value_counts)
   item1  item2  item3  item4
0     17     26     17     20
1     21     21     22     19
2     17     18     22     23
3     24     14     20     24
4     21     21     19     14

And for the chunked version, we just need to assemble the parts (making sure to fillna(0), so that if a chunk doesn't contain a 3, for example, we get 0 and not NaN):

>>> df_iter = pd.read_csv("basic.csv", usecols=["item1", "item2", "item3", "item4"], chunksize=10)
>>> sum(c.apply(pd.value_counts).fillna(0) for c in df_iter)
   item1  item2  item3  item4
0     17     26     17     20
1     21     21     22     19
2     17     18     22     23
3     24     14     20     24
4     21     21     19     14

(Of course, in practice you'd probably want to use as large a chunksize as you can get away with.)
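One caveat with summing the parts via `+`: if a value shows up in some chunks but not others, the per-chunk `fillna(0)` doesn't help, because the missing value is absent from that chunk's index entirely, and plain addition aligns the indexes and turns those entries into NaN. A more robust variant, sketched below with a small in-memory CSV standing in for the data (the file contents are made up), is to combine the parts with `DataFrame.add(fill_value=0)`; it also uses `Series.value_counts` per column, which sidesteps the top-level `pd.value_counts` that is deprecated in recent pandas versions:

```python
import io
from functools import reduce

import pandas as pd

# Hypothetical stand-in for "basic.csv"; note that the value 4
# appears only in the last chunk.
csv_data = "item1,item2\n0,1\n0,2\n1,1\n3,1\n3,4\n"

# Count values per column, chunk by chunk.
df_iter = pd.read_csv(io.StringIO(csv_data), chunksize=2)
parts = (chunk.apply(lambda s: s.value_counts()).fillna(0) for chunk in df_iter)

# add(fill_value=0) treats an index entry missing from one side as 0,
# so values absent from some chunks still get correct totals.
counts = reduce(lambda a, b: a.add(b, fill_value=0), parts).astype(int)
print(counts)
```

With the answer's example data every chunk of 10 rows happens to contain all five values, which is why the simpler `sum(...)` version produced the right numbers there.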

Upvotes: 3
