Jayesh Menghani

Reputation: 17

Getting the frequency of categorical values in get dummies in pandas

I am implementing one-hot encoding on the following data:

Version  Cluster_Size     Hardware_type
1.0.4     3              Alpha,Alpha,Alpha
1.0.2     3              Alpha,Beta,Alpha
1.0.9     3              Alpha,Beta,Gama

After df['Hardware_type'].str.get_dummies(sep=',') I was able to get a data frame like this:

Version  Cluster_Size     Hardware_type      Alpha   Beta   Gama
1.0.4     3              Alpha,Alpha,Alpha     1       0      0
1.0.2     3              Alpha,Beta,Alpha      1       1      0
1.0.9     3              Alpha,Beta,Gama       1       1      1

This is exactly what one-hot encoding should do, but I am trying to achieve something different: each column should hold the count of how many times the categorical value appears in that row's cell, like this:

Version  Cluster_Size     Hardware_type      Alpha   Beta   Gama
1.0.4     3              Alpha,Alpha,Alpha     3       0      0
1.0.2     3              Alpha,Beta,Alpha      2       1      0
1.0.9     3              Alpha,Beta,Gama       1       1      1

Is there a way to do something like this? Thanks for your time.

Upvotes: 1

Views: 517

Answers (1)

jezrael

Reputation: 863146

If you use Series.str.get_dummies, the information about counts is lost: it only returns 0/1 indicators for whether a value is present in the row.
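To make that concrete, here is a small sketch on data rebuilt from the question (the DataFrame construction is my assumption about how the data is stored, with Version as strings). Even the row with three Alpha entries only gets a 1:

```python
import pandas as pd

# Sample data rebuilt from the question (assumed layout)
df = pd.DataFrame({
    "Version": ["1.0.4", "1.0.2", "1.0.9"],
    "Cluster_Size": [3, 3, 3],
    "Hardware_type": ["Alpha,Alpha,Alpha", "Alpha,Beta,Alpha", "Alpha,Beta,Gama"],
})

# get_dummies marks presence only, so repeated values collapse to 1
dummies = df["Hardware_type"].str.get_dummies(sep=",")
print(dummies)
#    Alpha  Beta  Gama
# 0      1     0     0
# 1      1     1     0
# 2      1     1     1
```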

So another solution is needed. Here Counter is used together with the DataFrame constructor:

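from collections import Counter
import pandas as pd

# Count how often each hardware type occurs in every row
L = [Counter(x.split(',')) for x in df['Hardware_type']]
# One column per type; types missing from a row become 0
df = df.join(pd.DataFrame(L, index=df.index).fillna(0).astype(int))
print(df)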
  Version  Cluster_Size      Hardware_type  Alpha  Beta  Gama
0   1.0.4             3  Alpha,Alpha,Alpha      3     0     0
1   1.0.2             3   Alpha,Beta,Alpha      2     1     0
2   1.0.9             3    Alpha,Beta,Gama      1     1     1

A solution with Series.str.split, DataFrame.stack and SeriesGroupBy.value_counts is also possible, but it is likely to be slower (this depends on the data, so it is best to test it):

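# Split into one value per cell, then stack into a long Series
s = df['Hardware_type'].str.split(',', expand=True).stack()
# Count values per original row and pivot the counts back into columns
df = df.join(s.groupby(level=0).value_counts().unstack(fill_value=0))
print(df)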
  Version  Cluster_Size      Hardware_type  Alpha  Beta  Gama
0   1.0.4             3  Alpha,Alpha,Alpha      3     0     0
1   1.0.2             3   Alpha,Beta,Alpha      2     1     0
2   1.0.9             3    Alpha,Beta,Gama      1     1     1
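Since the speed depends on the data, a rough way to compare the two approaches is to time them yourself. This is only a sketch: the function names, the sample frame and the repetition factor of 2000 are made up for illustration, and timings will differ by machine and pandas version.

```python
from collections import Counter
from timeit import timeit

import pandas as pd


def counts_counter(df):
    # Counter approach: one Counter per row, then the DataFrame constructor
    rows = [Counter(x.split(",")) for x in df["Hardware_type"]]
    return pd.DataFrame(rows, index=df.index).fillna(0).astype(int)


def counts_value_counts(df):
    # split + stack + groupby/value_counts approach
    s = df["Hardware_type"].str.split(",", expand=True).stack()
    return s.groupby(level=0).value_counts().unstack(fill_value=0)


# Hypothetical test data: the three rows from the question, repeated
sample = pd.DataFrame(
    {"Hardware_type": ["Alpha,Alpha,Alpha", "Alpha,Beta,Alpha", "Alpha,Beta,Gama"]}
)
big = pd.concat([sample] * 2000, ignore_index=True)

# Check that both approaches agree before timing them
a = counts_counter(big)
b = counts_value_counts(big)
same = bool((a.to_numpy() == b[a.columns].to_numpy()).all())
print("same result:", same)

t_counter = timeit(lambda: counts_counter(big), number=5)
t_value_counts = timeit(lambda: counts_value_counts(big), number=5)
print(f"Counter:      {t_counter:.3f} s")
print(f"value_counts: {t_value_counts:.3f} s")
```

Swap in your own DataFrame for big to see which one is faster for your number of rows and distinct values.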

Upvotes: 2
