Reputation: 17
I am implementing one-hot encoding on the following data:
Version Cluster_Size Hardware_type
1.0.4 3 Alpha,Alpha,Alpha
1.0.2 3 Alpha,Beta,Alpha
1.0.9 3 Alpha,Beta,Gama
After df['Hardware_type'].str.get_dummies(sep=',') I was able to get a data frame like this:
Version Cluster_Size Hardware_type Alpha Beta Gama
1.0.4 3 Alpha,Alpha,Alpha 1 0 0
1.0.2 3 Alpha,Beta,Alpha 1 1 0
1.0.9 3 Alpha,Beta,Gama 1 1 1
This is exactly what one-hot encoding should do, but I am trying to achieve something like the following, where each new column holds the number of times that categorical value appears in the row's Hardware_type cell:
Version Cluster_Size Hardware_type Alpha Beta Gama
1.0.4 3 Alpha,Alpha,Alpha 3 0 0
1.0.2 3 Alpha,Beta,Alpha 2 1 0
1.0.9 3 Alpha,Beta,Gama 1 1 1
Is there a way to do something like this? Thanks for your time.
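For reference, the setup described above can be reproduced with a minimal sketch like the one below. It assumes the hardware names are spelled as in the expected output and that the values are separated by a plain comma with no space; the variable name `dummies` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Version": ["1.0.4", "1.0.2", "1.0.9"],
    "Cluster_Size": [3, 3, 3],
    "Hardware_type": ["Alpha,Alpha,Alpha", "Alpha,Beta,Alpha", "Alpha,Beta,Gama"],
})

# get_dummies only records presence (0/1), so repeated values collapse to 1
dummies = df["Hardware_type"].str.get_dummies(sep=",")
print(df.join(dummies))
```

The indicator columns show whether a value is present in a row, which is why the repeated Alpha entries are not counted.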
Upvotes: 1
Views: 517
Reputation: 863146
If you use Series.str.get_dummies, the information about counts is lost. So another solution is needed. Here Counter is used with the DataFrame constructor:
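from collections import Counter
import pandas as pd
# Count how often each value occurs in every row
L = [Counter(x.split(',')) for x in df['Hardware_type']]
# Values missing from a row become NaN, so fill with 0 and cast to int
df = df.join(pd.DataFrame(L, index=df.index).fillna(0).astype(int))
print(df)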
Version Cluster_Size Hardware_type Alpha Beta Gama
0 1.0.4 3 Alpha,Alpha,Alpha 3 0 0
1 1.0.2 3 Alpha,Beta,Alpha 2 1 0
2 1.0.9 3 Alpha,Beta,Gama 1 1 1
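A self-contained version of the Counter approach might look like the sketch below. The sample frame is rebuilt from the question, and the `strip()` call is an added assumption so that cells written as "Alpha, Beta" (with a space after the comma) are counted the same way; `counts`, `count_df` and `result` are illustrative names.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Version": ["1.0.4", "1.0.2", "1.0.9"],
    "Cluster_Size": [3, 3, 3],
    "Hardware_type": ["Alpha,Alpha,Alpha", "Alpha,Beta,Alpha", "Alpha,Beta,Gama"],
})

# One Counter per row; strip() tolerates optional spaces around the commas
counts = [
    Counter(part.strip() for part in cell.split(","))
    for cell in df["Hardware_type"]
]

# Labels absent from a row come out as NaN, so fill with 0 and cast to int
count_df = pd.DataFrame(counts, index=df.index).fillna(0).astype(int)

# Join on the original index without overwriting df
result = df.join(count_df)
print(result)
```

Keeping the joined frame in a separate variable avoids a column-overlap error if the cell is run twice.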
Alternatively, a solution with Series.str.split, DataFrame.stack and SeriesGroupBy.value_counts is possible, but it should be slower (this depends on the data, so it is best to test it):
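# Split into one value per cell, then stack into a long Series indexed by row
s = df['Hardware_type'].str.split(',', expand=True).stack()
# Count values per original row and pivot the counts back into columns
df = df.join(s.groupby(level=0).value_counts().unstack(fill_value=0))
print(df)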
Version Cluster_Size Hardware_type Alpha Beta Gama
0 1.0.4 3 Alpha,Alpha,Alpha 3 0 0
1 1.0.2 3 Alpha,Beta,Alpha 2 1 0
2 1.0.9 3 Alpha,Beta,Gama 1 1 1
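Since the relative speed depends on the data, one way to test it is a rough sketch like the following. It wraps the two approaches in hypothetical helper functions (`counts_counter` and `counts_stack` are names made up here), repeats the question's three rows 1000 times as stand-in data, checks that both give the same counts, and times them with `timeit`. The timings will vary by machine and pandas version, so no numbers are claimed.

```python
from collections import Counter
import timeit

import pandas as pd


def counts_counter(col):
    # Counter per row, then the DataFrame constructor
    rows = [Counter(x.split(",")) for x in col]
    return pd.DataFrame(rows, index=col.index).fillna(0).astype(int)


def counts_stack(col):
    # split -> stack -> per-row value_counts -> unstack
    s = col.str.split(",", expand=True).stack()
    return s.groupby(level=0).value_counts().unstack(fill_value=0)


# Stand-in data: the three rows from the question repeated 1000 times
col = pd.Series(
    ["Alpha,Alpha,Alpha", "Alpha,Beta,Alpha", "Alpha,Beta,Gama"] * 1000
)

a = counts_counter(col)
b = counts_stack(col)

# Compare values after aligning the column order
same = bool(
    (a.sort_index(axis=1).to_numpy() == b.sort_index(axis=1).to_numpy()).all()
)
print("same result:", same)

for fn in (counts_counter, counts_stack):
    t = timeit.timeit(lambda: fn(col), number=5)
    print(f"{fn.__name__}: {t:.4f} s for 5 runs")
```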
Upvotes: 2