Reputation: 1504
I have a data that looks like:
index stringColumn
0 A_B_B_B_C_C_D
1 A_B_C_D
2 B_C_D_E_F
3 A_E_F_F_F
I need to vectorize this stringColumn with counts, ending up with:
index A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
Therefore I need to do both: counting and splitting. Pandas str.get_dummies() function allows me to split the string using sep = '_' argument, however it does not count multiple values. pd.get_dummies() does the counting but it does not allow seperator.
My data is huge so I am looking for vectorized solutions, rather than for loops.
Upvotes: 0
Views: 577
Reputation: 863166
You can use Series.str.split
with get_dummies
and sum
:
df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
prefix='', prefix_sep='')
.sum(level=0, axis=1))
Or count values per rows by value_counts
, replace missing values by DataFrame.fillna
and convert to integers:
df1 = (df['stringColumn'].str.split('_', expand=True)
.apply(pd.value_counts, axis=1)
.fillna(0)
.astype(int))
Or use collections.Counter
, performance should be very good:
from collections import Counter
df1 = (pd.DataFrame([Counter(x.split('_')) for x in df['stringColumn']])
.fillna(0)
.astype(int))
Or reshape by DataFrame.stack
and count by SeriesGroupBy.value_counts
:
df1 = (df['stringColumn'].str.split('_', expand=True)
.stack()
.groupby(level=0)
.value_counts()
.unstack(fill_value=0))
print (df1)
A B C D E F
0 1 3 2 1 0 0
1 1 1 1 1 0 0
2 0 1 1 1 1 1
3 1 0 0 0 1 3
Upvotes: 3