meliksahturker

Reputation: 1504

pd.get_dummies() with separator and counts

I have data that looks like:

index  stringColumn
0      A_B_B_B_C_C_D
1      A_B_C_D
2      B_C_D_E_F
3      A_E_F_F_F

I need to vectorize this stringColumn with counts, ending up with:

index  A  B  C  D  E  F
0      1  3  2  1  0  0
1      1  1  1  1  0  0
2      0  1  1  1  1  1
3      1  0  0  0  1  3

Therefore I need to do both: splitting and counting. The pandas Series.str.get_dummies() method lets me split the string with the sep='_' argument, but it does not count repeated values. pd.get_dummies() does the counting but does not accept a separator.

My data is huge so I am looking for vectorized solutions, rather than for loops.
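For reference, a minimal setup reproducing the sample above (df is the DataFrame name used in the answers below):

import pandas as pd

df = pd.DataFrame({
    'stringColumn': ['A_B_B_B_C_C_D', 'A_B_C_D', 'B_C_D_E_F', 'A_E_F_F_F']
})

# str.get_dummies splits on the separator but only yields 0/1 indicators,
# so the three Bs in row 0 still show up as a single 1
print(df['stringColumn'].str.get_dummies(sep='_'))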

Upvotes: 0

Views: 577

Answers (1)

jezrael

Reputation: 863166

You can use Series.str.split with get_dummies and sum:

df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                      prefix='', prefix_sep='')
         .sum(level=0, axis=1))
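Note that recent pandas versions no longer accept the level argument in DataFrame.sum; a sketch of an equivalent for those versions, aggregating the duplicated column labels via a transpose instead:

df1 = (pd.get_dummies(df['stringColumn'].str.split('_', expand=True),
                      prefix='', prefix_sep='')
         .T.groupby(level=0).sum()  # sum rows sharing the same label
         .T)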

Or count values per row with value_counts, replace missing values with DataFrame.fillna, and convert to integers:

df1 = (df['stringColumn'].str.split('_', expand=True)
                         .apply(pd.value_counts, axis=1)
                         .fillna(0)  
                         .astype(int))
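If the top-level pd.value_counts helper is deprecated in your pandas version, the same idea can be written with the Series method instead (a sketch):

df1 = (df['stringColumn'].str.split('_', expand=True)
                         .apply(lambda row: row.value_counts(), axis=1)
                         .fillna(0)
                         .astype(int))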
       

Or use collections.Counter; performance should be very good:

from collections import Counter

df1 = (pd.DataFrame([Counter(x.split('_')) for x in df['stringColumn']])
         .fillna(0)
         .astype(int))
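If df does not have the default RangeIndex, passing the original index keeps the result aligned with the input rows (a small sketch):

df1 = (pd.DataFrame([Counter(x.split('_')) for x in df['stringColumn']],
                    index=df.index)  # align result with the source rows
         .fillna(0)
         .astype(int))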
        

Or reshape by DataFrame.stack and count by SeriesGroupBy.value_counts:

df1 = (df['stringColumn'].str.split('_', expand=True)
                         .stack()
                         .groupby(level=0)
                         .value_counts()
                         .unstack(fill_value=0))

print (df1)

   A  B  C  D  E  F
0  1  3  2  1  0  0
1  1  1  1  1  0  0
2  0  1  1  1  1  1
3  1  0  0  0  1  3

Upvotes: 3
