Eran Moshe

Reputation: 3208

Pandas - Get dummies from an iterable of iterables with count of appearances

I have a pandas DataFrame in which some columns hold Python lists of strings. I'd like to transform those columns into dummy columns that count how often each string appears, rather than just "binarizing" the strings.

As a simple example, consider the following:

import pandas as pd

df = pd.DataFrame({"Hey":[['t1', 't2', 't1', 't3', 't1', 't3'], ['t2', 't2', 't1']]})

df
Out[54]: 
                        Hey
0  [t1, t2, t1, t3, t1, t3]
1              [t2, t2, t1]

I've managed to do the following:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

pd.DataFrame(mlb.fit_transform(df['Hey']), columns=list(map(lambda x: 'Hey_' + x, mlb.classes_)))
Out[55]: 
   Hey_t1  Hey_t2  Hey_t3
0       1       1       1
1       1       1       0

This doesn't count their appearances; it only yields 1 for presence and 0 for absence. I'd like the following output:

   Hey_t1  Hey_t2  Hey_t3
0       3       1       2
1       1       2       0

Here each cell holds the number of times the string appears in that row's list.

Upvotes: 1

Views: 242

Answers (3)

jezrael

Reputation: 863166

Use CountVectorizer, but the lists must first be joined into strings:

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
counts = countvec.fit_transform(df['Hey'].str.join(' '))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(counts.toarray(), columns=countvec.get_feature_names_out())
print (df)
   t1  t2  t3
0   3   1   2
1   1   2   0

Another solution:

df1 = (pd.DataFrame(df['Hey'].values.tolist())
        .stack()
        .groupby(level=0)
        .value_counts()
        .unstack(fill_value=0))
print (df1)
   t1  t2  t3
0   3   1   2
1   1   2   0

Or:

from collections import Counter

# Series.iteritems() was removed in pandas 2.0; items() is the replacement
df1 = (pd.DataFrame([Counter(x) for i, x in df['Hey'].items()], index=df.index)
        .fillna(0).astype(int))
print (df1)
   t1  t2  t3
0   3   1   2
1   1   2   0
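
The outputs above use the bare tag names as columns, while the question asks for a Hey_ prefix. A minimal sketch of one way to add it, assuming the same sample data as the question; the name counts is only illustrative, and add_prefix is a standard DataFrame method:

```python
from collections import Counter

import pandas as pd

# Sample data from the question
df = pd.DataFrame({"Hey": [['t1', 't2', 't1', 't3', 't1', 't3'],
                           ['t2', 't2', 't1']]})

# One Counter per row; tags missing from a row become NaN, then 0
counts = (pd.DataFrame([Counter(tags) for tags in df['Hey']], index=df.index)
          .fillna(0).astype(int)
          .sort_index(axis=1)      # stable column order: t1, t2, t3
          .add_prefix('Hey_'))     # Hey_t1, Hey_t2, Hey_t3
print(counts)
```

The result can then be joined back onto the original frame with df.join(counts) if the other columns are needed.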

Upvotes: 5

youkaichao

Reputation: 2374

I think you have misunderstood sklearn.preprocessing.MultiLabelBinarizer. As the name Binarizer suggests, it only records whether a key occurs. That is to say, the value is binarized: it is 1 if the key occurs and 0 otherwise. It doesn't count occurrences.
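
To make the difference concrete, here is a small plain-Python sketch of the two encodings on the data from the question. It does not call scikit-learn; it only mimics what "binarized" versus "counted" means, and the variable names are illustrative:

```python
from collections import Counter

# The two lists from the question
rows = [['t1', 't2', 't1', 't3', 't1', 't3'], ['t2', 't2', 't1']]

# All distinct tags, in sorted order
classes = sorted({tag for row in rows for tag in row})

# Binarized: 1 if the tag is present in the row, 0 otherwise
binary = [[int(tag in row) for tag in classes] for row in rows]

# Counted: how many times the tag appears in the row
counted = [[Counter(row)[tag] for tag in classes] for row in rows]

print(classes)
print(binary)
print(counted)
```

The binary rows match what the question got from MultiLabelBinarizer, and the counted rows match the output the question asks for.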

Upvotes: 0

cs95

Reputation: 402813

A concise Counter-based alternative:

from collections import Counter
# fillna's downcast keyword is deprecated in recent pandas; cast explicitly instead
df = (pd.DataFrame([Counter(x) for i, x in df['Hey'].items()], index=df.index)
        .fillna(0).astype(int))
df
   t1  t2  t3
0   3   1   2
1   1   2   0

Upvotes: 1
