Reputation: 3208
I have a pandas DataFrame in which some columns contain Python lists of strings.
I'd like to transform those columns into dummies that "binarize" the strings and count their appearances.
As a simple example, we can look at the following:
import pandas as pd
df = pd.DataFrame({"Hey":[['t1', 't2', 't1', 't3', 't1', 't3'], ['t2', 't2', 't1']]})
df
Out[54]:
Hey
0 [t1, t2, t1, t3, t1, t3]
1 [t2, t2, t1]
I've managed to do the following:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df['Hey']), columns=list(map(lambda x: 'Hey_' + x, mlb.classes_)))
Out[55]:
Hey_t1 Hey_t2 Hey_t3
0 1 1 1
1 1 1 0
This doesn't count their appearances; it only yields 1 for presence and 0 for absence. I'd like the following output:
Hey_t1 Hey_t2 Hey_t3
0 3 1 2
1 1 2 0
Which counts their appearances.
Upvotes: 1
Views: 242
Reputation: 863166
Use CountVectorizer, but it is necessary to join the lists into strings first:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
counts = countvec.fit_transform(df['Hey'].str.join(' '))
df = pd.DataFrame(counts.toarray(), columns=countvec.get_feature_names_out())
print (df)
t1 t2 t3
0 3 1 2
1 1 2 0
Another solution:
df1 = (pd.DataFrame(df['Hey'].values.tolist())
.stack()
.groupby(level=0)
.value_counts()
.unstack(fill_value=0))
print (df1)
t1 t2 t3
0 3 1 2
1 1 2 0
Or:
from collections import Counter
df1 = (pd.DataFrame([Counter(x) for i, x in df['Hey'].items()], index=df.index)
.fillna(0).astype(int))
print (df1)
t1 t2 t3
0 3 1 2
1 1 2 0
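The outputs above use the bare labels as column names, while the question asks for `Hey_`-prefixed columns. A minimal self-contained sketch of the Counter approach that also adds the prefix (the name `counts` is just for illustration):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"Hey": [['t1', 't2', 't1', 't3', 't1', 't3'], ['t2', 't2', 't1']]})

# One Counter per row: maps each string to the number of times it appears.
counts = pd.DataFrame([Counter(x) for x in df['Hey']], index=df.index)

# Labels missing from a row come out as NaN; make them 0 and restore integers.
counts = counts.fillna(0).astype(int)

# Sort the label columns and prefix them with the source column name.
counts = counts.sort_index(axis=1).add_prefix('Hey_')
print(counts)
```

The result can be attached to the original frame with `df.join(counts)` if the other columns need to be kept.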
Upvotes: 5
Reputation: 2374
I think you have a misunderstanding about sklearn.preprocessing.MultiLabelBinarizer. As the name Binarizer suggests, it only records whether a key occurs. That is to say, the value is binarized: if a key occurs, it is 1, otherwise it is 0. It doesn't count occurrences.
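To make the distinction concrete, here is a pure-Python sketch that mimics the two behaviours on the data from the question. It does not call scikit-learn; it only illustrates presence/absence versus counting.

```python
from collections import Counter

rows = [['t1', 't2', 't1', 't3', 't1', 't3'], ['t2', 't2', 't1']]

# All distinct labels across the rows, in sorted order.
labels = sorted(set().union(*rows))

# Binarized: 1 if the label is present in the row at all, otherwise 0.
binarized = [[int(label in row) for label in labels] for row in rows]

# Counted: how many times each label appears in the row.
counted = [[Counter(row)[label] for label in labels] for row in rows]

print(labels)     # ['t1', 't2', 't3']
print(binarized)  # [[1, 1, 1], [1, 1, 0]]
print(counted)    # [[3, 1, 2], [1, 2, 0]]
```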
Upvotes: 0
Reputation: 402813
Concise Counter-based alternative:
from collections import Counter
df = (pd.DataFrame([Counter(x) for i, x in df['Hey'].items()], index=df.index)
.fillna(0).astype(int))
df
t1 t2 t3
0 3 1 2
1 1 2 0
Upvotes: 1