Reputation: 2710
I am trying to reshape a dataframe to create a kind of occurrence matrix but without success.
Is pandas.get_dummies() the right way to do this at all?
Here is what I tried so far:
import pandas as pd
xlst_entries = [[u'aus', u'fra', u'gbr'],[u'gbr', u'prt'],[u'chn'],[u'bel', u'gbr'],[u'gbr', u'prt'],[u'gbr', u'prt'],[u'gbr', u'prt']]
qq1 = pd.DataFrame(xlst_entries)
qq2 = pd.get_dummies(data= qq1, prefix=None)
qq2
But the result I want is:
index fra bel chn prt aus gbr
0 1 0 0 0 1 1
1 0 0 0 1 0 1
2 0 0 1 0 0 0
3 0 1 0 0 0 1
4 0 0 0 1 0 1
5 0 0 0 1 0 1
6 0 0 0 1 0 1
Upvotes: 2
Views: 112
Reputation: 13274
This is a somewhat general helper function that should work on almost any DataFrame. It is written for Python 3, where map and filter return lazy iterators, so list comprehensions are used instead:
def get_multiple_dummies(dframe):
    from functools import reduce
    # One indicator frame per column of the input.
    combined = [pd.get_dummies(dframe.iloc[:, i], dtype=int)
                for i in range(len(dframe.columns))]
    # Union of the labels seen in any column, in sorted order.
    allcolumns = sorted(set(reduce(list.__add__,
                                   [x.columns.tolist() for x in combined])))
    # Give every indicator frame the full set of columns, filling gaps with 0.
    combined = [x.reindex(columns=allcolumns, fill_value=0) for x in combined]
    # Add the aligned frames element-wise.
    return reduce(lambda x, y: x + y, combined)
print(get_multiple_dummies(qq1))
aus bel chn fra gbr prt
0 1 0 0 1 1 0
1 0 0 0 0 1 1
2 0 0 1 0 0 0
3 0 1 0 0 1 0
4 0 0 0 0 1 1
5 0 0 0 0 1 1
6 0 0 0 0 1 1
[7 rows x 6 columns]
Upvotes: 1
Reputation: 29711
You could tweak the parameters of get_dummies so that the prefix of the generated columns is removed, then sum the columns that share a name to obtain the desired frame:
df = pd.get_dummies(qq1, prefix='', prefix_sep='', dtype=int)
# Sum same-named columns; transposing avoids groupby(axis=1), which recent pandas deprecates.
df.T.groupby(level=0).sum().T
aus bel chn fra gbr prt
0 1 0 0 1 1 0
1 0 0 0 0 1 1
2 0 0 1 0 0 0
3 0 1 0 0 1 0
4 0 0 0 0 1 1
5 0 0 0 0 1 1
6 0 0 0 0 1 1
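The reason the sum is needed is that get_dummies on the whole frame creates one indicator column per (source column, value) pair, so once the prefix is removed a value such as gbr shows up under several identically named columns. Here is a minimal self-contained sketch of that, assuming the data from the question; the names entries, dummies and occurrence are only illustrative, and the transposed groupby is just one possible way to merge same-named columns:

```python
import pandas as pd

entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'], ['bel', 'gbr'],
           ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]
qq1 = pd.DataFrame(entries)

# One indicator column per (source column, value) pair. With the prefix
# removed, several columns end up sharing the same name.
dummies = pd.get_dummies(qq1, prefix='', prefix_sep='', dtype=int)
print(dummies.columns.tolist())

# Merge same-named columns by grouping the transposed frame on its labels,
# then transpose back so rows are the original observations again.
occurrence = dummies.T.groupby(level=0).sum().T
print(occurrence)
```

Note that a value appearing twice in the same row would be counted as 2 here rather than 1; clipping the result to 1 would turn it back into a pure occurrence matrix.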
Upvotes: 1
Reputation: 33803
You can preprocess xlst_entries to join the entries of each row into a single string separated by |, then use Series.str.get_dummies:
xlst_entries = ['|'.join(x) for x in xlst_entries]
qq1 = pd.Series(xlst_entries).str.get_dummies()
The resulting output:
aus bel chn fra gbr prt
0 1 0 0 1 1 0
1 0 0 0 0 1 1
2 0 0 1 0 0 0
3 0 1 0 0 1 0
4 0 0 0 0 1 1
5 0 0 0 0 1 1
6 0 0 0 0 1 1
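For completeness, the same idea as a self-contained sketch that leaves the original list untouched. It assumes the question's data and that no label contains the separator character itself; the names entries, joined and occurrence are only illustrative. The sep argument is spelled out here, although | is also the default:

```python
import pandas as pd

entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'], ['bel', 'gbr'],
           ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]

# Assumption: no label contains '|', otherwise it would be split in two.
joined = pd.Series(['|'.join(row) for row in entries])

# Each string is split on the separator and every distinct token
# becomes one 0/1 indicator column.
occurrence = joined.str.get_dummies(sep='|')
print(occurrence)
```

If the labels could contain |, any other separator that does not occur in the data can be passed to both join and sep.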
Upvotes: 1