user1043144

Reputation: 2710

pandas: co-occurrence matrix with get_dummies

I am trying to reshape a DataFrame into a kind of occurrence matrix, but without success.

Is pandas.get_dummies() the right tool for this at all?

Here is what I have tried so far:

import pandas as pd 

xlst_entries = [[u'aus', u'fra', u'gbr'],[u'gbr', u'prt'],[u'chn'],[u'bel', u'gbr'],[u'gbr', u'prt'],[u'gbr', u'prt'],[u'gbr', u'prt']]

qq1 = pd.DataFrame(xlst_entries)

qq2 = pd.get_dummies(data=qq1, prefix=None)
qq2

But the result I want is:

index  fra  bel  chn  prt  aus  gbr
    0    1    0    0    0    1    1
    1    0    0    0    1    0    1
    2    0    0    1    0    0    0
    3    0    1    0    0    0    1
    4    0    0    0    1    0    1
    5    0    0    0    1    0    1
    6    0    0    0    1    0    1

Upvotes: 2

Views: 112

Answers (3)

Abdou

Reputation: 13274

This is a fairly general helper function that should work on almost any DataFrame. It uses list comprehensions instead of map and filter, so it behaves the same on Python 2 and Python 3:

def get_multiple_dummies(dframe):
    from functools import reduce
    # One indicator frame per input column, cast to int so the frames can be summed
    combined = [pd.get_dummies(dframe.iloc[:, i]).astype(int)
                for i in range(len(dframe.columns))]
    # Every label that occurs in any column
    allcolumns = sorted(set(reduce(list.__add__, [c.columns.tolist() for c in combined])))
    # Give each frame the full set of columns, filling absent labels with 0
    combined = [c.reindex(columns=allcolumns, fill_value=0) for c in combined]
    return reduce(lambda x, y: x + y, combined)

print(get_multiple_dummies(qq1))

   aus  bel  chn  fra  gbr  prt
0    1    0    0    1    1    0
1    0    0    0    0    1    1
2    0    0    1    0    0    0
3    0    1    0    0    1    0
4    0    0    0    0    1    1
5    0    0    0    0    1    1
6    0    0    0    0    1    1

[7 rows x 6 columns]
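
The column-by-column idea can also be written without an explicit helper. The following is only a sketch of a different technique (stacking the frame into one long Series first), not part of the helper above; the names `stacked` and `occurrence` are made up for illustration, and it assumes every row has at least one label.

```python
import pandas as pd

xlst_entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'], ['bel', 'gbr'],
                ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]
qq1 = pd.DataFrame(xlst_entries)

# Long format: one label per (row, position) pair, indexed by a two-level index.
# Any missing values that remain are ignored by get_dummies by default.
stacked = qq1.stack()

# One indicator column per label, then sum back to one row per original row
occurrence = pd.get_dummies(stacked).astype(int).groupby(level=0).sum()

print(occurrence)
```

Grouping on the first index level sums the indicators of all positions that belong to the same original row.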

Upvotes: 1

Nickil Maveli

Reputation: 29711

You could set the parameters of get_dummies so that no prefix is added to the generated column names, then sum the columns that share a name to obtain the desired frame.

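# qq1 is the DataFrame built in the question
df = pd.get_dummies(qq1, prefix='', prefix_sep='')

# axis=1 groups columns by name; pandas 2.1+ deprecates it (transpose and group rows instead)
df.groupby(df.columns, axis=1).sum().astype(int)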

   aus  bel  chn  fra  gbr  prt
0    1    0    0    1    1    0
1    0    0    0    0    1    1
2    0    0    1    0    0    0
3    0    1    0    0    1    0
4    0    0    0    0    1    1
5    0    0    0    0    1    1
6    0    0    0    0    1    1
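
Since column-wise grouping with `axis=1` is deprecated in recent pandas releases, the same sum can be expressed by transposing first. This is a minimal sketch under that assumption; the names `dummies` and `result` are illustrative only.

```python
import pandas as pd

xlst_entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'], ['bel', 'gbr'],
                ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]
qq1 = pd.DataFrame(xlst_entries)

# Empty prefix and separator leave bare labels, so a label such as 'gbr'
# appears once per position it occurred in (duplicate column names)
dummies = pd.get_dummies(qq1, prefix='', prefix_sep='').astype(int)

# Transpose so the duplicated labels become index values, sum each group
# of identical labels, then transpose back to one row per original row
result = dummies.T.groupby(level=0).sum().T

print(result)
```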

Upvotes: 1

root

Reputation: 33803

You can preprocess xlst_entries by joining each inner list into a single string separated by |, then use Series.str.get_dummies, which splits on | by default:

xlst_entries = ['|'.join(x) for x in xlst_entries]
qq1 = pd.Series(xlst_entries).str.get_dummies()

The resulting output:

   aus  bel  chn  fra  gbr  prt
0    1    0    0    1    1    0
1    0    0    0    0    1    1
2    0    0    1    0    0    0
3    0    1    0    0    1    0
4    0    0    0    0    1    1
5    0    0    0    0    1    1
6    0    0    0    0    1    1
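
The joining trick relies on the separator never appearing inside a label. A self-contained sketch that makes this assumption explicit is shown here; `SEP`, `joined`, and `occurrence` are illustrative names, and the guard is an addition rather than something the approach requires.

```python
import pandas as pd

xlst_entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'], ['bel', 'gbr'],
                ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]

SEP = '|'  # also the default separator of Series.str.get_dummies

# Guard: a label containing SEP would be split into two bogus labels
if any(SEP in label for entry in xlst_entries for label in entry):
    raise ValueError('a label contains the separator; choose another SEP')

# One delimited string per row, e.g. 'aus|fra|gbr'
joined = pd.Series([SEP.join(entry) for entry in xlst_entries])

# Split each string on SEP and build one 0/1 column per distinct label
occurrence = joined.str.get_dummies(sep=SEP)

print(occurrence)
```

Keeping the joined strings in a new variable leaves the original xlst_entries list untouched for later use.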

Upvotes: 1
