Talysin

Reputation: 363

Custom Dummy Coding in Pandas

I have a dataframe with event data. It has two columns, Primary and Secondary, each containing a list of tags (e.g., ['Fun event', 'Dance party']). A third column, combined, holds both lists joined together:

primary               secondary               combined
['booze', 'party']    ['singing', 'dance']    ['booze', 'party', 'singing', 'dance']
['concert']           ['booze', 'vocals']     ['concert', 'booze', 'vocals']

I want to dummy code the data so that tags in the primary column are coded 1, tags in the secondary column are coded .5, and tags not observed in the row are coded 0. Like so:

combined                                  booze   party   singing   dance   concert   vocals
['booze', 'party', 'singing', 'dance']    1       1       .5        .5      0         0
['concert', 'booze', 'vocals']            .5      0       0         0       1         .5

Upvotes: 0

Views: 479

Answers (2)

BENY

Reputation: 323366

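# one 0/1 column per tag, collapsed back to one row per event
# (groupby(level=0).sum() replaces sum(level=0), which newer pandas no longer accepts)
df1 = pd.get_dummies(df.combined.apply(pd.Series).stack(), dtype=float).groupby(level=0).sum()
# subtract 0.5 where the tag is in that row's secondary list (selected by name, not by column position)
df1[df1.apply(lambda x: [x.name in y for y in df['secondary']])] -= 0.5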

df1
Out[173]: 
   booze  concert  dance  party  singing  vocals
0    1.0        0    0.5      1      0.5     0.0
1    0.5        1    0.0      0      0.0     0.5

Data input:

df = pd.DataFrame({'primary':   [['booze', 'party'] ,  ['concert']],
                   'secondary':   [['singing', 'dance'], ['booze', 'vocals']],
                   'combined': [['booze', 'party', 'singing', 'dance'],   ['concert', 'booze', 'vocals']]})
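With the input above, the same coding can also be sketched without the boolean-mask assignment. This is only a minimal sketch, not the answer's own code: it assumes Series.explode is available (pandas 0.25 or later), and the names dummies, sec and coded are chosen here for illustration. It builds 0/1 indicators for combined and for secondary separately, then subtracts half of the secondary indicators.

```python
import pandas as pd

df = pd.DataFrame({'primary':   [['booze', 'party'], ['concert']],
                   'secondary': [['singing', 'dance'], ['booze', 'vocals']],
                   'combined':  [['booze', 'party', 'singing', 'dance'],
                                 ['concert', 'booze', 'vocals']]})

# 1.0 for every tag that occurs anywhere in the row
dummies = pd.get_dummies(df['combined'].explode(), dtype=float).groupby(level=0).max()

# 1.0 for every tag that occurs in the row's secondary list,
# aligned to the same rows and columns as `dummies`
sec = pd.get_dummies(df['secondary'].explode(), dtype=float).groupby(level=0).max()
sec = sec.reindex(index=dummies.index, columns=dummies.columns, fill_value=0.0)

# secondary tags drop from 1.0 to 0.5, everything else is unchanged
coded = dummies - 0.5 * sec
print(coded)
```

Like the mask-based version, this codes a tag as .5 whenever it is in the secondary list, even if the same row also lists it as primary.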

Upvotes: 1

miraculixx

Reputation: 10379

Here's one approach that works by transforming the primary and secondary columns' values into columns on the dataframe:

df = pd.DataFrame({
        'primary': [['booze', 'party'], ['concert']],
        'secondary': [['singing', 'dance'], ['booze', 'vocals']],
    })

# create primary and secondary indicator columns
iprim = df.primary.apply(lambda v: pd.Series([1] * len(v), index=v))
isec = df.secondary.apply(lambda v: pd.Series([.5] * len(v), index=v))

# join with primary, then update from secondary columns
df = df.join(iprim).join(isec, rsuffix='_')
df.drop([c for c in df.columns if c.endswith('_')], axis=1, inplace=True)
df.update(isec)
df = df.fillna(0)  # fillna returns a new frame, so keep the result

=>

          primary         secondary  booze  concert  party  dance  singing  vocals
0  [booze, party]  [singing, dance]    1.0      0.0    1.0    0.5      0.5     0.0
1       [concert]   [booze, vocals]    0.5      1.0    0.0    0.0      0.0     0.5

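Note that the second .join() uses rsuffix so that secondary tags which already have a column from primary are added under a suffixed name rather than raising a column-overlap error. .drop() then removes these suffixed columns, and .update() writes the secondary values into the existing columns instead. As written, a tag that appears in both lists of the same row ends up as .5; rearrange the steps to prefer primary over secondary.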
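The last remark, preferring primary over secondary, can also be sketched without join/update. The following is a minimal sketch rather than part of the approach above: weighted_dummies is a hypothetical helper name, and it assumes Series.explode (pandas 0.25 or later). It stacks both columns into long format and keeps the largest code per row and tag, so a tag listed in both columns is coded 1.

```python
import pandas as pd

df = pd.DataFrame({
    'primary': [['booze', 'party'], ['concert']],
    'secondary': [['singing', 'dance'], ['booze', 'vocals']],
})

def weighted_dummies(frame, weights):
    """Hypothetical helper: `weights` maps a list-valued column name to its code."""
    pieces = [
        # one (row, tag, weight) record per tag; empty lists explode to NaN and are dropped
        frame[col].explode().dropna()
                  .rename('tag').rename_axis('row').reset_index()
                  .assign(weight=w)
        for col, w in weights.items()
    ]
    long = pd.concat(pieces, ignore_index=True)
    # the highest code wins when a tag occurs in several columns of the same row
    wide = long.groupby(['row', 'tag'])['weight'].max().unstack(fill_value=0.0)
    # restore rows that had no tags at all
    return wide.reindex(frame.index, fill_value=0.0)

codes = weighted_dummies(df, {'primary': 1.0, 'secondary': 0.5})
print(codes)
```

To keep the original columns next to the codes, the result can be joined back with df.join(codes).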

Upvotes: 1
