Elysire
Elysire

Reputation: 713

Pandas - Merge rows and add columns with 'get_dummies'

With the following dataframe:

import pandas as pd
df=pd.DataFrame(data=[[1,5179530,'rs10799170',8.1548,'E001'], [1,5179530,'rs10799170',8.1548,'E002'], [1,5179530,'rs10799170',8.1548,'E003'], [1,455521,'rs235884',2.584,'E003'], [1,455521,'rs235884',2.584,'E007']], col    umns=['CHR','BP','SNP','CM','ANNOT'])

   CHR       BP         SNP      CM ANNOT
0    1  5179530  rs10799170  8.1548  E001
1    1  5179530  rs10799170  8.1548  E002
2    1  5179530  rs10799170  8.1548  E003
3    1   455521    rs235884  2.5840  E003
4    1   455521    rs235884  2.5840  E007

I would like to obtain

   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1  5179530  rs10799170  8.1548     1     1     1     0  
1    1   455521    rs235884  2.5840     0     0     1     1

I tried groupby() and get_dummies() separately

df.groupby(['CHR','BP','SNP','CM']).sum()

    CHR BP      SNP        CM         ANNOT           
1   455521  rs235884   2.5840      E003E007
    5179530 rs10799170 8.1548  E001E002E003

pd.get_dummies(df['ANNOT'])

    E001  E002  E003  E007
0     1     0     0     0
1     0     1     0     0
2     0     0     1     0
3     0     0     1     0
4     0     0     0     1

But I don't know how to combine both or if there is another way.

Upvotes: 13

Views: 10694

Answers (3)

Zero
Zero

Reputation: 76947

Here's one way, using groupby and apply

In [66]: (df.groupby(['CHR', 'BP', 'SNP', 'CM'])
            .apply(lambda x: {y:1 for y in x['ANNOT']})
            .apply(pd.Series)
            .fillna(0)
            .reset_index())
Out[66]:
   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1   455521    rs235884  2.5840   0.0   0.0   1.0   1.0
1    1  5179530  rs10799170  8.1548   1.0   1.0   1.0   0.0

Upvotes: 4

Roelant
Roelant

Reputation: 5119

You are very close! Just combine the two techniques:

dummies = pd.get_dummies(df['ANNOT'])
combine = pd.concat([df, dummies], axis=1)
out = combine.groupby(['BP','CHR','SNP','CM']).sum().reset_index()

Or depending on your application you might want to use .max instead of sum. Note that I changed the order in the groupby to prevent one CHR group. just get the results in the order you want with:

out = out[['CHR', 'BP', 'SNP', 'CM'] + list(dummies)]

Upvotes: 7

bunji
bunji

Reputation: 5223

As @Dadep points out in their comment, this can be achieved with a pivot table. If you want to stick to your get_dummies + groupby technique though you can do something like:

pd.concat([df, pd.get_dummies(df.ANNOT)], 1).groupby(['CHR','BP','SNP','CM']).sum().reset_index()

This first concatenates your dataframe and the output of the get_dummies call, then it groups the result according to the relevant columns, takes the sum of those columns among those groups and then resets the index so you don't have to deal with a multi-index data frame. The result looks like:

   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1   455521    rs235884  2.5840     0     0     1     1
1    1  5179530  rs10799170  8.1548     1     1     1     0

Upvotes: 11

Related Questions