Reputation: 713
With the following dataframe:
import pandas as pd
df=pd.DataFrame(data=[[1,5179530,'rs10799170',8.1548,'E001'], [1,5179530,'rs10799170',8.1548,'E002'], [1,5179530,'rs10799170',8.1548,'E003'], [1,455521,'rs235884',2.584,'E003'], [1,455521,'rs235884',2.584,'E007']], col umns=['CHR','BP','SNP','CM','ANNOT'])
CHR BP SNP CM ANNOT
0 1 5179530 rs10799170 8.1548 E001
1 1 5179530 rs10799170 8.1548 E002
2 1 5179530 rs10799170 8.1548 E003
3 1 455521 rs235884 2.5840 E003
4 1 455521 rs235884 2.5840 E007
I would like to obtain
CHR BP SNP CM E001 E002 E003 E007
0 1 5179530 rs10799170 8.1548 1 1 1 0
1 1 455521 rs235884 2.5840 0 0 1 1
I tried groupby()
and get_dummies()
separately
df.groupby(['CHR','BP','SNP','CM']).sum()
CHR BP SNP CM ANNOT
1 455521 rs235884 2.5840 E003E007
5179530 rs10799170 8.1548 E001E002E003
pd.get_dummies(df['ANNOT'])
E001 E002 E003 E007
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 1 0
4 0 0 0 1
But I don't know how to combine both or if there is another way.
Upvotes: 13
Views: 10694
Reputation: 76947
Here's one way, using groupby
and apply
In [66]: (df.groupby(['CHR', 'BP', 'SNP', 'CM'])
.apply(lambda x: {y:1 for y in x['ANNOT']})
.apply(pd.Series)
.fillna(0)
.reset_index())
Out[66]:
CHR BP SNP CM E001 E002 E003 E007
0 1 455521 rs235884 2.5840 0.0 0.0 1.0 1.0
1 1 5179530 rs10799170 8.1548 1.0 1.0 1.0 0.0
Upvotes: 4
Reputation: 5119
You are very close! Just combine the two techniques:
dummies = pd.get_dummies(df['ANNOT'])
combine = pd.concat([df, dummies], axis=1)
out = combine.groupby(['BP','CHR','SNP','CM']).sum().reset_index()
Or depending on your application you might want to use .max
instead of sum
. Note that I changed the order in the groupby to prevent one CHR group. just get the results in the order you want with:
out = out[['CHR', 'BP', 'SNP', 'CM'] + list(dummies)]
Upvotes: 7
Reputation: 5223
As @Dadep points out in their comment, this can be achieved with a pivot table. If you want to stick to your get_dummies
+ groupby
technique though you can do something like:
pd.concat([df, pd.get_dummies(df.ANNOT)], 1).groupby(['CHR','BP','SNP','CM']).sum().reset_index()
This first concatenates your dataframe and the output of the get_dummies
call, then it groups the result according to the relevant columns, takes the sum of those columns among those groups and then resets the index so you don't have to deal with a multi-index data frame. The result looks like:
CHR BP SNP CM E001 E002 E003 E007
0 1 455521 rs235884 2.5840 0 0 1 1
1 1 5179530 rs10799170 8.1548 1 1 1 0
Upvotes: 11