Reputation: 2324
Let say I have this simplified dataframe with three variables:
ID sample test_result
P1 Normal 9
P2 Normal 7
P3 Normal 2
P1 Tumor 6
P2 Tumor 5
P3 Tumor 3
I want to know the difference between the normal and tumor test_results
, but I am not sure how to approach it.
I have tried using groupby on sample
column and then use the diff()
method on test_result
column but that did not work.
Here is what I have tried:
df.groupby('sample')['test_result'].diff()
The output I am expecting is like:
ID test_result
P1 3 # the difference between P1 Normal and P1 Tumor (i.e. 9-6)
P2 2
P3 -1
Any idea how to tackle this?
Upvotes: 1
Views: 3490
Reputation: 323226
Or we can keep using groupby
df['New']=df.sort_values('ID').groupby('ID')['test_result'].diff()
df
Out[636]:
ID sample test_result New
0 P1 Normal 9 NaN
1 P2 Normal 7 NaN
2 P3 Normal 2 NaN
3 P1 Tumor 6 -3.0
4 P2 Tumor 5 -2.0
5 P3 Tumor 3 1.0
df.dropna()
Out[637]:
ID sample test_result New
3 P1 Tumor 6 -3.0
4 P2 Tumor 5 -2.0
5 P3 Tumor 3 1.0
df.dropna()[['ID','New']]
Out[639]:
ID New
3 P1 -3.0
4 P2 -2.0
5 P3 1.0
Upvotes: 0
Reputation: 214927
You can pivot sample
column to the headers and then subtract between Normal and Tumor:
df.pivot('ID', 'sample', 'test_result').assign(diff = lambda x: x.Normal - x.Tumor)['diff']
#ID
#P1 3
#P2 2
#P3 -1
#Name: diff, dtype: int64
Upvotes: 2