MEhsan
MEhsan

Reputation: 2324

How to use groupby and take the difference between the two groups?

Let say I have this simplified dataframe with three variables:

 ID    sample  test_result
 P1    Normal           9
 P2    Normal           7
 P3    Normal           2
 P1     Tumor           6
 P2     Tumor           5
 P3     Tumor           3

I want to know the difference between the normal and tumor test_results, but I am not sure how to approach it.

I have tried using groupby on sample column and then use the diff() method on test_result column but that did not work.

Here is what I have tried:

df.groupby('sample')['test_result'].diff()

The output I am expecting is like:

ID   test_result
P1             3 # the difference between P1 Normal and P1 Tumor (i.e. 9-6)
P2             2
P3            -1 

Any idea how to tackle this?

Upvotes: 1

Views: 3490

Answers (2)

BENY
BENY

Reputation: 323226

Or we can keep using groupby

df['New']=df.sort_values('ID').groupby('ID')['test_result'].diff()
df
Out[636]: 
   ID  sample  test_result  New
0  P1  Normal            9  NaN
1  P2  Normal            7  NaN
2  P3  Normal            2  NaN
3  P1   Tumor            6 -3.0
4  P2   Tumor            5 -2.0
5  P3   Tumor            3  1.0
df.dropna()
Out[637]: 
   ID sample  test_result  New
3  P1  Tumor            6 -3.0
4  P2  Tumor            5 -2.0
5  P3  Tumor            3  1.0


df.dropna()[['ID','New']]
Out[639]: 
   ID  New
3  P1 -3.0
4  P2 -2.0
5  P3  1.0

Upvotes: 0

akuiper
akuiper

Reputation: 214927

You can pivot sample column to the headers and then subtract between Normal and Tumor:

df.pivot('ID', 'sample', 'test_result').assign(diff = lambda x: x.Normal - x.Tumor)['diff']

#ID
#P1    3
#P2    2
#P3   -1
#Name: diff, dtype: int64

Upvotes: 2

Related Questions