Reputation: 11793
I made up a dataframe to illustrate my question. Say I have three patients: 'a', 'b', 'c'. We got results from these patients at three different time points (t1, t2, t3). What I need is to create another column 'Fold', which is the fold change from t1. Since patient 'c' does not have any result at t1, its fold change vs t1 should be NaN.
Below is the code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': np.repeat(['t1', 't2', 't3'], [2, 3, 3]),
    'id': ['a', 'b', 'a', 'b', 'c', 'a', 'b', 'c'],
    'result': np.random.randint(10, 20, size=8)})
# create indicator column has_t1: if a patient has t1: 1 if not: 0
df['is_t1'] = np.where(df['time']=='t1', 1, 0)
df['has_t1'] = df.groupby('id')['is_t1'].transform(sum)
# create fold change column
df['fold'] =df.sort_values(['id', 'time']).groupby('id').apply(lambda x: x['result']/x['result'].iloc[0] if x['has_t1'].iloc[0]==1 else np.nan)
I got this error:
AttributeError: 'float' object has no attribute 'index'
My desired output is something like:
         Fold
id time
a  t1    1.000000
   t2    1.545455
   t3    1.000000
b  t1    1.000000
   t2    1.062500
   t3    0.937500
c
   t2    NaN
   t3    NaN
Does anyone know what I did wrong? Thank you for the help in advance.
Upvotes: 2
Views: 719
Reputation: 19947
Another way using groupby, apply and stack.
def fold(x):
    df_tmp = x.set_index('time')
    df_tmp = df_tmp.reindex(df_tmp.index.union(['t1']))
    return df_tmp.result.div(df_tmp.loc['t1','result'])

df.groupby('id').apply(fold).stack(dropna=False)
Out[229]:
id result
a t1 1.000000
t2 0.923077
t3 0.923077
b t1 1.000000
t2 1.300000
t3 1.400000
c t1 NaN
t2 NaN
t3 NaN
Upvotes: 0
Reputation: 323276
Here is another way:
s = df.set_index(['id', 'time']).reindex(pd.MultiIndex.from_product([df['id'].unique(), df['time'].unique()]))
s = s.sort_index()
s.result.div(s.groupby(level=0).result.nth(0), level=0)
Out[256]:
a t1 1.000000
t2 1.900000
t3 1.800000
b t1 1.000000
t2 0.736842
t3 0.578947
c t1 NaN
t2 NaN
t3 NaN
Name: result, dtype: float64
Upvotes: 1
Reputation: 402553
Here's an alternate approach that does not involve an indicator column. First, unstack and then re-stack without dropping NaNs:
df = df.set_index(['id', 'time']).unstack().stack(dropna=False)
df
result
id time
a t1 12.0
t2 18.0
t3 13.0
b t1 13.0
t2 11.0
t3 13.0
c t1 NaN
t2 13.0
t3 17.0
Next, call groupby + transform + head and divide df.result by this output:
df['result'] /= df.groupby(level=0).result.transform('head', 1)
df
result
id time
a t1 1.000000
t2 1.545455
t3 1.000000
b t1 1.000000
t2 1.062500
t3 0.937500
c t1 NaN
t2 NaN
t3 NaN
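The same reshape-then-divide idea can also be sketched in wide form with pivot, which sidesteps the stack(dropna=False) and transform('head', 1) calls in case they behave differently on newer pandas versions. This is a swapped-in variant rather than the exact steps above. The fixed result values mirror the intermediate table shown earlier, and the names wide and fold are illustrative.

```python
import numpy as np
import pandas as pd

# Same layout as the question, with fixed results for reproducibility.
df = pd.DataFrame({
    'time': np.repeat(['t1', 't2', 't3'], [2, 3, 3]),
    'id': ['a', 'b', 'a', 'b', 'c', 'a', 'b', 'c'],
    'result': [12, 13, 18, 11, 13, 13, 13, 17],
})

# One row per patient, one column per time point; missing cells become NaN.
wide = df.pivot(index='id', columns='time', values='result')

# Divide each row by its own t1 value; a missing t1 makes the whole row NaN.
fold = wide.div(wide['t1'], axis=0)
print(fold)
```

The result stays in wide form (patients as rows, time points as columns); reshaping it back to long form is left out because the stack keyword arguments have changed between pandas releases.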
Upvotes: 2