physicsmichael
physicsmichael

Reputation: 4963

How to normalize by another row in a pandas DataFrame?

I'm having trouble finding the non-for loop version of this. Say this is my input:

In [94]: df
Out[94]: 
        N  experiment   color  value    value2
0  145000           0     red   0.30  0.363953
1   14000           1     red   0.31  0.218978
2   13000           2     red   0.29  0.948070
3   15000           0  yellow   0.31  0.620201
4    1200           1  yellow   0.32  0.567513
5    1400           2  yellow   0.31  0.318197
6   40000           0   green   0.29  0.947226
7    3000           1   green   0.31  0.084243
8    7000           2   green   0.32  0.961020

[9 rows x 5 columns]

Experiment 0 is my control. I performed this experiment for various colors. I want to normalize all rows by the matching color experiment 0.

In [104]: df
Out[104]: 
        N  experiment   color  value    value2  scaled_value  scaled_value2
0  145000           0     red   0.30  0.363953      1.000000       1.000000
1   14000           1     red   0.31  0.218978      1.033333       0.590786
2   13000           2     red   0.29  0.948070      0.966667       2.604732
3   15000           0  yellow   0.31  0.620201      1.000000       1.000000
4    1200           1  yellow   0.32  0.567513      1.032258       0.914220
5    1400           2  yellow   0.31  0.318197      1.000000       0.512737
6   40000           0   green   0.29  0.947226      1.000000       1.000000
7    3000           1   green   0.31  0.084243      1.068966       0.088680
8    7000           2   green   0.32  0.961020      1.103448       1.014541

[9 rows x 7 columns]

It seems I would want to exact the experiment zero parameters to their own column to permit easy division, but I can't get the stacking/pivoting/concat'ing to work properly. I did this with for loops using .at and assigning values but it felt so wrong.

The next steps would be normalizing several columns and creating upplower and lower error bounds calculated from (N, value), (N, value2), (N, other_values) etc.

Upvotes: 6

Views: 101

Answers (1)

DSM
DSM

Reputation: 353119

One way would be to use transform (here using idxmin, although there are many alternatives) to get the indices of the rows we want to use as the denominator:

>>> ii = df.groupby("color")["experiment"].transform("idxmin")
>>> cols = ["value", "value2"]
>>> new_cols = (df.loc[:,cols] /df.loc[ii, cols].values)
>>> df.join(new_cols.rename(columns=lambda x: "scaled_" + x))
        N  experiment   color  value    value2  scaled_value  scaled_value2
0  145000           0     red   0.30  0.363953      1.000000       1.000000
1   14000           1     red   0.31  0.218978      1.033333       0.601666
2   13000           2     red   0.29  0.948070      0.966667       2.604924
3   15000           0  yellow   0.31  0.620201      1.000000       1.000000
4    1200           1  yellow   0.32  0.567513      1.032258       0.915047
5    1400           2  yellow   0.31  0.318197      1.000000       0.513055
6   40000           0   green   0.29  0.947226      1.000000       1.000000
7    3000           1   green   0.31  0.084243      1.068966       0.088937
8    7000           2   green   0.32  0.961020      1.103448       1.014563

Step by step, first we find the denominator indices:

>>> ii = df.groupby("color")["experiment"].transform("idxmin")
>>> ii
0    0
1    0
2    0
3    3
4    3
5    3
6    6
7    6
8    6
dtype: int64

Then we can use this to index into the frame:

>>> df.loc[ii, cols]
   value    value2
0   0.30  0.363953
0   0.30  0.363953
0   0.30  0.363953
3   0.31  0.620201
3   0.31  0.620201
3   0.31  0.620201
6   0.29  0.947226
6   0.29  0.947226
6   0.29  0.947226

Since we're handling the alignment ourselves, we need to call .values to drop down to the underlying array-- otherwise pandas will try to outsmart us and align things correctly based on the indices.

Then we divide:

>>> (df.loc[:,cols] /df.loc[ii, cols].values)
      value    value2
0  1.000000  1.000000
1  1.033333  0.601666
2  0.966667  2.604924
3  1.000000  1.000000
4  1.032258  0.915047
5  1.000000  0.513055
6  1.000000  1.000000
7  1.068966  0.088937
8  1.103448  1.014563

and finally join them, renaming the new columns:

>>> df.join(new_cols.rename(columns=lambda x: "scaled_" + x))
        N  experiment   color  value    value2  scaled_value  scaled_value2
0  145000           0     red   0.30  0.363953      1.000000       1.000000
1   14000           1     red   0.31  0.218978      1.033333       0.601666
2   13000           2     red   0.29  0.948070      0.966667       2.604924
3   15000           0  yellow   0.31  0.620201      1.000000       1.000000
4    1200           1  yellow   0.32  0.567513      1.032258       0.915047
5    1400           2  yellow   0.31  0.318197      1.000000       0.513055
6   40000           0   green   0.29  0.947226      1.000000       1.000000
7    3000           1   green   0.31  0.084243      1.068966       0.088937
8    7000           2   green   0.32  0.961020      1.103448       1.014563

Upvotes: 4

Related Questions