Pandas DataFrame: Compute values based on column min/max

Question

I have an NxN DataFrame with values I need to scale to a range of values that signify importance, where 0 is irrelevant and 3 is very important.

The formula I'm using to scale of course depends on the min and max values in each column, which are different for each column: Col A's range could be 1-12 while Col B's range could be 1M to 45M.

Here's the formula I'm using.

min_importance + ((max_importance - min_importance) / (max_spec_value - min_spec_value)) * (spec_value - min_spec_value)

How do I create a new DataFrame or dictionary with scaled values for each column, while retaining the index, which is needed later for identification?

I tried creating a function with the above formula, and using apply() to call the function for each row, but I can't pass column min/max to the function, so that doesn't work.

DataFrame sample ("Body: retail price" and "Body: sensor resolution" are columns):

                       Body: retail price  Body: sensor resolution  
Body name                                                            
Nikon D500                        2000.00                 20668416   
Nikon D7000                       1200.00                 16084992   
Sony Alpha 7R II                  3199.00                 42177408   
Canon EOS 5D Mark III             3499.00                 22118400   
Canon 7D Mark II                  1799.00                 19961856   
iPhone 6 (front)                   699.00                  1000000   
iPhone 6 (rear)                    699.00                  7990272   
Fujifilm X-T1                     1299.95                 15980544   
Fujifilm X-T2                     1599.00                 24000000

user2285236 · Accepted Answer

Min-max normalization can be done with:

(df - df.min()) / (df.max() - df.min())
Out: 
                       Body: retail price  Body: sensor resolution
Body name                                                         
Nikon D500                       0.464643                 0.477651
Nikon D7000                      0.178929                 0.366341
Sony Alpha 7R II                 0.892857                 1.000000
Canon EOS 5D Mark III            1.000000                 0.512864
Canon 7D Mark II                 0.392857                 0.460492
iPhone 6 (front)                 0.000000                 0.000000
iPhone 6 (rear)                  0.000000                 0.169760
Fujifilm X-T1                    0.214625                 0.363805
Fujifilm X-T2                    0.321429                 0.558559

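You don't need apply. df.min() returns a Series indexed by column name, and in df - df.min() pandas aligns that Series with the DataFrame's columns and subtracts each column's minimum from every value in that column. This is called broadcasting, and it makes the task much easier. The row index is left untouched, so you can still identify each row afterwards.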
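To make the alignment concrete, here is a minimal sketch on a small made-up frame (the column names "a" and "b", the row labels, and the values are illustrative assumptions, not data from the question). It also shows the column-wise apply equivalent for comparison; both should give the same result, with the broadcasting form being the shorter one.

```python
import pandas as pd

# Small made-up frame; names and values are illustrative only.
df = pd.DataFrame(
    {"a": [1.0, 5.0, 9.0], "b": [10.0, 20.0, 40.0]},
    index=["x", "y", "z"],
)

col_min = df.min()  # Series indexed by column label
col_max = df.max()

# The Series labels are matched against the DataFrame's column labels,
# so each column is shifted and divided by its own min and range.
scaled = (df - col_min) / (col_max - col_min)

# Same computation with apply, which passes one column (a Series) at a time.
scaled_apply = df.apply(lambda col: (col - col.min()) / (col.max() - col.min()))
```

In both versions the original index ("x", "y", "z") is carried over to the result.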

If you have different importance levels for each column, the best approach is to store them in a DataFrame indexed by the column names:

importances = pd.DataFrame({'max_imp': [1, 3], 'min_imp': [0, 0]}, index=df.columns)
importances
Out: 
                         max_imp  min_imp
Body: retail price             1        0
Body: sensor resolution        3        0

Using the same principle, you can adapt your formula:

importances['min_imp'] + ((importances['max_imp'] - importances['min_imp']) / (df.max() - df.min())) * (df - df.min())
Out: 
                       Body: retail price  Body: sensor resolution
Body name                                                         
Nikon D500                       0.464643                 1.432952
Nikon D7000                      0.178929                 1.099024
Sony Alpha 7R II                 0.892857                 3.000000
Canon EOS 5D Mark III            1.000000                 1.538591
Canon 7D Mark II                 0.392857                 1.381475
iPhone 6 (front)                 0.000000                 0.000000
iPhone 6 (rear)                  0.000000                 0.509280
Fujifilm X-T1                    0.214625                 1.091415
Fujifilm X-T2                    0.321429                 1.675676

Note that the index of importances must match the columns of the actual DataFrame, because pandas aligns the two by label. In this example, the first column is scaled to the range [0, 1] and the second to [0, 3].
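If you want to reuse this, the formula can be wrapped in a small function. The following is only a sketch: scale_to_importance is a hypothetical helper name, the explicit label check is my own addition, and the sample data is made up. It assumes importances has 'min_imp' and 'max_imp' columns as above.

```python
import pandas as pd

def scale_to_importance(df, importances):
    """Hypothetical helper: scale each column of df to [min_imp, max_imp]."""
    # Fail early if a column has no importance range, instead of
    # letting label alignment produce a confusing result.
    missing = df.columns.difference(importances.index)
    if len(missing) > 0:
        raise KeyError(f"no importance range for columns: {list(missing)}")
    imp = importances.loc[df.columns]  # same order as the DataFrame's columns
    # Per-column factor: (max_imp - min_imp) / (column max - column min)
    factor = (imp["max_imp"] - imp["min_imp"]) / (df.max() - df.min())
    return (df - df.min()) * factor + imp["min_imp"]

# Made-up example data; the importances rows are deliberately in a different order.
sample = pd.DataFrame(
    {"price": [699.0, 2099.0, 3499.0], "res": [1.0, 2.0, 5.0]},
    index=["p", "q", "r"],
)
sample_imp = pd.DataFrame(
    {"max_imp": [3, 1], "min_imp": [0, 0]}, index=["res", "price"]
)
result = scale_to_importance(sample, sample_imp)
```

One caveat: a column whose values are all identical has a range of zero, so the division will not give a usable number for it. You would need to decide separately how such columns should be scored.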
