Reputation: 98
I have an NxN DataFrame with values I need to scale to a range of values that signify importance, where 0 is irrelevant and 3 is very important.
The formula I'm using to scale of course depends on the min and max values in each column, which are different for each column: Col A's range could be 1-12 while Col B's range could be 1M to 45M.
Here's the formula I'm using.
min_importance + ((max_importance - min_importance) / (max_spec_value - min_spec_value)) * (spec_value - min_spec_value)
How do I create a new DataFrame or dictionary with scaled values for each column, while retaining the index, which is needed later for identification?
I tried creating a function with the above formula, and using apply() to call the function for each row, but I can't pass column min/max to the function, so that doesn't work.
DataFrame sample ("Body: retail price" and "Body: sensor resolution" are columns):
                       Body: retail price  Body: sensor resolution
Body name
Nikon D500                        2000.00                 20668416
Nikon D7000                       1200.00                 16084992
Sony Alpha 7R II                  3199.00                 42177408
Canon EOS 5D Mark III             3499.00                 22118400
Canon 7D Mark II                  1799.00                 19961856
iPhone 6 (front)                   699.00                  1000000
iPhone 6 (rear)                    699.00                  7990272
Fujifilm X-T1                     1299.95                 15980544
Fujifilm X-T2                     1599.00                 24000000
Upvotes: 2
Views: 1274
Min-max normalization of every column can be done with:
(df - df.min()) / (df.max() - df.min())
Out:
                       Body: retail price  Body: sensor resolution
Body name
Nikon D500                       0.464643                 0.477651
Nikon D7000                      0.178929                 0.366341
Sony Alpha 7R II                 0.892857                 1.000000
Canon EOS 5D Mark III            1.000000                 0.512864
Canon 7D Mark II                 0.392857                 0.460492
iPhone 6 (front)                 0.000000                 0.000000
iPhone 6 (rear)                  0.000000                 0.169760
Fujifilm X-T1                    0.214625                 0.363805
Fujifilm X-T2                    0.321429                 0.558559
You don't need apply. df.min() returns a Series indexed by column name, and in df - df.min() pandas subtracts each column's minimum from every value in that column. This label-aligned operation is called broadcasting, and it is what makes the task easy.
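To make the alignment concrete, here is a minimal sketch using four rows from the question's sample (chosen so that both columns keep their original minimum and maximum). The variable names are illustrative, not from the original post. It also shows that a column-wise apply gives the same result, since apply hands each column to the function as a Series.

```python
import pandas as pd

# Four rows from the question's sample; "Body name" is the index.
df = pd.DataFrame(
    {
        "Body: retail price": [2000.00, 3199.00, 3499.00, 699.00],
        "Body: sensor resolution": [20668416, 42177408, 22118400, 1000000],
    },
    index=pd.Index(
        ["Nikon D500", "Sony Alpha 7R II", "Canon EOS 5D Mark III", "iPhone 6 (front)"],
        name="Body name",
    ),
)

col_min = df.min()  # Series indexed by column label
col_max = df.max()

# DataFrame - Series matches the Series index against the DataFrame's columns,
# so each column is shifted by its own minimum and divided by its own range.
normalized = (df - col_min) / (col_max - col_min)

# The same computation written with explicit method calls ...
explicit = df.sub(col_min, axis="columns").div(col_max - col_min, axis="columns")

# ... and with a column-wise apply (default axis=0 passes one column at a time).
via_apply = df.apply(lambda col: (col - col.min()) / (col.max() - col.min()))

print(normalized)
```

In all three variants the row index is carried through unchanged, so the scaled values can still be looked up by body name.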
If you have different importance levels for each column, the best thing to do is to store them in a DataFrame indexed by the column names:
importances = pd.DataFrame({'max_imp': [1, 3], 'min_imp': [0, 0]}, index=df.columns)
importances
Out:
                         max_imp  min_imp
Body: retail price             1        0
Body: sensor resolution        3        0
Now with the same principle, you can adjust your formula:
importances['min_imp'] + ((importances['max_imp'] - importances['min_imp']) / (df.max() - df.min())) * (df - df.min())
Out:
                       Body: retail price  Body: sensor resolution
Body name
Nikon D500                       0.464643                 1.432952
Nikon D7000                      0.178929                 1.099024
Sony Alpha 7R II                 0.892857                 3.000000
Canon EOS 5D Mark III            1.000000                 1.538591
Canon 7D Mark II                 0.392857                 1.381475
iPhone 6 (front)                 0.000000                 0.000000
iPhone 6 (rear)                  0.000000                 0.509280
Fujifilm X-T1                    0.214625                 1.091415
Fujifilm X-T2                    0.321429                 1.675676
Note that the index of importances and the columns of the actual DataFrame must match, because pandas aligns the two by label. In this example, the first column is rescaled to the range [0, 1] and the second column to [0, 3].
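If you want to reuse this, the formula can be wrapped in a small helper. The sketch below is one possible way to do it, not the only one: the function name scale_to_importance is hypothetical, and the reindex check is an added safeguard for the label-matching requirement described above. It assumes no column is constant (a zero range would divide by zero).

```python
import pandas as pd

def scale_to_importance(df, importances):
    """Rescale each column of df to its own [min_imp, max_imp] range.

    importances must have one row per column of df, with the columns
    'min_imp' and 'max_imp'. The row index of df is preserved.
    """
    # Reorder the bounds to match df's columns; a missing label becomes NaN.
    imp = importances.reindex(df.columns)
    if imp[["min_imp", "max_imp"]].isna().any().any():
        raise KeyError("importances is missing bounds for some columns")

    col_min = df.min()
    span = df.max() - col_min  # assumed non-zero for every column
    scale = (imp["max_imp"] - imp["min_imp"]) / span

    # All Series here are indexed by column label, so pandas aligns them
    # with df's columns automatically.
    return (df - col_min) * scale + imp["min_imp"]

# Four rows from the question's sample, keeping each column's min and max.
df = pd.DataFrame(
    {
        "Body: retail price": [2000.00, 3199.00, 3499.00, 699.00],
        "Body: sensor resolution": [20668416, 42177408, 22118400, 1000000],
    },
    index=pd.Index(
        ["Nikon D500", "Sony Alpha 7R II", "Canon EOS 5D Mark III", "iPhone 6 (front)"],
        name="Body name",
    ),
)
importances = pd.DataFrame({"max_imp": [1, 3], "min_imp": [0, 0]}, index=df.columns)

scaled = scale_to_importance(df, importances)
print(scaled)
```

To scale every column to the same 0 to 3 range, as in the question, set min_imp to 0 and max_imp to 3 for all columns.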
Upvotes: 2