Carlos Carvalho
Carlos Carvalho

Reputation: 137

Min Max Scaler on parts of data

I have a dataframe like this:

### To Stack

df_stack = pd.DataFrame([['SGA', 'FT', 5.6994352340698242],
   ['SGA', 'ST', 1.518241786956787],
   ['EXT', 'FT', 20.853669309616087],
   ['EXT', 'ST', 3.5596847057342529],
   ['LGT', 'FT', 9.853669309616087],
   ['LGT', 'ST', 8.5596847057342529]], columns = (('idobj','description','value')))

df_stack

    idobj   description value
    SGA FT         5.69944
    SGA ST         1.51824
    EXT FT        20.85367
    EXT ST         3.55968
    LGT FT         9.85367
    LGT ST         8.5596

I need to apply min-max scaling on parts of data (e.g., apply MinMaxScaler on 'Description'='ST', then apply MinMaxScaler on 'Description'='ST').

When I apply MinMaxScaler for each "Description" this way:

scaler.fit(df_stack[(df_stack['description']=='FT')]['value'].values.reshape(-1,1))
scaler.transform(df_stack[(df_stack['description']=='FT')]['value'].values.reshape(-1,1))

array([[0.     ],
       [1.     ],
       [0.27413]])

scaler.fit(df_stack[(df_stack['description']=='ST')]['value'].values.reshape(-1,1))
scaler.transform(df_stack[(df_stack['description']=='ST')]['value'].values.reshape(-1,1))

array([[0.     ],
       [0.28992],
       [1.     ]])

I end up losing the correspondence with my "Description".

I've tried to manually compute by adding a new column like this:

df_stack['description_norm'] = (df_stack[(df_stack['description']=='FT')]['value']-\
                              df_stack[(df_stack['description']=='FT')]['value'].min())/\
                            (df_stack[(df_stack['description']=='FT')]['value'].max()-\
                             df_stack[(df_stack['description']=='FT')]['value'].min())

But then I get another column:

idobj   description value   description_norm
0   SGA FT  5.69944        0.00000
1   SGA ST  1.51824        NaN
2   EXT FT  20.85367       1.00000
3   EXT ST  3.55968        NaN
4   LGT FT  9.85367        0.27413
5   LGT ST  8.55968        NaN

And when I try to assign the values on this new column to the variable 'Description' based on the condition 'FT' it simply doesn't work?!?

df_stack[(df_stack['description']=='FT')]['value']=df_stack[(df_stack['description']=='FT')]['description_norm']

What am I missing?

The expected result should be :

    idobj   description value
    SGA FT         0
    SGA ST         0
    EXT FT         1
    EXT ST         0.28992
    LGT FT         0.27413
    LGT ST         1

Any ideas are welcome.

Upvotes: 2

Views: 473

Answers (1)

Shaido
Shaido

Reputation: 28322

What you want to do is applying a function on a column based on the value of another column, in pandas the best way to do this is by using groupby.

In this case, we can do the scaling manually like this:

df_stack['value'] = df_stack.groupby('description')['value'].\
    apply(lambda x: (x-x.min())/(x.max()-x.min()))

or by using MinMaxScaler:

df_stack['value'] = df_stack.groupby('description')['value'].\
    transform(lambda x: MinMaxScaler().fit_transform(x.values.reshape(-1,1)).flatten())

Both will give the same result:

  idobj description     value
0   SGA          FT  0.000000
1   SGA          ST  0.000000
2   EXT          FT  1.000000
3   EXT          ST  0.289918
4   LGT          FT  0.274130
5   LGT          ST  1.000000

Upvotes: 2

Related Questions