How to apply transform method to partial data frame

Question

I have a data frame with many columns. I want to apply MinMaxScaler (from sklearn.preprocessing import MinMaxScaler) to the data frame.

However, I previously merged two dataframes into one and added a 'type' column to tell them apart. So the 'type' column has two values, 'train' and 'test'.

Now when I try to apply the scaler to it, it throws an error saying the values should be numbers. I get that.

So I wrote the code like this:

scalar = MinMaxScaler()
data = pd.DataFrame(scalar.fit_transform(data), columns=data.columns.drop("type"))

data.head()

This means I'm passing only the specified columns to it, right? But I'm still getting the same error.

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

 in 
      1 scalar = MinMaxScaler()
----> 2 data = pd.DataFrame(scalar.fit_transform(data), columns=data.columns.drop("type"))
      3 
      4 data.head()
      5 

ValueError: could not convert string to float: 'train'

All I want is to apply the transform to all the columns except that specific "type" column. Is there a way to do that with Pandas? If not, what are the alternatives?

I'm new to this, so please explain your solutions in a bit more detail.

Also, I noticed that sklearn's methods seem to be made to work with numpy arrays rather than pandas dataframes. Everything takes an array as an argument rather than a dataframe. Is it possible to pass dataframes as inputs?

afsharov · Accepted Answer

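You are still applying the MinMaxScaler to the whole dataframe with scalar.fit_transform(data). The columns=data.columns.drop("type") argument only labels the columns of the resulting dataframe; it does not change what is passed to the scaler, so the 'type' column with its strings is still in the input.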

One way to cope with this issue would be to:

1. create a new dataframe df without the 'type' column
2. scale df with MinMaxScaler and wrap it in a pandas dataframe again
3. add the 'type' column of the original dataframe to df

Here is the corresponding code:

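from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# 1. Keep only the numeric columns
df = data.drop("type", axis=1)

# 2. Scale them; fit_transform returns a numpy array, so wrap it again.
#    Reusing the index keeps the rows aligned with the original dataframe.
df = pd.DataFrame(
    MinMaxScaler().fit_transform(df),
    columns=df.columns,
    index=df.index,
)

# 3. Put the 'type' column back
df["type"] = data["type"]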
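As a variation on the steps above (a minimal sketch, not the only way to do it), you can also scale the numeric columns in place and leave 'type' untouched. The sample data and the column names a and b are made up for illustration; only the 'type' column comes from the question.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the merged dataframe from the question
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 5.0],
    "b": [10.0, 20.0, 30.0, 50.0],
    "type": ["train", "train", "test", "test"],
})

# Every column except 'type' is treated as a numeric feature
num_cols = data.columns.drop("type")

# Pass only the numeric columns to the scaler and write the result
# back into the same columns; 'type' is never seen by the scaler
data[num_cols] = MinMaxScaler().fit_transform(data[num_cols])
```

Here the selection data[num_cols] is what restricts the scaler's input, which is the part the code in the question was missing.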

Regarding your question about pandas dataframes as input: in general, this is no problem. The scikit-learn API accepts array-like objects, which include pandas dataframes. Note, however, that the return value is typically a numpy array, so you may need to wrap it in a dataframe again (as in step 2 above).
