Reputation: 1116
I have a data frame with many columns. I want to apply MinMaxScaler (from sklearn.preprocessing import MinMaxScaler) to the data frame.
However, I previously merged two dataframes into one and added a 'type' column to tell them apart. So the 'type' column has two values, 'train' and 'test'.
Now when I try to apply the scaler to it, it throws an error saying the values should be numbers. I get that.
So I wrote the code like this:
scalar = MinMaxScaler()
data = pd.DataFrame(scalar.fit_transform(data), columns=data.columns.drop("type"))
data.head()
This means I pass only the specified columns to it, right? But I'm still getting the same error.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-216-85712d22d323> in <module>
1 scalar = MinMaxScaler()
----> 2 data = pd.DataFrame(scalar.fit_transform(data), columns=data.columns.drop("type"))
3
4 data.head()
5
ValueError: could not convert string to float: 'train'
All I want is to apply the transform to all the columns except that specific "type" column. Is there a way to do that with Pandas? If not, what are the alternatives?
I'm new to this, so please explain your solutions in a bit more detail.
Also I noticed all methods of sklearn are made to work with numpy and not pandas dataframe. Everything takes an array as argument rather than a dataframe. Is it possible to send in data frames as inputs?
Upvotes: 0
Views: 435
Reputation: 5164
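You are still applying the MinMaxScaler to the whole dataframe with scalar.fit_transform(data). The columns argument only names the columns of the new dataframe; it does not select what gets scaled.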
One way to cope with this issue would be to:
1. create a copy df without the 'type' column
2. scale df with MinMaxScaler and wrap it in a pandas dataframe again
3. add the 'type' column back to df
Here is the corresponding code:
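from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Copy of the data without the non-numeric 'type' column
df = data.drop("type", axis=1)
# Scale, then wrap the returned numpy array in a dataframe again
df = pd.DataFrame(
    MinMaxScaler().fit_transform(df),
    columns=df.columns,
    index=df.index,  # keep the original row labels so 'type' lines up
)
# Add the 'type' column back
df["type"] = data["type"]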
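If you would rather not drop and re-attach the column, a variant is to select the numeric columns and assign the scaled values back to just those columns. This is only a sketch: the toy dataframe and the names num_cols and scaled are made up for illustration and stand in for your merged data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for the merged dataframe from the question
data = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [10.0, 20.0, 30.0, 50.0],
        "type": ["train", "train", "test", "test"],
    },
    index=[10, 11, 12, 13],
)

# Every column except 'type'
num_cols = data.columns.drop("type")

# Work on a copy so the original data stays untouched
scaled = data.copy()
# Only the numeric columns are passed to the scaler and overwritten
scaled[num_cols] = MinMaxScaler().fit_transform(data[num_cols])

print(scaled)
```

The 'type' column and the row index are left as they were, because only the selected columns are overwritten.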
Regarding your question about pandas dataframes as input: in general this is no problem. The scikit-learn API accepts array-like objects, which include pandas dataframes. Note, however, that the return type (the output) is typically a numpy array, which may need to be taken care of (as in step 2 above).
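To see the input/output behaviour for yourself, here is a small sketch with a made-up dataframe (the names frame, result and wrapped are just for illustration). It assumes scikit-learn's default output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric-only dataframe
frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 5.0, 10.0]})

# A dataframe is accepted directly as input
result = MinMaxScaler().fit_transform(frame)

# With the default configuration the result is a numpy array,
# so column names and index are gone
print(type(result))

# Wrap it again to get the labels back
wrapped = pd.DataFrame(result, columns=frame.columns, index=frame.index)
print(wrapped)
```

As far as I know, newer scikit-learn releases (1.2 and later) also let you call set_output(transform="pandas") on a transformer so it returns a dataframe directly, but check the documentation for the version you have installed.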
Upvotes: 1