Reputation: 1267
I have a dataframe like:
TOTAL | Name
3232 Jane
382 Jack
8291 Jones
I'd like to create a new scaled column in the dataframe called SIZE, where SIZE is a number between 10 and 50.
For Example:
TOTAL | Name | SIZE
3232 Jane 24.413
382 Jack 10
8291 Jones 50
I've tried
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
scaler=MinMaxScaler(feature_range=(10,50))
df["SIZE"]=scaler.fit_transform(df["TOTAL"])
but got the error: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I've tried other things, such as creating a list, transforming it, and appending it back to the dataframe, among other things.
What is the easiest way to do this?
Thanks!
Upvotes: 13
Views: 35912
Reputation: 1754
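You can use minmax_scale to normalize a column. It accepts a 1D input directly, and feature_range sets the target range (the default is 0 to 1):
from sklearn.preprocessing import minmax_scale
df['SIZE'] = minmax_scale(df['TOTAL'], feature_range=(10, 50))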
Upvotes: 0
Reputation: 1
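I have used this function several times; you can use it to normalize your dataset:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def standardize_function(X_train):
    # Despite the name, this applies min-max scaling (default range 0 to 1) to every column
    scaled = MinMaxScaler().fit_transform(X_train)
    return pd.DataFrame(scaled, columns=X_train.columns, index=X_train.index)

X_train = standardize_function(X_train)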
You can try it out and see if it helps
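Note that scaling every column like this fails on a frame that mixes numbers and strings, such as the one in the question, because MinMaxScaler cannot handle the Name column. A minimal sketch, assuming you only want the numeric columns scaled (the select_dtypes step and the variable names here are my own additions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data from the question
df = pd.DataFrame({"TOTAL": [3232, 382, 8291], "Name": ["Jane", "Jack", "Jones"]})

# Pick only the numeric columns; string columns such as Name are left alone
numeric_cols = df.select_dtypes(include="number").columns

scaled = df.copy()
scaled[numeric_cols] = MinMaxScaler(feature_range=(10, 50)).fit_transform(df[numeric_cols])
```

The original df is left unchanged, and scaled keeps Name as it was.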
Upvotes: 0
Reputation: 1339
In case you want to scale only one column in the dataframe, you have to reshape the column values as follows:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(10, 50))  # the default range is (0, 1); the question asks for 10 to 50
df['SIZE'] = scaler.fit_transform(df['TOTAL'].values.reshape(-1,1))
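Here is a fuller sketch of the reshape approach, assuming the sample data from the question. The ravel() call and the reuse of the fitted scaler on a new value are my additions, and 4336.5 is a made-up TOTAL chosen only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data from the question
df = pd.DataFrame({"TOTAL": [3232, 382, 8291], "Name": ["Jane", "Jack", "Jones"]})

scaler = MinMaxScaler(feature_range=(10, 50))

# reshape(-1, 1) turns the 1D column into a 2D array with a single feature
scaled = scaler.fit_transform(df["TOTAL"].to_numpy().reshape(-1, 1))

# fit_transform returns an array of shape (n, 1); ravel() flattens it back to 1D
df["SIZE"] = scaled.ravel()

# The fitted scaler remembers the min and max, so it can be reused on new values
new_size = scaler.transform([[4336.5]])[0, 0]
```

Keeping the scaler object around is the main reason to prefer the class over a one-off calculation: later data can be mapped with the same min and max.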
Upvotes: 6
Reputation: 402483
Option 1
sklearn
This problem comes up time and time again, and the error message tells you what to do: the scaler expects 2D input, but df["TOTAL"] is a 1D Series. Change df["TOTAL"] to df[["TOTAL"]], which selects a one-column DataFrame instead.
df['SIZE'] = scaler.fit_transform(df[["TOTAL"]])
df
TOTAL Name SIZE
0 3232 Jane 24.413959
1 382 Jack 10.000000
2 8291 Jones 50.000000
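To see why the double brackets matter, compare the shapes of the two selections. This is a small sketch using the sample data from the question; the variable names are just for illustration:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"TOTAL": [3232, 382, 8291], "Name": ["Jane", "Jack", "Jones"]})

series_shape = df["TOTAL"].shape      # single brackets: a 1D Series
frame_shape = df[["TOTAL"]].shape     # double brackets: a 2D one-column DataFrame
reshaped_shape = df["TOTAL"].to_numpy().reshape(-1, 1).shape  # same 2D shape via reshape

print(series_shape)    # (3,)
print(frame_shape)     # (3, 1)
print(reshaped_shape)  # (3, 1)
```

Either 2D form gives the scaler the (n_samples, n_features) layout that the error message asks for.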
Option 2
pandas
Preferably, I would bypass sklearn and just do the min-max scaling myself.
a, b = 10, 50
x, y = df.TOTAL.min(), df.TOTAL.max()
df['SIZE'] = (df.TOTAL - x) / (y - x) * (b - a) + a
df
TOTAL Name SIZE
0 3232 Jane 24.413959
1 382 Jack 10.000000
2 8291 Jones 50.000000
This is essentially what the min-max scaler does, but without the overhead of importing scikit-learn. It's a heavy library, so don't import it unless you have to.
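If you need this in more than one place, the formula wraps up easily into a helper. This is only a sketch: the name scale_to_range is hypothetical, and returning the lower bound for a constant column (where max equals min and the formula would divide by zero) is an arbitrary choice on my part.

```python
import pandas as pd


def scale_to_range(values: pd.Series, low: float, high: float) -> pd.Series:
    """Linearly map values so that the minimum becomes low and the maximum becomes high."""
    vmin, vmax = values.min(), values.max()
    if vmin == vmax:
        # All values are identical, so the formula would divide by zero;
        # this sketch returns low everywhere in that case
        return pd.Series(float(low), index=values.index)
    return (values - vmin) / (vmax - vmin) * (high - low) + low


# Sample data from the question
df = pd.DataFrame({"TOTAL": [3232, 382, 8291], "Name": ["Jane", "Jack", "Jones"]})
df["SIZE"] = scale_to_range(df["TOTAL"], 10, 50)
```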
Upvotes: 28