Reputation: 1017
I would like to normalize a training and test data set using MinMaxScaler in sklearn.preprocessing. However, the package does not appear to be accepting my test data set.
import pandas as pd
import numpy as np
# Read in data.
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
'Alcalinity of ash', 'Magnesium', 'Total phenols',
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
'Proline']
# Split into train/test data.
from sklearn.model_selection import train_test_split
X = df_wine.iloc[:, 1:].values
y = df_wine.iloc[:, 0].values
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3,
random_state = 0)
# Normalize features using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
When executing this, I get a DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
This comes along with a ValueError: operands could not be broadcast together with shapes (124,) (13,) (124,).
Reshaping the data still yields an error.
X_test_norm = mms.transform(X_test.reshape(-1, 1))
This reshaping raises ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13).
Any input on how to fix this error would be helpful.
Upvotes: 4
Views: 22565
Reputation: 29711
The variables on the left-hand side must be listed in the same order in which train_test_split() returns its results: for each input array, in the order the arrays were passed, the train part followed by the test part.
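When the order was specified as X_train, y_train, X_test, y_test, the contents of y_train and X_test got swapped: y_train received the test features (len(y_train)=54) and X_test received the training labels (len(X_test)=124), a 1-D array of shape (124,). Passing that 1-D array to a scaler fitted on 13 features is what raised the ValueError.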
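The swap described above can be checked by looking at the shapes directly. The following is a minimal sketch, assuming zero-filled placeholder arrays with the same dimensions as the wine data (178 samples, 13 features) rather than the real data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays (an assumption): same shapes as the wine features and labels.
X = np.zeros((178, 13))
y = np.zeros(178)

# For inputs (X, y) the function returns X_train, X_test, y_train, y_test.
parts = train_test_split(X, y, test_size=0.3, random_state=0)
shapes = [p.shape for p in parts]
print(shapes)  # [(124, 13), (54, 13), (124,), (54,)]

# Unpacking in the question's order binds the names to the wrong arrays.
X_train, y_train, X_test, y_test = parts
print(X_test.shape)   # (124,): these are really the training labels
print(y_train.shape)  # (54, 13): these are really the test features
```

Printing the shapes right after the split is a cheap way to catch this kind of mix-up before the arrays reach a scaler or an estimator.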
Instead, unpack the result in the matching order:
# Split into train/test data.
# The outputs come in pairs, in the same order as the inputs:
# (X_train, X_test) from X, then (y_train, y_test) from y.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# (or)
# y_train, y_test, X_train, X_test = train_test_split(y, X, test_size=0.3, random_state=0)
# Normalize features using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
produces:
X_train_norm[0]
array([ 0.72043011, 0.20378151, 0.53763441, 0.30927835, 0.33695652,
0.54316547, 0.73700306, 0.25 , 0.40189873, 0.24068768,
0.48717949, 1. , 0.5854251 ])
X_test_norm[0]
array([ 0.72849462, 0.16386555, 0.47849462, 0.29896907, 0.52173913,
0.53956835, 0.74311927, 0.13461538, 0.37974684, 0.4364852 ,
0.32478632, 0.70695971, 0.60566802])
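To illustrate why the scaler is fitted on the training set and then only applied to the test set, here is a small sketch. It assumes randomly generated placeholder data instead of the wine features, and the names col_min, col_max and manual_test are introduced only for this illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder data (an assumption), shaped like the corrected split above.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(124, 13))
X_test = rng.normal(size=(54, 13))

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)  # learns per-column min and max from the training data
X_test_norm = mms.transform(X_test)        # reuses the training min and max

# Manual equivalent: scale the test data with the training statistics.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual_test = (X_test - col_min) / (col_max - col_min)

print(np.allclose(X_test_norm, manual_test))  # True
```

Because the test set is scaled with the training minimum and maximum, its values are not guaranteed to fall inside [0, 1], whereas each column of the scaled training set spans that range.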
Upvotes: 4