Reputation: 335
Scaling the target variable with the usual StandardScaler workflow raises an error. The error goes away if I first add the line y = y.reshape(-1,1)
; after that, calling fit_transform on the target variable returns the standardized values. I cannot figure out why adding y.reshape(-1,1)
made it work.
X is the independent variable with a single feature, and y is the numerical target variable 'Salary'. I was trying to apply Support Vector Regression to the problem, which needs explicit feature scaling. I tried the following code:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
It gave me an error like:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
After I made the following changes:
X = sc_X.fit_transform(X)
y = y.reshape(-1,1)
y = sc_y.fit_transform(y)
The standardization worked just fine. I need to understand why adding y = y.reshape(-1,1)
made it work.
Thanks.
Upvotes: 1
Views: 2666
Reputation: 2895
This comes up a lot in SKLearn.
From the docs of the scaler's .transform
function, the input to .transform
has to be a 2D matrix where the second dimension is the number of features:
Perform standardization by centering and scaling
Parameters: X : array-like, shape [n_samples, n_features] The data used to scale along the features axis.
Now, the last dimension has to be explicitly set to 1, not missing. Before the data is reshaped (i.e. y=y.reshape(-1,1)
), the last dimension is missing - see this example:
import numpy as np
a = np.array([0, 0, 0])
print(a)        # [0 0 0]
print(a.shape)  # (3,)
b = a.reshape(-1, 1)
print(b)        # [[0]
                #  [0]
                #  [0]]
print(b.shape)  # (3, 1)
The reshape method changes the shape of an array: for example, if a is an array with 6 elements (of whatever shape), a.reshape(3,2)
changes its shape to 3-by-2.
The -1 argument basically means "infer whatever dimension is needed here so that the data fits".
So, a.reshape(-1,1)
turns an array with n elements into an n-by-1 2D array, without explicitly specifying n.
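Putting the two pieces together, here is a minimal sketch of the fix from the question, using a made-up salary array (the values are illustrative, not from the original data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 1-D salary target, standing in for y in the question
y = np.array([45000.0, 50000.0, 60000.0, 80000.0, 110000.0])

sc_y = StandardScaler()
# sc_y.fit_transform(y) would raise:
#   ValueError: Expected 2D array, got 1D array instead
y_2d = y.reshape(-1, 1)            # shape (5,) -> (5, 1): 5 samples, 1 feature
y_scaled = sc_y.fit_transform(y_2d)

print(y_2d.shape)       # (5, 1)
print(y_scaled.mean())  # ~0.0 after standardization
print(y_scaled.std())   # ~1.0 after standardization
```

The reshape costs nothing (it is just a view with an explicit second dimension of 1), but it tells the scaler that these are 5 samples of 1 feature rather than an ambiguous 1-D sequence.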
Upvotes: 1
Reputation: 4779
In short, yes, you need to reshape it. As per the sklearn documentation, fit_transform
expects the X
or predictor variables to consist of n_samples
with n_features
, which makes sense given what it is used for. If you supply only a 1-D array, this function would read it as 1 sample of n_features
. Perhaps the code below will make this clearer:
In [1]: x_arr
Out[1]: array([1, 2, 3, 4, 5])  # would be read as 1 sample with 5 features
In [2]: x_arr.reshape(-1, 1)
Out[2]:
array([[1],   # 1st sample
       [2],   # 2nd sample
       [3],   # 3rd sample
       [4],   # 4th sample
       [5]])  # 5th sample
Anyway, regarding how you use the StandardScaler
(unrelated to why your code produces the error, which is answered above): you want to use the same StandardScaler
throughout your data. Generally speaking, scaling the target variable isn't necessary, since it is the variable you want to predict, not a predictor (assuming y
in your code is the target variable).
First, fit the scaler on your training data so that its mean and standard deviation are stored and can be reused later to scale the test data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Here the scaler will learn the mean and std of train data
x_train_scaled = scaler.fit_transform(x_train)
# Use here to transform test data
# This ensures both the train and test data are in the same scale
x_test_scaled = scaler.transform(x_test)
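To see that the fitted scaler really does reuse the training statistics on the test set, here is a small self-contained sketch with made-up train/test arrays (the values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train/test split
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_test = np.array([[5.0], [6.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learns mean and std from train only
x_test_scaled = scaler.transform(x_test)        # applies those same statistics

print(scaler.mean_)    # [2.5] - mean of the training data
# The test data is scaled with the *train* mean/std, so it is not
# zero-mean itself; that is exactly the behavior you want.
print(x_test_scaled)
```

If test data were scaled with its own statistics instead, the model would see train and test features on different scales, which silently degrades predictions.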
Hope this helps!
Upvotes: 3