Sri2110

Reputation: 335

Scaling the target variable is giving error in Python using StandardScaler of Sklearn library

Scaling the target variable the usual way, with the StandardScaler class, raises an error. The error goes away once I add the line y = y.reshape(-1,1); after that, calling fit_transform on the target variable returns the standardized values. I cannot figure out why adding y.reshape(-1,1) makes it work.

X is the independent variable, with a single feature, and y is the numerical target variable 'Salary'. I was trying to apply Support Vector Regression to the problem, which needs explicit feature scaling. I tried the following code:

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()

X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)

It gave me this error:

ValueError: Expected 2D array, got 1D array instead
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

After I made the following changes:

X = sc_X.fit_transform(X)
y = y.reshape(-1,1)
y = sc_y.fit_transform(y)

The standardization worked just fine. I would like to understand why adding y = y.reshape(-1,1) made the difference. Thanks.

Upvotes: 1

Views: 2666

Answers (2)

Itamar Mushkin

Reputation: 2895

This comes up a lot in scikit-learn.
According to the docs for the scaler's .transform method, the input has to be a 2D array whose second dimension is the number of features:

Perform standardization by centering and scaling

Parameters: X : array-like, shape [n_samples, n_features] The data used to scale along the features axis.

Now, the last dimension has to be explicitly set to 1, not missing. Before the data is reshaped (i.e. y=y.reshape(-1,1)), the last dimension is missing - see this example:

import numpy as np
a = np.array([0,0,0])
print(a) # [0 0 0]
print(a.shape) # (3,)
b = a.reshape(-1,1)
print(b) # [[0] [0] [0]] (printed as a column, one row per line)
print(b.shape) # (3, 1)

The reshape method returns the array with a new shape: for example, if a is an array with 6 elements (in whatever shape), a.reshape(3,2) gives it the shape 3-by-2.
The -1 argument means "infer this dimension from the number of elements so that the data fits".
So a.reshape(-1,1) converts an array with n elements into an n-by-1 2D array (without having to specify n explicitly).
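Applied to the question, a minimal sketch might look like the following. The salary values are made up for illustration, and the names y_2d, y_scaled and y_back are not from the original code; inverse_transform is shown only as one way to get back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical salary values standing in for the asker's target column
y = np.array([45000.0, 50000.0, 60000.0, 80000.0, 110000.0])

# -1 lets NumPy infer the number of rows: 5 samples, 1 feature
y_2d = y.reshape(-1, 1)
print(y_2d.shape)

sc_y = StandardScaler()
y_scaled = sc_y.fit_transform(y_2d)   # accepted, because the input is 2D

# ravel() flattens back to 1D for estimators that expect a 1D target
print(y_scaled.ravel())

# inverse_transform undoes the scaling, e.g. for reporting predictions
y_back = sc_y.inverse_transform(y_scaled).ravel()
```

Keeping the fitted sc_y around is what makes it possible to convert scaled predictions back to salaries later.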

Upvotes: 1

Darren Christopher

Reputation: 4779

In short, yes, you need to reshape it. Per the scikit-learn documentation, fit_transform expects X (the predictor variables) to be a 2D array of shape (n_samples, n_features), which makes sense given what it is used for. A 1D array is ambiguous: it could be 1 sample with n features or n samples with 1 feature, so the function raises the error you saw rather than guessing. The code below should make this clearer:

In [1]: x_arr
Out[1]: array([1, 2, 3, 4, 5]) # 1D: could be 1 sample of 5 features or 5 samples of 1 feature

In [2]: x_arr.reshape(-1,1)
Out[2]: 
array([[1], # 1st sample
       [2], # 2nd sample
       [3], # 3rd sample
       [4], # 4th sample
       [5]])# 5th sample
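To see the three cases side by side, here is a small sketch, assuming a recent scikit-learn release in which 1D input is rejected with a ValueError. The names as_samples, as_features and raised are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_arr = np.array([1, 2, 3, 4, 5], dtype=float)

# 1D input: scikit-learn cannot tell samples from features, so it refuses
try:
    StandardScaler().fit_transform(x_arr)
    raised = False
except ValueError as exc:
    raised = True
    print("1D input rejected:", type(exc).__name__)

as_samples = x_arr.reshape(-1, 1)    # 5 samples, 1 feature
as_features = x_arr.reshape(1, -1)   # 1 sample, 5 features

scaled_samples = StandardScaler().fit_transform(as_samples)
scaled_features = StandardScaler().fit_transform(as_features)

print(scaled_samples.ravel())   # standardized across the 5 samples
print(scaled_features)          # each feature has a single value, so all zeros
```

Both reshapes are accepted, but only reshape(-1, 1) standardizes the values against each other, which is what a single-feature column needs.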

Anyway, on how you use StandardScaler (unrelated to why your code produces the error, which is answered above): you want to use the same fitted StandardScaler throughout your data. Generally speaking, scaling the target variable isn't necessary, since it's the variable you'd like to predict, not a predictor (assuming y in your code is the target variable).

First, you'd like to store the mean and standard deviation of your training data so they can be used later for scaling the test data.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Here the scaler learns the mean and std of the training data
# (StandardScaler ignores y, so only x_train is passed)
x_train_scaled = scaler.fit_transform(x_train)

# Use here to transform test data
# This ensures both the train and test data are in the same scale
x_test_scaled = scaler.transform(x_test)
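For a version that runs end to end, here is a sketch with synthetic data. The data, the 75/25 split and the random seed are assumptions made for illustration; mean_ and scale_ are the attributes where the fitted scaler stores the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data, made up for this example
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(20, 1))
y = 3.0 * X.ravel() + rng.normal(size=20)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learns mean and std from train only
x_test_scaled = scaler.transform(x_test)        # reuses the train statistics

# The learned statistics are kept on the fitted scaler
print(scaler.mean_, scaler.scale_)
```

Because the test set is transformed with the training mean and standard deviation, its scaled values will generally not have exactly zero mean and unit variance, and that is expected.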

Hope this helps!

Upvotes: 3
