Reputation: 1510
I have a pandas DataFrame that I am using for some regression analysis. I normalized the data as follows:
working_df = df.div(np.sqrt(np.sum(np.power(df.values, 2), axis=1)), axis=0)
This DataFrame contains 35 columns as features, so I select the dataset as follows:
X = working_df.iloc[:, 0:35]
y = target_df['target_property']
Then I use sklearn to do the train-test split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
My question is: do I really need to perform sc.transform(X_train) and sc.transform(X_test), given that my data is already normalized in the dataframe? If so, do I need to call sc.fit on X_train before calling sc.transform(X_train)? If not, why? Doing so, I obtained an R2 of 0.46 for linear regression, -0.21 for kernel ridge regression and 0.62 for a gradient boosting regressor with learning rate 0.3. These results seem somewhat confusing; could you please help me understand them?
Upvotes: 0
Views: 933
Reputation: 16
Do I really need to perform sc.transform(X_train) and sc.transform(X_test), as my data is already normalized in the dataframe?
The two are very different. What you did in your dataframe is L2 normalization: each row is treated as a vector and rescaled so that its L2 norm equals 1. StandardScaler from sklearn does standardization, which works per column (feature), not per row: it subtracts the column's mean and then divides by its standard deviation, so each feature ends up with zero mean and unit variance. If a feature's values came from a Gaussian distribution, the result follows a standard normal distribution.
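To see the difference concretely, here is a minimal sketch on made-up random data (the column names and the distribution are assumptions for illustration only). It applies the row-wise L2 normalization from the question and sklearn's StandardScaler to the same array, then checks what each one actually normalizes: row norms in the first case, per-column mean and standard deviation in the second.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical data: 100 samples, 4 features (not the asker's dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(100, 4)),
                  columns=list("abcd"))

# Row-wise L2 normalization, exactly as written in the question.
l2_df = df.div(np.sqrt(np.sum(np.power(df.values, 2), axis=1)), axis=0)
row_norms = np.linalg.norm(l2_df.values, axis=1)   # one norm per row

# The same operation using sklearn's helper, for comparison.
l2_sklearn = normalize(df.values, norm="l2", axis=1)

# Column-wise standardization: subtract each column's mean,
# divide by each column's standard deviation.
std = StandardScaler().fit_transform(df.values)
col_means = std.mean(axis=0)   # one mean per column
col_stds = std.std(axis=0)     # one std per column

print("row norms after L2 normalization all 1:", np.allclose(row_norms, 1.0))
print("column means after StandardScaler all 0:", np.allclose(col_means, 0.0))
print("column stds after StandardScaler all 1:", np.allclose(col_stds, 1.0))
```

After L2 normalization the columns generally do not have zero mean or unit variance, and after standardization the rows generally do not have unit norm, so one step does not make the other redundant.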
As for which transformations you should apply for regression, I don't think there is a general rule. L2 normalization and standard scaling are general data transformations that may or may not improve regression performance; which one helps can only be determined empirically. The same goes for whether to use them together or only one of them.
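One way to run that empirical comparison is sketched below. It uses synthetic data from make_regression and a Ridge model purely as placeholders (both are assumptions, not the asker's data or models), and compares cross-validated R2 for four preprocessing choices. Wrapping each choice in a pipeline means the scaler is fitted on the training folds only and then applied to the held-out fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Placeholder data with 35 features; swap in your own X and y.
X, y = make_regression(n_samples=200, n_features=35, noise=10.0,
                       random_state=0)

# Each candidate is a preprocessing choice followed by the same model.
candidates = {
    "no scaling": make_pipeline(Ridge()),
    "L2 rows only": make_pipeline(Normalizer(norm="l2"), Ridge()),
    "standardize only": make_pipeline(StandardScaler(), Ridge()),
    "L2 rows + standardize": make_pipeline(Normalizer(norm="l2"),
                                           StandardScaler(), Ridge()),
}

# Mean 5-fold cross-validated R2 for each candidate.
scores = {
    name: cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
    for name, pipe in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean R2 = {score:.3f}")
```

The ranking on this toy data says nothing about your dataset; the point is only the pattern of comparing the options under the same cross-validation split and the same model.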
Upvotes: 0