Reputation: 303
I have transformed my dataset (with 9 columns) using PowerTransformer to make the features approximately Gaussian, with standardization.
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson', standardize=True)

# Fit the transformer only on the training set, then transform the test set
X_train = pt.fit_transform(X_train)
X_test = pt.transform(X_test)
# The original data can be recovered with pt.inverse_transform(X)
So now most features in my dataset have an almost Gaussian distribution with zero mean and unit variance. Then I applied PolynomialFeatures():
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X_train)

LR2 = LinearRegression()
LR2.fit(X_poly, y_train)
After adding polynomial features I have 2380 columns, which can cause overfitting, so I wanted to use PCA for dimensionality reduction. However, I read somewhere that PCA needs the data to be "scaled" (which generally means changing the range of the values with something like MinMaxScaler()).
So should I use MinMaxScaler() before applying PCA to the Yeo-Johnson-transformed (and standardized) dataset?
Upvotes: 2
Views: 1224
Reputation: 19307
Standardization is important in PCA since it is a variance-maximizing exercise: it projects your original data onto the directions that maximize the variance. If you do not standardize first, the feature with the largest scale dominates, and the first principal component appears to explain almost all of the variance in the data.
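To see this effect numerically, here is a minimal sketch with synthetic data (the feature scales below are made up purely for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# three independent features on wildly different scales (hypothetical values)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

# unscaled: the largest-scale feature dominates the first component
print(PCA().fit(X).explained_variance_ratio_)  # roughly [0.99, 0.01, 0.0001]
# standardized: the variance is spread far more evenly across components
print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)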
In your case, you are using the power transform with standardize=True, which already sets each feature's mean to 0 and standard deviation to 1. Normalization (rescaling each variable to the 0-1 range) is usually not preferred before PCA because it does little to handle the existing skewness and outliers in the data.
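As a quick illustration of the outlier point, consider what MinMaxScaler() does to a feature containing a single extreme value (the numbers here are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])  # one outlier
print(MinMaxScaler().fit_transform(x).ravel())
# [0.     0.0101 0.0202 1.    ]  -- the outlier squashes everything else near 0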
So there is no need for a MinMaxScaler() if your features are already standardized.
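If it helps, here is one way the steps from your question could be chained with a scikit-learn Pipeline, so every transform is fit only on the training data. Note that n_components=0.95 for PCA is my assumption (keep enough components to explain 95% of the variance), not something from your setup:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

model = make_pipeline(
    PowerTransformer(method='yeo-johnson', standardize=True),
    PolynomialFeatures(degree=4),
    # no MinMaxScaler step: as discussed above, min-max scaling is not
    # needed before PCA when the inputs are already standardized
    PCA(n_components=0.95),  # assumption: keep 95% of the variance
    LinearRegression(),
)
model.fit(X_train, y_train)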
Upvotes: 1