Reputation: 2308
I am considering using PCA (TruncatedSVD) to reduce the number of dimensions of my sparse matrix.
I split my data into train and test split.
X_train , X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Do I have to do PCA separately for X_train and X_test?
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.fit_transform(X_test)
Or do I have to fit only on the train data and then transform both the train and test data? Which is preferred?
pca.fit(X_train)
train = pca.transform(X_train)
test = pca.transform(X_test)
EDIT:
I am doing a classification task. My actual dataset has a column called project_description. I applied BoW (CountVectorizer) to that column to turn it into count vectors, and then applied PCA to those vectors to reduce the dimensions.
My actual dataset also has other columns such as price, place, date, share%, etc.
Do I now have to apply PCA to the other columns of my actual dataset as well before concatenating them with the PCA-reduced BoW vectors?
Upvotes: 3
Views: 5292
Reputation: 537
You should not fit any preprocessing step, such as dimensionality reduction or normalization, on the whole dataset. Therefore:
First, split the dataset.
Then fit the scaler for standardization (or normalization, depending on your data) on the train set only.
After that, use the fitted scaler to transform the test set as well.
If you want to apply a dimensionality reduction method such as PCA, you should then:
Fit PCA on the train set only.
Then use the fitted PCA to transform the test set as well.
(So only the second snippet in the question is correct.)
This way, the test set can later be used to evaluate the model on unseen data points and check whether it generalizes well.
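To see why refitting on the test set is a problem, here is a minimal sketch on synthetic data. The random matrix, the variable names and `n_components=3` are assumptions made for illustration only; they are not from the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Synthetic, correlated features (assumed data, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

# One projection, learned from the train set and reused on the test set
pca = PCA(n_components=3).fit(X_train)
test_shared = pca.transform(X_test)

# A second PCA fitted on the test set learns its own mean and its own axes
pca_test = PCA(n_components=3).fit(X_test)
test_separate = pca_test.transform(X_test)

# Compare the learned axes and the resulting test features
same_axes = np.allclose(pca.components_, pca_test.components_)
same_features = np.allclose(test_shared, test_separate)
print("same axes:", same_axes, "| same test features:", same_features)
```

If the two fits disagree, the columns of the separately transformed test set no longer mean the same thing as the columns the model was trained on, which is the practical reason to fit once on the train set.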
from sklearn import model_selection, preprocessing
from sklearn.decomposition import PCA

# Split the dataset
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_array, y_array, test_size=0.3, random_state=42)
# Standardize the dataset
scaler = preprocessing.StandardScaler()
# Fit on the train set only
scaler.fit(X_train)
# Apply to both the train set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Apply PCA
pca = PCA()
# Fit on the train set only
pca.fit(X_train)
# Apply transform to both the train set and the test set.
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
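The same fit-on-train, transform-on-test order can also be handled by a scikit-learn Pipeline, which makes it harder to fit on the test set by accident. This is only a sketch: the synthetic `X_array` and `y_array`, `n_components=5` and the choice of `LogisticRegression` are assumptions, not part of the original question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_array = rng.normal(size=(300, 20))
y_array = (X_array[:, 0] + X_array[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_array, y_array, test_size=0.3, random_state=42)

# fit() fits the scaler and PCA on the train set only;
# score() and predict() only call transform() on whatever data they are given
model = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Number of rows the scaler saw while fitting (the train rows only)
n_fit = model.named_steps["standardscaler"].n_samples_seen_
print(n_fit, len(X_train))
```

For a sparse bag-of-words matrix like the one in the question, `TruncatedSVD` could probably be swapped in for `PCA` in the same pipeline, since it accepts sparse input directly; `StandardScaler` would then need `with_mean=False`, because centering a sparse matrix is not supported.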
Upvotes: 0