Do I have to fit PCA separately for train and test data?

Question

I am considering using PCA (TruncatedSVD) to reduce the number of dimensions of my sparse matrix.

I split my data into train and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Do I have to do PCA separately for X_train and X_test?

pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.fit_transform(X_test)

Or do I have to fit only on the train data and then transform both the train and test data? Which is preferred?

pca.fit(X_train)
train = pca.transform(X_train)
test = pca.transform(X_test)
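
To make the second option concrete, here is a minimal sketch of fitting on the train split only and reusing the fitted object on the test split. It uses TruncatedSVD because the matrix is sparse. The synthetic data, the value n_components=10, and the random_state values are assumptions for illustration, not part of my real setup.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split

# Synthetic sparse matrix standing in for the real data (assumption).
rng = np.random.RandomState(0)
X = sparse_random(200, 50, density=0.1, format="csr", random_state=rng)
y = rng.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Learn the components from the training split only.
svd = TruncatedSVD(n_components=10, random_state=0)
X_train_red = svd.fit_transform(X_train)

# Project the test split onto the components learned from train;
# nothing is re-fitted here, so both splits share one feature space.
X_test_red = svd.transform(X_test)
```

With this arrangement the test rows never influence the learned components, and the reduced train and test columns mean the same thing.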

EDIT:

I am doing a classification task. My actual dataset has a column called project_description. I applied BoW (CountVectorizer) to that column to turn it into count vectors, and then applied PCA to those vectors to reduce the dimensions.

My actual dataset also has other columns such as price, place, date, share%, etc.

Now, do I have to apply PCA to the other columns of my actual dataset before concatenating them with the PCA-reduced BoW vectors?
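
For reference, one possible arrangement of the setup described above is sketched below: the BoW vectors are reduced on their own and the remaining numeric columns are scaled and concatenated without any dimensionality reduction. This is only one option under stated assumptions, not a claim that it is the required approach. The toy rows, the column name share, the choice of StandardScaler, and n_components=2 are all hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy rows standing in for the real dataset (assumption).
train = pd.DataFrame({
    "project_description": [
        "solar farm build", "road repair project",
        "school solar roof", "bridge repair work",
    ],
    "price": [100.0, 250.0, 80.0, 300.0],
    "share": [10.0, 25.0, 5.0, 40.0],
})
test = pd.DataFrame({
    "project_description": ["solar roof repair", "new road build"],
    "price": [120.0, 270.0],
    "share": [12.0, 30.0],
})

# Text branch: count vectors, then dimensionality reduction.
text_pipe = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)

features = ColumnTransformer([
    # A plain string (not a list) passes a 1-D column to CountVectorizer.
    ("text", text_pipe, "project_description"),
    # Numeric columns are only scaled here, not reduced.
    ("num", StandardScaler(), ["price", "share"]),
])

# Every step is fitted on train only, then reused on test.
X_train_all = features.fit_transform(train)
X_test_all = features.transform(test)
```

The resulting matrices hold the two reduced text components followed by the two scaled numeric columns, so the concatenation happens inside the transformer.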
