Anjith

Reputation: 2308

Do I have to fit PCA separately for train and test data?

I am considering using PCA (TruncatedSVD) to reduce the number of dimensions of my sparse matrix.

I split my data into train and test sets.

X_train , X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Do I have to fit PCA separately on X_train and X_test?

pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.fit_transform(X_test)

Or do I have to fit only on the train data and then transform both the train and test data? Which is preferred?

pca.fit(X_train)
train = pca.transform(X_train)
test = pca.transform(X_test)

EDIT:

I am doing a classification task. My actual dataset has a column called project_description. I applied BoW (CountVectorizer) to that column to turn it into count vectors, and then applied PCA to reduce the dimensions.

My actual dataset also has other columns such as price, place, date, share%, etc.

Do I have to apply PCA to the rest of my dataset (i.e. the other columns) before concatenating it with the PCA-reduced BoW vectors?

Upvotes: 3

Views: 5292

Answers (1)

Sara

Reputation: 537

You should not fit any preprocessing step, such as dimensionality reduction or normalization, on the whole dataset, because that leaks information from the test set into training. Therefore:

  • First, split the dataset.

  • Then fit the scaler (standardization or normalization, depending on your data) on the train set only.

  • After that, use the fitted scaler to transform both the train set and the test set.

If you want to apply a dimensionality reduction method like PCA, you should then:

  • Fit PCA on the train set only.

  • Then use the fitted PCA to transform the test set as well (so only the second snippet in the question is correct).

This way, the test set can later be used to evaluate the model on unseen data points and check whether it generalizes well.

      from sklearn import model_selection, preprocessing
      from sklearn.decomposition import PCA

      # Split the dataset (X_array and y_array are your features and labels)
      X_train, X_test, y_train, y_test = model_selection.train_test_split(
          X_array, y_array, test_size=0.3, random_state=42
      )

      # Standardize the dataset
      scaler = preprocessing.StandardScaler()
      # Fit on the train set only
      scaler.fit(X_train)
      # Apply to both the train set and the test set
      X_train = scaler.transform(X_train)
      X_test = scaler.transform(X_test)

      # Apply PCA
      pca = PCA()
      # Fit on the train set only
      pca.fit(X_train)
      # Apply transform to both the train set and the test set
      X_train = pca.transform(X_train)
      X_test = pca.transform(X_test)
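
Since the question mentions a sparse BoW matrix and TruncatedSVD, here is a minimal sketch of the same fit-on-train-only idea using a scikit-learn Pipeline. The toy texts, labels, `n_components=2`, and the LogisticRegression classifier are made up for illustration and are not from the question. TruncatedSVD is used because it accepts sparse input directly and does not center the data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the project_description column and its labels.
texts = [
    "solar panel installation for schools",
    "community garden and compost project",
    "solar powered water pump",
    "urban garden with native plants",
    "wind and solar micro grid",
    "school garden teaching program",
    "solar lamp distribution",
    "rooftop garden for apartments",
    "solar charging station",
    "garden irrigation upgrade",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Split first, before any vocabulary or components are learned.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Each step is fitted on the data passed to fit() only (here, the train set).
model = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    LogisticRegression(),
)
model.fit(X_train, y_train)

# The test set is only transformed with the already-fitted steps.
X_test_reduced = model[:-1].transform(X_test)
accuracy = model.score(X_test, y_test)
```

Wrapping the steps in a Pipeline also makes cross-validation safe, because each fold refits the vectorizer and the SVD on that fold's training part only.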
    

Upvotes: 0
