Reputation: 25
I recently started working in machine learning and related topics using Python. Today I am working on a dataset to which I would like to apply a dimension reduction before fitting my model and evaluating its score. This dataset has 30 features.
I start with a simple algorithm, logistic regression, but before applying it I want to do a PCA. To determine the best number of components I used GridSearchCV over a pipeline made of the PCA (tuning the number of components) and the logistic regression (tuning only the C parameter). The result I get is that the more components I use in the PCA, the better the precision score: for example, with n_components=30 I get a precision score of 0.81.
The problem is that I thought PCA was used for dimension reduction (i.e. working with fewer features) and that it could help increase the score. Is there something I am not understanding?
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pca = PCA()
logistic = LogisticRegression()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
param_grid = {
    'pca__n_components': [5, 10, 15, 20, 25, 30],
    'logistic__C': [0.01, 0.1, 1, 10, 100]
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='precision')
search.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)
Output: Best parameter (CV score=0.881): {'logistic__C': 0.01, 'pca__n_components': 30}
Thanks in advance for your reply
EDIT: I added this screenshot for more information on how the score varies with the number of components.
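For reference, here is a minimal sketch of how such a score-versus-components curve can be plotted from the results DataFrame built above (the column names follow scikit-learn's cv_results_ convention):

import matplotlib.pyplot as plt

# Best cross-validated precision for each number of components,
# taking the best value of C at each point.
best_by_ncomp = results.groupby('param_pca__n_components')['mean_test_score'].max()
plt.plot(best_by_ncomp.index.astype(int), best_by_ncomp.values, marker='o')
plt.xlabel('Number of PCA components')
plt.ylabel('Mean CV precision')
plt.show()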
Upvotes: 0
Views: 525
Reputation: 998
In general, when you do dimension reduction, you lose some information. It is not surprising, then, that you get a higher score with the full set of PCA features. Working with fewer features can indeed help increase the score, but not necessarily, and there are other good reasons for using PCA for dimension reduction. Here are the main advantages of PCA:
PCA is a good technique for dimension reduction (with its own limitations) in the sense that it concentrates the variance of the dataset in the first dimensions of the computed new space. Hence, dropping the last features is done at a minimal cost in terms of information carried by the dataset (under certain hypotheses). Using PCA for dimension reduction mitigates the risk of overfitting by limiting the number of features while losing a minimal amount of information. In this sense, fewer features can increase the score by avoiding overfitting, but that is not always true.
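To illustrate (a minimal sketch, not tied to your data), the share of variance carried by each component can be read from explained_variance_ratio_, and the number of components needed to keep a given fraction of the variance follows directly:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_train)                       # X_train as in your question
cumulative = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components keeping, say, 95% of the variance
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components_95)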
Dimension reduction with PCA can also be useful when working with noisy data. PCA does not directly eliminate the noise, but the first few features will have a higher signal-to-noise ratio since the variance of the dataset is concentrated there. The last features may then be dominated by noise and can be dropped.
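A rough sketch of that idea (the cut-off of 10 components below is an arbitrary choice for illustration): project onto the first components and map back with inverse_transform, which reconstructs the data without the variance carried by the dropped, noise-dominated components.

from sklearn.decomposition import PCA

pca = PCA(n_components=10)                     # arbitrary cut-off, for illustration only
X_reduced = pca.fit_transform(X_train)
X_denoised = pca.inverse_transform(X_reduced)  # reconstruction without the last components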
Since PCA projects the dataset onto a new orthonormal basis, the new features will all be linearly uncorrelated with each other. Many machine learning algorithms rely on this property to achieve optimal performance.
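You can check this directly: the covariance matrix of the transformed data is diagonal up to numerical precision, i.e. the off-diagonal (cross-feature) covariances are essentially zero.

import numpy as np
from sklearn.decomposition import PCA

X_new = PCA().fit_transform(X_train)
cov = np.cov(X_new, rowvar=False)              # columns of X_new are the new features
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max())                  # close to zero: the components are uncorrelated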
Of course, PCA should not be used blindly in every case, as it has its own hypotheses and limitations. Here are what I consider the main ones (non-exhaustive):
PCA is sensitive to the scaling of the variables. As an example, if you have a temperature column in your dataset, you will get a different transformation depending on whether you use Celsius or Fahrenheit as the unit, because their scales are different. When the variables have different scales, PCA is somewhat arbitrary. This can be corrected by scaling all variables to unit variance, but at the cost of modifying (compressing or expanding) the fluctuations of the variables in all dimensions.
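With scikit-learn, this is typically handled by putting a StandardScaler in front of the PCA; a sketch based on your pipeline:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ('scale', StandardScaler()),               # puts every feature on unit variance
    ('pca', PCA()),
    ('logistic', LogisticRegression()),
])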
PCA captures linear correlations between the features but fails to capture non-linear correlations.
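If you suspect non-linear structure, a kernelised variant such as KernelPCA can be tried as a drop-in replacement (the kernel and number of components below are arbitrary illustrative choices):

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=10, kernel='rbf')  # arbitrary settings, for illustration only
X_kpca = kpca.fit_transform(X_train)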
What would be interesting in your case is to compare the score obtained with and without the PCA transformation. You would then see whether there is a benefit in using it.
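A minimal sketch of that comparison, using cross-validated precision with and without the PCA step (reusing X_train and y_train from your question; the n_components=20 is just a placeholder):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

with_pca = Pipeline([('pca', PCA(n_components=20)), ('logistic', LogisticRegression())])
without_pca = LogisticRegression()

score_with = cross_val_score(with_pca, X_train, y_train, cv=5, scoring='precision').mean()
score_without = cross_val_score(without_pca, X_train, y_train, cv=5, scoring='precision').mean()
print(score_with, score_without)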
Last but not least, your plot shows an interesting thing: the gain in score between 20 and 30 features is very small (about 1%?). You may wonder whether it is worth keeping ten additional features for such a small gain. Indeed, keeping more features increases the risk of ending up with a model that generalizes less well. Cross-validation already mitigates this risk, but there is no guarantee that the unseen data you apply the model to will have exactly the same properties as your training dataset.
Upvotes: 0