user2115183

Reputation: 851

Python Memory Error - Sklearn Huge Input Data?

I need to train an SVM classifier in sklearn. The dimensionality of the feature vectors is in the hundreds of thousands, and there are tens of thousands of such feature vectors. However, each dimension can only be 0, 1 or -1, and only around 100 dimensions are non-zero in each feature vector. Is there an efficient way to pass the feature vectors to the classifier?

Upvotes: 1

Views: 2681

Answers (1)

ogrisel

Reputation: 40169

I need to train an SVM classifier in sklearn.

You mean sklearn.svm.SVC? For high dimensional sparse data and many samples, LinearSVC, LogisticRegression, PassiveAggressiveClassifier or SGDClassifier can be much faster to train for comparable predictive accuracy.
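As a minimal sketch of the suggestion above, here is LinearSVC trained directly on a sparse matrix. The data is synthetic (random ±1 entries with roughly 100 non-zeros per row, mimicking the question's setup); substitute your own feature matrix and labels.

```python
# Sketch: a linear classifier on high-dimensional sparse data.
# The matrix and labels below are synthetic stand-ins.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
n_samples, n_features = 1000, 100_000

# ~100 non-zeros per row, each entry drawn from {-1, +1}
X = sparse_random(
    n_samples, n_features,
    density=100 / n_features,
    format="csr",
    random_state=rng,
    data_rvs=lambda n: rng.choice([-1.0, 1.0], size=n),
)
y = rng.randint(0, 2, size=n_samples)

clf = LinearSVC()
clf.fit(X, y)  # sklearn's linear models accept scipy.sparse input directly
```

Any of the other estimators mentioned (SGDClassifier, LogisticRegression, PassiveAggressiveClassifier) can be dropped into the same code, since they all share the fit/predict interface and accept sparse matrices.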

The dimensionality of the feature vectors is in the hundreds of thousands, and there are tens of thousands of such feature vectors. However, each dimension can only be 0, 1 or -1, and only around 100 dimensions are non-zero in each feature vector. Is there an efficient way to pass the feature vectors to the classifier?

Find a way to load your data as a scipy.sparse matrix that does not store the zeros in memory. Have a look at the documentation on feature extraction. It will give you tools to do that depending on the nature of the representation of the original data.
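For instance, if your features can be read as (sample index, feature index, value) triplets, a scipy.sparse matrix can be assembled without ever materializing the zeros. The toy triplets below are a hypothetical stand-in for however your data is actually stored:

```python
# Sketch: building a sparse CSR matrix from (row, col, value) triplets.
# Only the non-zero entries are stored in memory.
from scipy.sparse import coo_matrix

rows = [0, 0, 1, 2]      # sample indices (toy data)
cols = [5, 40, 7, 999]   # feature indices of the non-zero entries
vals = [1, -1, 1, -1]    # the non-zero values, each +/-1

X = coo_matrix((vals, (rows, cols)), shape=(3, 1000)).tocsr()
```

Here only 4 values are stored rather than 3 × 1000, and the resulting CSR matrix can be passed straight to the classifier's fit method.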

Upvotes: 2
