user2115183

Reputation: 851

Python Memory Error - Sklearn Huge Input Data?

I need to train an SVM classifier in sklearn. The dimensionality of the feature vectors is in the hundreds of thousands, and there are tens of thousands of such feature vectors. However, each dimension can only be 0, 1 or -1, and only around 100 dimensions are non-zero in each feature vector. Is there an efficient way to pass the feature vectors to the classifier?

Upvotes: 1

Views: 2681

Answers (1)

ogrisel

Reputation: 40169

I need to train an SVM classifier in sklearn.

You mean sklearn.svm.SVC? For high dimensional sparse data and many samples, LinearSVC, LogisticRegression, PassiveAggressiveClassifier or SGDClassifier can be much faster to train for comparable predictive accuracy.
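As a minimal sketch of the suggestion above, here is LinearSVC trained directly on a sparse matrix. The data is synthetic (random ±1 entries with roughly 100 non-zeros per row, mimicking the question's setup); substitute your own feature matrix and labels.

```python
# Sketch: a linear classifier on high-dimensional sparse data.
# The matrix and labels below are synthetic stand-ins.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
n_samples, n_features = 1000, 100_000

# ~100 non-zeros per row, each entry drawn from {-1, +1}
X = sparse_random(
    n_samples, n_features,
    density=100 / n_features,
    format="csr",
    random_state=rng,
    data_rvs=lambda n: rng.choice([-1.0, 1.0], size=n),
)
y = rng.randint(0, 2, size=n_samples)

clf = LinearSVC()
clf.fit(X, y)  # sklearn's linear models accept scipy.sparse input directly
```

Any of the other estimators mentioned (SGDClassifier, LogisticRegression, PassiveAggressiveClassifier) can be dropped into the same code, since they all share the fit/predict interface and accept sparse matrices.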

The dimensionality of the feature vectors is in the hundreds of thousands, and there are tens of thousands of such feature vectors. However, each dimension can only be 0, 1 or -1, and only around 100 dimensions are non-zero in each feature vector. Is there an efficient way to pass the feature vectors to the classifier?

Find a way to load your data as a scipy.sparse matrix that does not store the zeros in memory. Have a look at the documentation on feature extraction. It will give you tools to do that depending on the nature of the representation of the original data.
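For instance, if your features can be read as (sample index, feature index, value) triplets, a scipy.sparse matrix can be assembled without ever materializing the zeros. The toy triplets below are a hypothetical stand-in for however your data is actually stored:

```python
# Sketch: building a sparse CSR matrix from (row, col, value) triplets.
# Only the non-zero entries are stored in memory.
from scipy.sparse import coo_matrix

rows = [0, 0, 1, 2]      # sample indices (toy data)
cols = [5, 40, 7, 999]   # feature indices of the non-zero entries
vals = [1, -1, 1, -1]    # the non-zero values, each +/-1

X = coo_matrix((vals, (rows, cols)), shape=(3, 1000)).tocsr()
```

Here only 4 values are stored rather than 3 × 1000, and the resulting CSR matrix can be passed straight to the classifier's fit method.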

Upvotes: 2
