Abhishek Gupta

Reputation: 6615

Train scikit svm one by one (online or stochastic training)

I am using the scikit-learn library for SVMs. I have a huge amount of data which I can't read in all at once to pass to the fit() function.
I want to iterate over all my data, which is in a file, and train the SVM one sample at a time. Is there any way to do this? It is not clear from the documentation, and in the tutorial they pass the complete dataset to fit at once.
Is there any way to train it incrementally (meaning something like calling fit for every input pattern of the training data)?

Upvotes: 6

Views: 2294

Answers (1)

ogrisel

Reputation: 40149

Support Vector Machines (at least as implemented in libsvm, which scikit-learn wraps) are fundamentally a batch algorithm: they need access to all the data in memory at once. Hence they are not scalable to datasets that don't fit in memory.

Instead you should use models that support incremental learning via the partial_fit method. For instance, some linear models such as sklearn.linear_model.SGDClassifier support partial_fit. You can slice your dataset and load it as a sequence of minibatches with shape (batch_size, n_features). batch_size can be 1, but that is inefficient because of the Python interpreter overhead (plus the data-loading overhead). So it is recommended to load samples in minibatches of at least 100.
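To make this concrete, here is a minimal sketch of the minibatch pattern described above. The iter_minibatches generator is hypothetical stand-in code that produces random toy data; in practice you would replace it with your own loader that reads chunks from your file and yields (batch_size, n_features) arrays:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical minibatch source; replace with a loader that reads
# your file in chunks and yields (X_batch, y_batch) pairs.
def iter_minibatches(n_batches=10, batch_size=100, n_features=20, seed=0):
    rng = np.random.RandomState(seed)
    for _ in range(n_batches):
        X = rng.randn(batch_size, n_features)
        # toy labels: sign of a linear combination of two features
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

# hinge loss makes SGDClassifier a linear SVM trained with SGD
clf = SGDClassifier(loss="hinge")

# all classes must be declared on the first partial_fit call,
# since any single minibatch may not contain every class
all_classes = np.array([0, 1])

for X_batch, y_batch in iter_minibatches():
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

# evaluate on a held-out toy batch
X_test, y_test = next(iter_minibatches(n_batches=1, seed=42))
print(clf.score(X_test, y_test))
```

Each call to partial_fit performs one pass of stochastic gradient updates over the minibatch, so memory usage stays bounded by the batch size rather than the full dataset size.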

Upvotes: 15

Related Questions