Reputation: 515
I am using scikit-learn's svm module to classify images. I was wondering: when I call fit on the data, does it work sequentially, or does it erase what it learned previously and re-fit to the new data only? For example, if I fit the classifier on 100 images, can I then fit it on another 100 images, or will the SVM discard the work it performed on the original 100? This is difficult for me to explain, so I'll provide an example:
In order to fit a SVM classifier to 200 images can I do this:
from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(test.data[0:100], test.target[0:100])
clf.fit(test.data[100:200], test.target[100:200])
Or must I do this:
clf = SVC(kernel='linear')
clf.fit(test.data[:200], test.target[:200])
I am asking only because I run into memory errors when calling .fit(X, y) with too many images at once. So is it possible to call fit sequentially and "increment" my classifier, so that it is technically trained on 10000 images but sees only 100 at a time?
If this is possible, please confirm and explain how. And if it's not possible, please explain why.
Upvotes: 2
Views: 3257
Reputation: 9390
http://scikit-learn.org/stable/developers/index.html#estimated-attributes
The last-mentioned attributes are expected to be overridden when you call fit a second time without taking any previous value into account: fit should be idempotent.
https://en.wikipedia.org/wiki/Idempotent
So yes, the second call will erase the old model and compute a new one. You can check this yourself by reading the Python source, for example in sklearn/svm/classes.py.
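A quick way to see this without reading the source is a small experiment. The sketch below uses synthetic data (the arrays X and y are made up for illustration, not the asker's images) and compares a classifier fitted twice against one fitted only on the second half. If fit were incremental, the two would be expected to differ.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for image features; hypothetical data for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit twice on different halves, as in the question.
clf = SVC(kernel='linear')
clf.fit(X[:100], y[:100])
clf.fit(X[100:], y[100:])  # this call replaces the model from the first call

# Reference model that has only ever seen the second half.
ref = SVC(kernel='linear').fit(X[100:], y[100:])

# Same weights and intercept means the first 100 samples left no trace.
same_as_second_half = (
    np.allclose(clf.coef_, ref.coef_)
    and np.allclose(clf.intercept_, ref.intercept_)
)
print(same_as_second_half)
```

The support vector indices in clf.support_ also refer only to rows of the last array passed to fit, which is another sign that nothing from the first call is kept.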
I think you need minibatch training, but I don't see a partial_fit implementation for SVM. Maybe that's because the scikit-learn team recommends SGDClassifier and SGDRegressor for datasets with more than 100k samples (http://scikit-learn.org/stable/tutorial/machine_learning_map/). Try using them with minibatches, as described here.
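As a rough sketch of what minibatch training could look like: SGDClassifier with loss='hinge' optimizes a linear SVM objective and supports partial_fit, so the data can be fed 100 samples at a time. The data, batch size, and variable names below are assumptions for illustration. In practice each batch would be loaded from disk inside the loop rather than sliced from an array already in memory.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the image features; hypothetical data for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Hinge loss gives a linear SVM trained by stochastic gradient descent.
clf = SGDClassifier(loss='hinge', random_state=0)

# partial_fit needs the full set of labels up front, because a single
# batch is not guaranteed to contain every class.
classes = np.unique(y)

batch_size = 100  # assumed batch size, matching the question
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # Each call updates the existing weights instead of starting over.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

accuracy = clf.score(X, y)
print(accuracy)
```

Note that this is not the same model as SVC(kernel='linear'): it is trained by SGD, so results can depend on the order of the batches, on feature scaling, and on how many passes are made over the data.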
Upvotes: 3