Reputation: 367
I am using the SVM function (LinearSVC) in scikit-learn. My dataset is quite large and has many features, but my PC's RAM is insufficient, which causes swapping and slows things down. Please suggest how I can deal with this (besides increasing RAM).
Upvotes: 7
Views: 4295
Reputation: 86320
In short, without reducing the size of your data or increasing the RAM on your machine, you will not be able to use SVC here. As implemented in scikit-learn (via libsvm wrappers), the algorithm requires seeing all the data at once.
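For context, here is a minimal sketch of the one-batch pattern the question describes (the data below is a small stand-in; with the real dataset, this single fit() call has to hold the whole feature matrix in memory at once):

from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# small stand-in for the large matrix described in the question;
# fit() sees all of X at once, so a very large X forces swapping
X, y = make_blobs(n_samples=1000, random_state=0)
clf = LinearSVC()
clf.fit(X, y)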
One option for larger datasets is to move to a model that supports online fitting via the partial_fit() method. One example of an online algorithm that is very close to SVC is the Stochastic Gradient Descent classifier, implemented in sklearn.linear_model.SGDClassifier. Through its partial_fit method, you can fit your data a bit at a time and avoid the sort of memory issues you might hit with a one-batch algorithm like SVC. Here's an example:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs

# make some fake data: 1,000,010 samples, so the last 10 stay unseen
X, y = make_blobs(n_samples=1000010, random_state=0)

# train on a subset of the data at a time
clf = SGDClassifier()
for i in range(10):
    subset = slice(100000 * i, 100000 * (i + 1))
    clf.partial_fit(X[subset], y[subset], classes=np.unique(y))

# predict on unseen data
y_pred = clf.predict(X[-10:])
print(y_pred)
# [2 0 1 2 2 2 1 0 1 1]
print(y[-10:])
# [2 0 1 2 2 2 1 0 1 1]
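If the data itself is too large to build in memory like this, the same partial_fit loop can consume it in chunks streamed from disk. A minimal sketch, assuming the features plus a label column named target live in a hypothetical data.csv and the full set of class labels is known ahead of time:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1, 2])  # assumption: the complete label set, which partial_fit needs up front
# read 100,000 rows at a time so only one chunk is ever in RAM
for chunk in pd.read_csv("data.csv", chunksize=100000):
    X_chunk = chunk.drop(columns="target").to_numpy()
    y_chunk = chunk["target"].to_numpy()
    clf.partial_fit(X_chunk, y_chunk, classes=classes)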
For more information on using scikit-learn for large datasets, check out the "Strategies to scale computationally: bigger data" page in the sklearn docs.
Upvotes: 4