ng0323

Reputation: 367

How to overcome SVM memory requirement

I am using the SVM function (LinearSVC) in scikit-learn. My dataset and number of features are quite large, but my PC's RAM is insufficient, which causes swapping and slows everything down. Please suggest how I can deal with this (besides increasing RAM).

Upvotes: 7

Views: 4295

Answers (1)

jakevdp

Reputation: 86320

In short, without reducing the size of your data or increasing the RAM on your machine, you will not be able to use SVC here. As implemented in scikit-learn (via the libsvm/liblinear wrappers), the algorithm requires all of the training data in memory at once.

One option for larger datasets is to move to a model that allows online fitting, via the partial_fit() method. One example of an online algorithm that is very close to SVC is the Stochastic Gradient Descent Classifier, implemented in sklearn.linear_model.SGDClassifier. Through its partial_fit method, you can fit your data just a bit at a time, and not encounter the sort of memory issues that you might see in a one-batch algorithm like SVC. Here's an example:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs

# make some fake data: 1,000,000 points to train on, plus 10 held out
X, y = make_blobs(n_samples=1_000_010,
                  random_state=0)

# train on 100,000-point chunks of the data at a time;
# the default hinge loss makes SGDClassifier a linear SVM
clf = SGDClassifier()
for i in range(10):
    subset = slice(100_000 * i, 100_000 * (i + 1))
    # pass classes so every possible label is known from the first call
    clf.partial_fit(X[subset], y[subset], classes=np.unique(y))

# predict on unseen data
y_pred = clf.predict(X[-10:])

print(y_pred)
# [2 0 1 2 2 2 1 0 1 1]

print(y[-10:])
# [2 0 1 2 2 2 1 0 1 1]
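
Note that in the example above the full X array still lives in RAM, since it's just for illustration. To actually stay within memory, you'd read the data from disk in chunks and feed each chunk to partial_fit. Here's a minimal sketch, assuming (hypothetically) a CSV file data.csv whose last column, label, holds the class and whose set of class labels is known ahead of time:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# hypothetical: the full set of class labels, known up front,
# since partial_fit must see all classes on the first call
all_classes = np.array([0, 1, 2])

clf = SGDClassifier()
# read the (hypothetical) file 100,000 rows at a time so the
# full dataset never has to fit in memory
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    X_chunk = chunk.drop(columns="label").values
    y_chunk = chunk["label"].values
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)

The same pattern works with any chunked data source (e.g. a numpy memmap or a database cursor), as long as each chunk is a reasonable size and the classes argument is supplied.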

For more information on using scikit-learn for large datasets, you can check out the "Strategies to scale computationally: bigger data" page in the sklearn docs.

Upvotes: 4
