Reputation: 894
I'm attempting to use sklearn 0.11's LogisticRegression object to fit a model on 200,000 observations with about 80,000 features. The goal is to classify short text descriptions into 1 of 800 classes.
When I attempt to fit the classifier, pythonw.exe gives me an Application Error:
"The instruction at ... referenced memory at 0x00000000. The memory could not be written."
The features are extremely sparse, about 10 per observation, and are binary (either 1 or 0), so by my back-of-the-envelope calculation my 4 GB of RAM should be able to handle the memory requirements, but that doesn't appear to be the case. The model only fits when I use fewer observations and/or fewer features.
If anything, I would like to use even more observations and features. My naive understanding is that the liblinear library running things behind the scenes is capable of supporting that. Any ideas for how I might squeeze a few more observations in?
My code looks like this:
y_vectorizer = LabelVectorizer(y) # my custom vectorizer for labels
y = y_vectorizer.fit_transform(y)
x_vectorizer = CountVectorizer(binary=True, analyzer=features)
x = x_vectorizer.fit_transform(x)
clf = LogisticRegression()
clf.fit(x, y)
The features() function I pass to analyzer just returns a list of strings indicating the features detected in each observation.
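For reference, an analyzer passed to CountVectorizer is just a callable that maps one raw document to a list of token strings; the real features() is more involved, but a toy stand-in would look like this:
# Toy stand-in for features(); the real implementation extracts
# domain-specific features from each short text description.
def features(description):
    return description.lower().split()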
I'm using Python 2.7, sklearn 0.11, Windows XP with 4 GB of RAM.
Upvotes: 7
Views: 8451
Reputation: 40169
liblinear (the backing implementation of sklearn.linear_model.LogisticRegression) will host its own copy of the data because it is a C++ library whose internal memory layout cannot be directly mapped onto a pre-allocated sparse matrix in scipy such as scipy.sparse.csr_matrix or scipy.sparse.csc_matrix.
In your case I would recommend loading your data as a scipy.sparse.csr_matrix and feeding it to a sklearn.linear_model.SGDClassifier (with loss='log' if you want a logistic regression model and the ability to call the predict_proba method). SGDClassifier will not copy the input data if it is already using the scipy.sparse.csr_matrix memory layout.
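A minimal sketch of that approach, assuming x is the sparse matrix produced by CountVectorizer above and y is a 1-D array of integer class labels (which may require adapting the custom label vectorizer):
from sklearn.linear_model import SGDClassifier

# Keep the features in CSR layout so SGDClassifier can use them without copying.
x = x.tocsr()

# loss='log' makes SGDClassifier fit a logistic regression model and
# exposes the predict_proba method.
clf = SGDClassifier(loss='log')
clf.fit(x, y)

predictions = clf.predict(x)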
Expect it to allocate a dense model of 800 * (80000 + 1) * 8 bytes / (1024 ** 2) ≈ 488 MB in memory (in addition to the size of your input dataset).
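That figure comes from the shape of the fitted model: one float64 coefficient per (class, feature) pair plus one intercept per class. A quick check:
n_classes = 800
n_features = 80000
# coef_ is an (n_classes, n_features) float64 array (8 bytes per entry),
# plus one intercept per class, hence the (n_features + 1) term.
model_bytes = n_classes * (n_features + 1) * 8
print(model_bytes / (1024.0 ** 2))  # ~488.3 MB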
Edit: how to optimize the memory access for your dataset
To free memory after dataset extraction you can:
from sklearn.externals import joblib

x_vectorizer = CountVectorizer(binary=True, analyzer=features)
x = x_vectorizer.fit_transform(x)
# Dump the extracted features in CSR layout so they can be reloaded in a fresh process.
joblib.dump(x.tocsr(), 'dataset.joblib')
Then quit this Python process (to force complete memory deallocation) and, in a new process:
x_csr = joblib.load('dataset.joblib')
Under Linux / OS X you could memory-map it even more efficiently with:
x_csr = joblib.load('dataset.joblib', mmap_mode='c')
mmap_mode='c' maps the underlying arrays from disk in copy-on-write mode, so the data is paged in on demand instead of being read fully into memory up front.
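Putting the pieces together, the new process could then look something like this (a sketch only; 'labels.joblib' is a hypothetical dump of y made the same way as the features):
from sklearn.externals import joblib
from sklearn.linear_model import SGDClassifier

# Memory-map the CSR feature matrix (Linux / OS X) and reload the labels.
# 'labels.joblib' is assumed to have been dumped alongside 'dataset.joblib'.
x_csr = joblib.load('dataset.joblib', mmap_mode='c')
y = joblib.load('labels.joblib')

clf = SGDClassifier(loss='log')
clf.fit(x_csr, y)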
Upvotes: 25