Vivek Kalyanarangan

Reputation: 9081

sklearn kneighbours memory error python

I am working on a Windows 7 machine with 8 GB of RAM.

This is the vectorizer I am using to vectorize a free-text column in my 52 MB training dataset:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(analyzer='word', stop_words='english', decode_error='ignore', binary=True)

I want to find the 5 nearest neighbours in this dataset for each row of an 18 MB test set:

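from sklearn.neighbors import NearestNeighbors
# Fit on the vectorized training text, then query with the vectorized test text.
nbrs = NearestNeighbors(n_neighbors=5).fit(vec.transform(data['clean_sum']))
vectors = vec.transform(data_test['clean_sum'])
distances, indices = nbrs.kneighbors(vectors)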

This is the stack trace:

Traceback (most recent call last):
  File "cr_nearness.py", line 224, in <module>
    distances,indices = nbrs.kneighbors(vectors)
  File "C:\Anaconda2\lib\site-packages\sklearn\neighbors\base.py", line 371,
kneighbors
    n_jobs=n_jobs, squared=True)
  File "C:\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 12
in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "C:\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 10
in _parallel_pairwise
    return func(X, Y, **kwds)
  File "C:\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 23
n euclidean_distances
    distances = safe_sparse_dot(X, Y.T, dense_output=True)
  File "C:\Anaconda2\lib\site-packages\sklearn\utils\extmath.py", line 181,
afe_sparse_dot
    ret = ret.toarray()
  File "C:\Anaconda2\lib\site-packages\scipy\sparse\compressed.py", line 940
 toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "C:\Anaconda2\lib\site-packages\scipy\sparse\coo.py", line 250, in to
y
    B = self._process_toarray_args(order, out)
  File "C:\Anaconda2\lib\site-packages\scipy\sparse\base.py", line 817, in _
ess_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

Any ideas?

Upvotes: 5

Views: 3608

Answers (1)

AboAli Almusawi

Reputation: 43

Use KNN with a KD tree:

model = KNeighborsClassifier(n_neighbors=5,algorithm='kd_tree').fit(X_train, Y_train)

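The default is algorithm='auto', and your traceback shows it ending up on the brute-force path (pairwise_distances). Brute force takes too much memory because it builds a dense matrix of distances between every test row and every training row. I think for your model it should look like this: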

nbrs = NearestNeighbors(n_neighbors=5,algorithm='kd_tree').fit(vec.transform(data['clean_sum']))
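
To get a feel for why the brute-force path fails, here is a rough estimate of the dense matrix that np.zeros(self.shape, ...) at the bottom of the traceback tries to allocate. The row counts below are hypothetical (the question only gives file sizes), and dense_distance_matrix_bytes is just an illustrative helper name, not part of scikit-learn:

```python
import numpy as np

def dense_distance_matrix_bytes(n_queries, n_train, dtype=np.float64):
    """Bytes needed for a dense (n_queries, n_train) matrix of the given dtype."""
    return n_queries * n_train * np.dtype(dtype).itemsize

# Hypothetical row counts, NOT taken from the question.
n_train, n_test = 300000, 100000

gib = dense_distance_matrix_bytes(n_test, n_train) / 2.0 ** 30
print("{:.1f} GiB for the full test-by-train matrix".format(gib))
```

With counts anywhere near that size, the result is far beyond 8 GB, no matter how sparse the input vectors are.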

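One caveat, as far as I know: scikit-learn's tree-based algorithms need dense input, and with a sparse matrix such as CountVectorizer output it warns and falls back to brute force. If that happens, a workaround I would try (my own suggestion, not something tested on your data) is to query the test set in batches, so that only a slice of the distance matrix exists at any time. A minimal sketch, where kneighbors_in_batches and batch_size are names I made up:

```python
import numpy as np

def kneighbors_in_batches(nbrs, X, batch_size=1000):
    """Call nbrs.kneighbors on consecutive row slices of X and stack the results.

    nbrs is any fitted object with a kneighbors(X) method that returns
    (distances, indices); X may be a dense array or a CSR sparse matrix.
    Assumes X has at least one row.
    """
    distances, indices = [], []
    for start in range(0, X.shape[0], batch_size):
        # Only batch_size rows of the distance matrix are materialised here.
        d, i = nbrs.kneighbors(X[start:start + batch_size])
        distances.append(d)
        indices.append(i)
    return np.vstack(distances), np.vstack(indices)
```

With the names from the question, this would be called as distances, indices = kneighbors_in_batches(nbrs, vectors). Lowering batch_size trades speed for a smaller peak memory footprint.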
Upvotes: 3
