somedude1234

Reputation: 55

Getting a memory error when using sklearn.cluster KMeans

As the title states, I'm getting a memory error when I try to use kmeans.fit().

The data set I'm using has size:

print(np.size(np_list)): 1248680000
print(np_list.shape): (31217, 40000)
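
A rough back-of-the-envelope estimate of what that means in memory (my own math, assuming the array ends up as 8-byte float64, which is what the float conversion in the traceback below would produce):

n_rows, n_cols = 31217, 40000
bytes_per_value = 8                      # assumption: float64
total_bytes = n_rows * n_cols * bytes_per_value
print(total_bytes / 1024**3)             # roughly 9.3 GiB for a single copy

The X.copy(...) call in the traceback suggests a second full-size copy is being allocated on top of the original array when it fails.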

The code I'm running that gives me the memory error is:

import pickle

from sklearn.cluster import KMeans

# load the (31217, 40000) array of flattened images
with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

kmeans = KMeans(n_clusters=5)
kmeans.fit(np_list)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

I'm working with a data set of about 32k images, each of which is black and white and was originally 200x200. I flattened each 200x200 image into a single 40,000-element row in row-major order.
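
For reference, a minimal sketch of that flattening step (the array contents and names here are illustrative, not my exact code):

import numpy as np

# hypothetical stack of 32k black-and-white 200x200 images
images = np.random.randint(0, 2, size=(31217, 200, 200), dtype=np.uint8)

# flatten each 200x200 image into a 40,000-element row (row-major / C order)
np_list = images.reshape(images.shape[0], -1)
print(np_list.shape)   # (31217, 40000)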

Description of traceback:

Traceback (most recent call last):
  File "C:/Project/ML_Clustering.py", line 54, in <module>
    kmeans.fit(np_list)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\k_means_.py", line 896, in fit
    return_n_iter=True)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\k_means_.py", line 283, in k_means
    X = as_float_array(X, copy=copy_x)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 88, in as_float_array
    return X.copy('F' if X.flags['F_CONTIGUOUS'] else 'C') if copy else X
MemoryError

Upvotes: 1

Views: 3697

Answers (1)

Anubhav Singh

Reputation: 8699

The classic implementation of the KMeans clustering method is based on Lloyd's algorithm, which consumes the whole input data set at each iteration. You can try sklearn.cluster.MiniBatchKMeans instead, which does incremental updates of the center positions using mini-batches. For large-scale learning (say n_samples > 10k), MiniBatchKMeans is probably much faster than the default batch implementation.

import pickle

from sklearn.cluster import MiniBatchKMeans

with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

mbk = MiniBatchKMeans(init='k-means++', n_clusters=5,
                      batch_size=200,
                      max_no_improvement=10, verbose=0)

mbk.fit(np_list)

Read more about MiniBatchKMeans in the scikit-learn documentation.
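
If holding np_list next to a converted copy is still too much, MiniBatchKMeans also exposes partial_fit, so you can feed it slices of rows one at a time. A minimal sketch, assuming the pickled array itself fits in memory; the chunk size of 1000 rows is just an illustrative choice:

import pickle

import numpy as np
from sklearn.cluster import MiniBatchKMeans

with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

mbk = MiniBatchKMeans(init='k-means++', n_clusters=5, batch_size=200)

# train on chunks of rows so only one slice is converted/copied at a time
chunk_size = 1000
for start in range(0, np_list.shape[0], chunk_size):
    mbk.partial_fit(np_list[start:start + chunk_size])

print(mbk.cluster_centers_)

# labels for every row, also computed chunk by chunk
labels = np.concatenate([mbk.predict(np_list[start:start + chunk_size])
                         for start in range(0, np_list.shape[0], chunk_size)])
print(labels)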

Upvotes: 2
