Sanathana

Reputation: 284

How to resolve the "setting an array element with a sequence" error while performing K-Means clustering?

Hello everyone, I am performing k-means clustering on data in a text file that has about 50k samples, each with 128 dimensions.

Example of my input:

[1,1,0,0,0,0,1,0,24,3,0,0,0,0,86,149,149,14,0,0,0,0,32,149,46,16,0,0,1,13,3,33,65,66,0,0,0,0,0,2,149,140,6,0,0,2,62,148,88,24,26,2,0,14,116,148,30,15,1,0,0,1,5,30,56,18,0,0,0,0,0,4,149,46,40,14,0,0,1,34,31,46,149,31,0,2,9,12,1,7,8,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,12,2,0,0,0,0,0,0,0,0,0,0,0,0]

(likewise 50k samples)

When I use about 20-30 lines of this input with this code,

from sklearn.cluster import MiniBatchKMeans
import sys
import numpy

# Each line of the file holds one sample written as a Python list literal.
with open("sample_input.txt", "r") as f:
    out = [eval(arr) for arr in f.readlines()]

mbkm = MiniBatchKMeans(init='k-means++', n_clusters=50, batch_size=50,
                       n_init=10, max_no_improvement=10, verbose=0)
mbkm.fit(out)
mbk_means_cluster_centers = mbkm.cluster_centers_

# Print the full array of cluster centers without truncation.
numpy.set_printoptions(threshold=sys.maxsize)
print(mbk_means_cluster_centers)

I get the output. But when I use the entire file (whether it has a .txt or a .csv extension), I get the error "setting an array element with a sequence".

Since my code works for 20-30 lines, why does it not work for 50k lines of input? I am assuming that converting the text file to csv just means renaming the file with a .csv extension.

My main question is how to get this code running for 50k lines of input. Only once this is resolved can I run it on another data set, which has about 300,000 lines of input. Please help. Thanks in advance!

PS: I am coding in Python 2.7 on Ubuntu.

Upvotes: 1

Views: 489

Answers (1)

Jamie Bull

Reputation: 13529

It looks like you have two or more lists on a single line somewhere, which means you're trying to evaluate two or more arrays (a sequence) as a single sample. When I test this with two arrays separated by a comma on one line, I get the same error as you.
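
To illustrate that failure mode, here is a minimal sketch. The sample strings are made up and much shorter than the real 128-value rows, and `ast.literal_eval` is swapped in for `eval` as a safer parser for plain literals. A line holding two lists parses as a tuple of lists rather than one flat list, so that row no longer has the same shape as the others:

```python
import ast

good_line = "[1, 0, 24, 3]"
bad_line = "[1, 0, 24, 3],[0, 0, 86, 149]"  # two lists on one line

good = ast.literal_eval(good_line)
bad = ast.literal_eval(bad_line)

# The good line is a flat list of numbers.
print("%s with %d items" % (type(good).__name__, len(good)))
# The bad line is a tuple whose items are themselves lists.
print("%s with %d items" % (type(bad).__name__, len(bad)))
```

Presumably, once such a row is mixed in with flat rows, the data can no longer be converted to a regular 2-D numeric array, which would explain the error.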

Try this to find the error:

# Report every line whose parsed value does not have exactly 128 elements.
with open("sample_input.txt", "r") as f:
    for n, line in enumerate(f, 1):
        if len(eval(line)) != 128:
            print("Error is on line %s" % n)
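
If the check itself crashes on a blank or malformed line, a more defensive variant may help. The following is only a sketch: `find_bad_lines` and its `expected_len` parameter are hypothetical names, and it uses `ast.literal_eval` instead of `eval` so that only plain literals are parsed. It collects a reason for each suspicious line instead of stopping at the first one, and it should run on both Python 2.7 and Python 3:

```python
import ast

def find_bad_lines(lines, expected_len=128):
    """Return (line_number, reason) pairs for lines that are not
    a single flat list of expected_len numbers."""
    problems = []
    for n, line in enumerate(lines, 1):
        text = line.strip()
        if not text:
            problems.append((n, "blank line"))
            continue
        try:
            value = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            problems.append((n, "cannot be parsed"))
            continue
        if not isinstance(value, list):
            problems.append((n, "not a single list"))
        elif len(value) != expected_len:
            problems.append((n, "has %d elements" % len(value)))
        elif not all(isinstance(x, (int, float)) for x in value):
            problems.append((n, "contains a non-numeric element"))
    return problems

# Small made-up sample with three values per row instead of 128.
sample = ["[1, 2, 3]\n", "[1, 2],[3, 4]\n", "[1, 2]\n",
          "\n", "[1, [2], 3]\n", "[1, 2, 3\n"]
for n, reason in find_bad_lines(sample, expected_len=3):
    print("line %d: %s" % (n, reason))
```

For the real file, the same function could be called on the open file object, for example `find_bad_lines(f)` inside a `with open("sample_input.txt") as f:` block, leaving `expected_len` at 128.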

Otherwise, I suggest "divide and conquer". If you split the data in half and one half has a problem, split that half again and keep going until you are left with only a small chunk of the file that contains the problem. The problem may be in more than one place, so this could take a while, but it still seems like the best approach if the cause is not what I suggested.
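
The halving can also be automated. Below is a rough sketch, not a definitive implementation: `find_bad_rows` and `chunk_ok` are hypothetical names, and `chunk_ok` stands for whatever check reproduces the failure on a chunk of rows. The example uses a simple length check on made-up four-value rows:

```python
def find_bad_rows(rows, chunk_ok, offset=0):
    """Return the indices of rows that make chunk_ok fail,
    found by repeatedly splitting failing chunks in half."""
    if not rows or chunk_ok(rows):
        return []
    if len(rows) == 1:
        return [offset]
    mid = len(rows) // 2
    return (find_bad_rows(rows[:mid], chunk_ok, offset)
            + find_bad_rows(rows[mid:], chunk_ok, offset + mid))

def all_rows_have_four_values(chunk):
    # Stand-in check for this example; real rows would have 128 values.
    return all(isinstance(row, list) and len(row) == 4 for row in chunk)

rows = [[1, 2, 3, 4], [1, 2, 3, 4], ([1, 2], [3, 4]),
        [1, 2, 3, 4], [1, 2, 3], [1, 2, 3, 4]]
print(find_bad_rows(rows, all_rows_have_four_values))  # prints [2, 4]
```

One limitation: this only finds problems that individual rows cause on their own. If each half passes the check but the combination fails (for example, two halves with consistent but different row lengths), the sketch returns an empty list even though the whole data set is bad.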

Upvotes: 2
