kamalbanga

Reputation: 2011

Using scipy's kmeans2 function in python

I found this example of using the kmeans2 algorithm in Python. I don't understand the following part:

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

# whiten them
z = whiten(z)

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

The points are zip(xy[:,0],xy[:,1]), so what is the third value z doing here?

Also, what is whitening?

Any explanation is appreciated. Thanks.

Upvotes: 5

Views: 3544

Answers (1)

askewchan

Reputation: 46530

First:

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

The weirdest thing about this is that it's equivalent to:

z = numpy.sin(0.8*xy[:, 1])

So I don't know why it's written that way. Maybe there's a typo?
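A quick numerical check of that equivalence, as a minimal sketch. The `xy` array here is a hypothetical stand-in (30 random 2-D points), since the original example's data isn't shown:

```python
import numpy as np

# Hypothetical stand-in for the example's xy array: 30 random 2-D points.
rng = np.random.default_rng(0)
xy = rng.random((30, 2))

# The expression as written in the example...
z_original = np.sin(xy[:, 1] - 0.2 * xy[:, 1])
# ...and the simplified form, since y - 0.2*y == 0.8*y.
z_simplified = np.sin(0.8 * xy[:, 1])

# True if the two agree up to floating-point rounding.
same = np.allclose(z_original, z_simplified)
print(same)
```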

Next,

# whiten them
z = whiten(z)

Whitening here simply rescales the data to unit variance: vq.whiten divides each feature by its standard deviation. Here is a demo:

>>> z = np.sin(.8*xy[:, 1])      # the original z
>>> zw = vq.whiten(z)            # save it under a different name
>>> zn = z / z.std()             # make another 'normalized' array
>>> map(np.std, [z, zw, zn])     # standard deviations of the three arrays
[0.42645, 1.0, 1.0]
>>> np.allclose(zw, zn)          # whitened is the same as normalized
True
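On a 2-D array, `vq.whiten` applies the same rescaling to each column separately. The sketch below uses made-up data (the array `features` and its two spreads are assumptions for illustration) to show two features with very different scales ending up with unit standard deviation:

```python
import numpy as np
from scipy.cluster import vq

# Hypothetical data: two features whose spreads differ by a factor of ~50.
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.normal(0.0, 1.0, 200),
    rng.normal(0.0, 50.0, 200),
])

whitened = vq.whiten(features)

# Per-column standard deviations before and after whitening.
stds_before = features.std(axis=0)
stds_after = whitened.std(axis=0)
print(stds_before)
print(stds_after)
```

One plausible motivation, which the original example does not state, is that k-means relies on Euclidean distance, so a feature with a much larger spread would otherwise dominate the clustering.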

It's not obvious to me why it is whitened. Anyway, moving along:

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

Let's break that into two parts:

data = np.array(zip(xy[:, 0], xy[:, 1], z))

which is a weird (and slow) way of writing

data = np.column_stack([xy, z])

In any case, you start with two arrays and merge them into one:

>>> xy.shape
(30, 2)
>>> z.shape
(30,)
>>> data.shape
(30, 3)
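One caveat if you run this on Python 3: there `zip` returns an iterator, so `np.array(zip(...))` does not build the (30, 3) array unless the iterator is wrapped in `list(...)` first. A minimal sketch, again with a hypothetical random `xy`, comparing the two constructions:

```python
import numpy as np

# Hypothetical stand-in for the example's data.
rng = np.random.default_rng(0)
xy = rng.random((30, 2))
z = np.sin(0.8 * xy[:, 1])

# The zip-based construction, made Python 3 friendly with list(...).
data_zip = np.array(list(zip(xy[:, 0], xy[:, 1], z)))

# The direct construction: append z as a third column.
data_stack = np.column_stack([xy, z])

print(data_zip.shape, data_stack.shape)
print(np.array_equal(data_zip, data_stack))
```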

Then it's data that is passed to the kmeans algorithm:

res, idx = vq.kmeans2(data, 3)

So now you can see that 30 points in 3D space are passed to the algorithm, with z as the third coordinate of each point. The confusing part is only how the set of points was created.
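Putting the pieces together, here is a self-contained sketch of the whole pipeline. The random `xy` is a hypothetical substitute for the example's data, and `minit='points'` is my own choice (initial centroids are picked from the data itself), not something the original example specifies:

```python
import numpy as np
from scipy.cluster import vq

# Hypothetical stand-in for the example's 30 (x, y) points.
rng = np.random.default_rng(0)
xy = rng.random((30, 2))

# Third coordinate derived from y, then rescaled to unit variance.
z = vq.whiten(np.sin(0.8 * xy[:, 1]))

# 30 points in 3-D space: columns are x, y, z.
data = np.column_stack([xy, z])

# Cluster into k = 3 groups.
# res: the 3 centroids (one row each); idx: cluster label of each point.
res, idx = vq.kmeans2(data, 3, minit='points')

print(res.shape)  # one 3-D centroid per cluster
print(idx.shape)  # one label per input point
```

Because the initialization is random, the centroids and labels can differ from run to run.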

Upvotes: 9
