Bruno Hanzen

Reputation: 381

Numpy - Clustering - Distance - Vectorisation

I have clustered a data sample (400k samples, dimension = 205, 200 clusters) using sklearn KMeans.

I want to know, for each cluster, the maximum distance between the cluster center and the most distant sample of the cluster, in order to learn something about the "size" of the cluster. Here is my code:

import numpy as np
import scipy.spatial.distance as spd

diam = np.empty(200)
for i in range(200):
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0] == i][:, 1:]).max()

"seed" are the cluster centers(200x206). The first column of "seed" contains the number of sample within the cluster (irrelevant here).

"data" are the samples (400kx206). The first column of the data contains the cluster number.

Question: this is done using a loop (not so "numpy"). Is it possible to "vectorise" it?

Upvotes: 4

Views: 398

Answers (3)

Daniel

Reputation: 19547

We can be a bit smarter about indexing and save roughly 4x in cost.

First let's build some data of the correct shape:

seed = np.random.randint(0, 100, (200, 206))
data = np.random.randint(0, 100, (400000, 206))
seed[:, 0] = np.arange(200)
data[:, 0] = np.random.randint(0, 200, 400000)
diam = np.empty(200)

Timing of the OP's original loop:

%%timeit
for i in range(200):
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0]==i][:,1:]).max()

1 loops, best of 3: 1.35 s per loop

moarningsun's answer:

%%timeit
seed_repeated = seed[data[:,0]]
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))
diam = np.zeros(len(seed))
np.maximum.at(diam, data[:,0], dist_to_center)

1 loops, best of 3: 1.33 s per loop

Divakar's answer:

%%timeit
data_sorted = data[data[:, 0].argsort()]
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)
diam_out = np.maximum.reduceat(dists,shift_idx)

1 loops, best of 3: 1.65 s per loop

As we can see, we're not really gaining anything with the vectorized solutions besides a larger memory footprint. To avoid this we return to the original answer, which is really the correct way to do these things, and instead attempt to reduce the amount of indexing:

%%timeit
idx = data[:,0].argsort()
bins = np.bincount(data[:,0])
counter = 0
for i in range(200):
    data_slice = idx[counter: counter+bins[i]]
    diam[i] = spd.cdist(seed[None, i, 1:], data[data_slice, 1:]).max()
    counter += bins[i]

1 loops, best of 3: 281 ms per loop

Double check the answer:

np.allclose(diam, diam_out)
True

This is the problem with assuming Python loops are always bad: they often are, but not in every situation.

Upvotes: 2

Divakar

Reputation: 221754

Here's one vectorized approach -

# Sort data w.r.t. col-0
data_sorted = data[data[:, 0].argsort()]

# Get counts of unique tags in col-0 of data and repeat seed accordingly. 
# Thus, we would have an extended version of seed that matches data's shape.
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)

# Get euclidean distances between extended seed version and sorted data
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))

# Get positions of shifts in col-0 of sorted data
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)

# The final piece of the puzzle is to get tag-based maximum values from dists,
# where each tag is a unique number in col-0 of data
diam_out = np.maximum.reduceat(dists,shift_idx)

Runtime tests and output verification -

Define functions:

def loopy_cdist(seed,data):  # from OP's solution
    N = seed.shape[0]
    diam = np.empty(N)
    for i in range(N):
        diam[i]=spd.cdist(seed[np.newaxis,i,1:], data[data[:,0]==i][:,1:]).max()
    return diam

def vectorized_repeat_reduceat(seed,data):  # from this solution
    data_sorted = data[data[:, 0].argsort()]
    seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)
    dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))
    shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)
    return np.maximum.reduceat(dists,shift_idx)

def vectorized_indexing_maxat(seed,data): # from @moarningsun's solution
    seed_repeated = seed[data[:,0]]
    dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))
    diam = np.zeros(len(seed))
    np.maximum.at(diam, data[:,0], dist_to_center)
    return diam

Verify outputs:

In [417]: # Inputs
     ...: seed = np.random.rand(20,20)
     ...: data = np.random.randint(0,20,(40000,20))
     ...: 

In [418]: np.allclose(loopy_cdist(seed,data),vectorized_repeat_reduceat(seed,data))
Out[418]: True

In [419]: np.allclose(loopy_cdist(seed,data),vectorized_indexing_maxat(seed,data))
Out[419]: True

Runtimes:

In [420]: %timeit loopy_cdist(seed,data)
10 loops, best of 3: 35.9 ms per loop

In [421]: %timeit vectorized_repeat_reduceat(seed,data)
10 loops, best of 3: 28.9 ms per loop

In [422]: %timeit vectorized_indexing_maxat(seed,data)
10 loops, best of 3: 24.1 ms per loop

Upvotes: 1

user2379410

Reputation:

Very similar idea to @Divakar's, but without having to sort:

seed_repeated = seed[data[:,0]]
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))

diam = np.zeros(len(seed))
np.maximum.at(diam, data[:,0], dist_to_center)

ufunc.at is known to be slow though, so it would be interesting to see which is faster.

Upvotes: 1
