Reputation: 381
I have clustered a data set (400k samples, dimension = 205, 200 clusters) using scikit-learn's KMeans.
For each cluster, I want to know the maximum distance between the cluster center and the most distant sample of the cluster, in order to learn something about the "size" of the cluster. Here is my code:
import numpy as np
import scipy.spatial.distance as spd

diam = np.empty(200)
for i in range(200):
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0] == i][:, 1:]).max()
"seed" are the cluster centers(200x206). The first column of "seed" contains the number of sample within the cluster (irrelevant here).
"data" are the samples (400kx206). The first column of the data contains the cluster number.
Question: this is done using a loop (not so "numpy"). Is it possible to "vectorise" it?
Upvotes: 4
Views: 398
Reputation: 19547
We can be a bit smarter about indexing and save roughly 4x in cost.
First let's build some data of the correct shape:
seed = np.random.randint(0, 100, (200, 206))
data = np.random.randint(0, 100, (400000, 206))
seed[:, 0] = np.arange(200)
data[:, 0] = np.random.randint(0, 200, 400000)
diam = np.empty(200)
Time of the original answer:
%%timeit
for i in range(200):
    diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0] == i][:, 1:]).max()
1 loops, best of 3: 1.35 s per loop
moarningsun's answer:
%%timeit
seed_repeated = seed[data[:,0]]
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))
diam = np.zeros(len(seed))
np.maximum.at(diam, data[:,0], dist_to_center)
1 loops, best of 3: 1.33 s per loop
Divakar's answer:
%%timeit
data_sorted = data[data[:, 0].argsort()]
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)
diam_out = np.maximum.reduceat(dists,shift_idx)
1 loops, best of 3: 1.65 s per loop
As we can see, we're not really gaining anything with the vectorized solutions besides a larger memory footprint. To avoid this we need to return to the original answer, which is really the correct way to do these things, and instead try to reduce the amount of indexing:
%%timeit
idx = data[:,0].argsort()
bins = np.bincount(data[:,0])
counter = 0
for i in range(200):
    data_slice = idx[counter: counter + bins[i]]
    diam[i] = spd.cdist(seed[None, i, 1:], data[data_slice, 1:]).max()
    counter += bins[i]
1 loops, best of 3: 281 ms per loop
Double check the answer:
np.allclose(diam, diam_out)
True
This is the problem with assuming Python loops are always bad: they often are, but not in every situation.
Upvotes: 2
Reputation: 221754
Here's one vectorized approach -
# Sort data w.r.t. col-0
data_sorted = data[data[:, 0].argsort()]
# Get counts of unique tags in col-0 of data and repeat seed accordingly.
# Thus, we would have an extended version of seed that matches data's shape.
seed_ext = np.repeat(seed,np.bincount(data_sorted[:,0]),axis=0)
# Get euclidean distances between extended seed version and sorted data
dists = np.sqrt(((data_sorted[:,1:] - seed_ext[:,1:])**2).sum(1))
# Get positions of shifts in col-0 of sorted data
shift_idx = np.append(0,np.nonzero(np.diff(data_sorted[:,0]))[0]+1)
# Final piece of puzzle is to get tag based maximum values from dists,
# where each tag is unique number in col-0 of data
diam_out = np.maximum.reduceat(dists,shift_idx)
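To make the reduceat step concrete, here is a tiny illustration with toy numbers (not the OP's data): the shift indices mark where each group starts, and np.maximum.reduceat takes the maximum within each group.

dists = np.array([3., 1., 7., 2., 5., 4.])
shift_idx = np.array([0, 2, 4])          # groups: [3, 1], [7, 2], [5, 4]
np.maximum.reduceat(dists, shift_idx)    # -> array([3., 7., 5.])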
Runtime tests and verify output -
Define functions :
def loopy_cdist(seed, data):  # from OP's solution
    N = seed.shape[0]
    diam = np.empty(N)
    for i in range(N):
        diam[i] = spd.cdist(seed[np.newaxis, i, 1:], data[data[:, 0] == i][:, 1:]).max()
    return diam
def vectorized_repeat_reduceat(seed, data):  # from this solution
    data_sorted = data[data[:, 0].argsort()]
    seed_ext = np.repeat(seed, np.bincount(data_sorted[:, 0]), axis=0)
    dists = np.sqrt(((data_sorted[:, 1:] - seed_ext[:, 1:])**2).sum(1))
    shift_idx = np.append(0, np.nonzero(np.diff(data_sorted[:, 0]))[0] + 1)
    return np.maximum.reduceat(dists, shift_idx)
def vectorized_indexing_maxat(seed, data):  # from @moarningsun's solution
    seed_repeated = seed[data[:, 0]]
    dist_to_center = np.sqrt(np.sum((data[:, 1:] - seed_repeated[:, 1:])**2, axis=1))
    diam = np.zeros(len(seed))
    np.maximum.at(diam, data[:, 0], dist_to_center)
    return diam
Verify outputs :
In [417]: # Inputs
...: seed = np.random.rand(20,20)
...: data = np.random.randint(0,20,(40000,20))
...:
In [418]: np.allclose(loopy_cdist(seed,data),vectorized_repeat_reduceat(seed,data))
Out[418]: True
In [419]: np.allclose(loopy_cdist(seed,data),vectorized_indexing_maxat(seed,data))
Out[419]: True
Runtimes :
In [420]: %timeit loopy_cdist(seed,data)
10 loops, best of 3: 35.9 ms per loop
In [421]: %timeit vectorized_repeat_reduceat(seed,data)
10 loops, best of 3: 28.9 ms per loop
In [422]: %timeit vectorized_indexing_maxat(seed,data)
10 loops, best of 3: 24.1 ms per loop
Upvotes: 1
Reputation:
Very similar idea to @Divakar's, but without having to sort:
seed_repeated = seed[data[:,0]]
dist_to_center = np.sqrt(np.sum((data[:,1:]-seed_repeated[:,1:])**2, axis=1))
diam = np.zeros(len(seed))
np.maximum.at(diam, data[:,0], dist_to_center)
ufunc.at is known to be slow though, so it would be interesting to see which one is faster.
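For clarity, a toy illustration of what the np.maximum.at call does (an unbuffered, in-place per-label maximum; the numbers are made up):

labels = np.array([0, 1, 0, 1, 2])
values = np.array([2., 5., 7., 1., 3.])
out = np.zeros(3)
np.maximum.at(out, labels, values)   # out[label] = max(out[label], value) for each pair
out                                  # -> array([7., 5., 3.])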
Upvotes: 1