Reputation: 5
I have a list of descriptors (arrays) and need to compute the cosine distance between them, which is a lot of computation: for a list with n vectors it takes n * (n - 1) / 2 operations. I need to do this for a large n, so how can I speed up this process using multiprocessing? I found some answers, but it is not clear to me how to apply them to my case.
The code that I want to speed up is the code below:
from scipy.spatial.distance import cosine as cosine_distance

n = len(all_vectors)
all_distances = []
for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]
        all_distances.append(cosine_distance(x1, x2))
[UPDATE]
I also need to do a little label check, so the original code is the following:
from scipy.spatial.distance import cosine as cosine_distance

n = len(all_vectors)
all_distances = []
all_label_checks = []
for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]
        all_distances.append(cosine_distance(x1, x2))

        label_x1 = all_labels[i]  # all_labels and all_distances are tied
        label_x2 = all_labels[j]
        all_label_checks.append(int(label_x1 == label_x2))
I timed some of the suggestions, and the best answer so far is from @DaniMesejo. Here is the code that I used to test each case:
import time
import numpy as np
from scipy.spatial.distance import pdist, cosine as cosine_distance, squareform
from itertools import combinations, starmap
EMBEDDINGS_SIZE = 128
N_EMBEDDINGS = 1024
def get_n_embeddings(n):
    return np.random.randn(n, EMBEDDINGS_SIZE)
embeddings = get_n_embeddings(N_EMBEDDINGS)
###################
# scipy pdist
###################
start = time.time()
pdist_distances = pdist(embeddings, 'cosine')
end = time.time()
print(f"Using scipy pdsit: {end - start}")
###################
# nested loop
###################
nested_distances = []
start = time.time()
for i in range(N_EMBEDDINGS):
    for j in range(i + 1, N_EMBEDDINGS):
        x1 = embeddings[i]
        x2 = embeddings[j]
        nested_distances.append(cosine_distance(x1, x2))
end = time.time()
print(f"Using nested loop: {end - start}")
###################
# combinations
###################
start = time.time()
combination_distances = []
for x1, x2 in combinations(embeddings, 2):
    combination_distances.append(cosine_distance(x1, x2))
end = time.time()
print(f"Using combination: {end - start}")
###################
# starmap
###################
start = time.time()
starmap_distances = list(starmap(cosine_distance, combinations(embeddings, 2)))
end = time.time()
print(f"Using starmap: {end - start}")
# note: pdist_class_distances is computed in the "pdist and class"
# section below, so run that section before these checks
print(np.allclose(pdist_distances, nested_distances))
print(np.allclose(pdist_distances, pdist_class_distances))
print(np.allclose(pdist_distances, combination_distances))
print(np.allclose(pdist_distances, starmap_distances))
And the results:
Using scipy pdist: 0.0303647518157959
Using pdist and class: 0.09841656684875488
Using nested loop: 13.415924549102783
Using combination: 13.093504428863525
Using starmap: 13.177483081817627
True
True
True
True
To solve my label problem I can also use pdist. I just need to transform my label list (which tells the label of each embedding) into a list of 2-D points and compute the pairwise distances. Pairs with the same label will have distance 0:
###################
# pdist and class
###################
# assign the same class id to each consecutive group of 4 embeddings
classes = []
count = 0
id_ = 0
for i in range(N_EMBEDDINGS):
    classes.append(id_)
    count += 1
    if count % 4 == 0:
        count = 0
        id_ += 1
labels_x = [(i, i) for i in classes]

start = time.time()
pdist_class_distances = pdist(embeddings, 'cosine')
pdist_labels = pdist(labels_x, 'euclidean')
pdist_labels = [1 if x == 0.0 else 0 for x in pdist_labels]
end = time.time()
print(f"Using pdist and class: {end - start}")
Upvotes: 0
Views: 188
Reputation: 61930
A possible approach is to use the scipy function pdist, for example:
from scipy.spatial.distance import pdist, cosine as cosine_distance, squareform
import numpy as np

def all_pairs_using_pdist(X):
    """Function using pdist"""
    distances = squareform(pdist(X, 'cosine'))
    rows, _ = distances.shape
    return list(distances[np.triu_indices(rows, 1)])

def all_pairs_for_loops(all_vectors):
    """Function using a nested for loop"""
    n = len(all_vectors)
    all_distances = []
    for i in range(n):
        for j in range(i + 1, n):
            x1 = all_vectors[i]
            x2 = all_vectors[j]
            all_distances.append(cosine_distance(x1, x2))
    return all_distances
For a matrix like the following:
Y = np.random.randn(100, 25)
It gives the timings:
%timeit all_pairs_using_pdist(Y)
630 µs ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all_pairs_for_loops(Y)
216 ms ± 78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
About 300 times faster. And of course both approaches yield the same result:
Y = np.random.randn(100, 25)
print(np.allclose(all_pairs_for_loops(Y), all_pairs_using_pdist(Y)))
Output
True
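As an aside, pdist already returns the condensed distance vector, which is exactly the upper triangle in row-major order, so the squareform/np.triu_indices round trip can be skipped; a minimal equivalent sketch:

from scipy.spatial.distance import pdist

def all_pairs_condensed(X):
    # the condensed output lists each pair (i, j) with i < j in
    # row-major order, matching the nested-loop ordering above
    return list(pdist(X, 'cosine'))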
Upvotes: 2
Reputation: 155674
How much this will help I can't say, but your nested loop is basically an attempt to make all unique 2-combinations of legal indices in order to get all unique 2-combinations of the values. There are built-ins to do this, so with from itertools import combinations, this:
for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]
could simplify to this:
for x1, x2 in combinations(all_vectors, 2):
You could even layer all this over itertools.starmap
to push the entire loop to the C layer (assuming cosine_distance
is itself implemented in C; if not, it just limits the upside, but still works fine), simplifying all of your posted code to:
from scipy.spatial.distance import cosine as cosine_distance
from itertools import combinations, starmap
all_distances = list(starmap(cosine_distance, combinations(all_vectors, 2)))
and running faster to boot (meaningfully faster if cosine_distance
is a relatively low cost operation implemented in C).
If that's not fast enough, odds are you'll have to use multiprocessing/concurrent.futures.ProcessPoolExecutor to parallelize (or maybe multiprocessing.dummy/concurrent.futures.ThreadPoolExecutor, if the scipy function is implemented in C and releases the GIL while running, allowing true thread parallelism, which is normally unavailable to CPU-bound threads on CPython due to the GIL).
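A minimal sketch of that parallel route with multiprocessing.Pool (the chunksize value is an arbitrary tuning guess, and the per-pair pickling overhead means this only pays off when cosine_distance dominates the runtime):

from itertools import combinations
from multiprocessing import Pool
from scipy.spatial.distance import cosine as cosine_distance

if __name__ == '__main__':
    with Pool() as pool:  # defaults to one worker per CPU core
        # chunksize batches many pairs per task to amortize IPC overhead
        all_distances = pool.starmap(cosine_distance,
                                     combinations(all_vectors, 2),
                                     chunksize=1024)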
Upvotes: 0