Reputation: 4349
I am trying to calculate similarities between each pair of rows using numpy. Could you please advise how this can be done without a for loop?
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])  # input: 2 x 3 matrix
similarity_matrix = np.zeros([2, 2])  # output: 2 x 2 matrix
for i, row1 in enumerate(x):
    for j, row2 in enumerate(x):
        similarity_matrix[i, j] = my_similarity_func(row1, row2)  # this func returns a scalar
If my input is an n x 1 matrix, then the following works. Is there a way to achieve this when the input is an n x m matrix?
x = np.array([1, 2, 3])
similarity_matrix = my_similarity_func(*np.meshgrid(x, x))
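This works because np.meshgrid(x, x) returns two n x n arrays pairing every element with every other element, so my_similarity_func only needs to operate elementwise on arrays. For illustration, with a hypothetical elementwise similarity:

import numpy as np

x = np.array([1, 2, 3])
a, b = np.meshgrid(x, x)           # two 3 x 3 arrays covering all element pairings
similarity_matrix = np.abs(a - b)  # hypothetical elementwise similarity function
# [[0 1 2]
#  [1 0 1]
#  [2 1 0]]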
I am aware that there are libraries for calculating similarities, such as sklearn or scipy. There is also a fancy linear algebra way. But here I am simply wondering whether it is possible to replace this for loop.
Upvotes: 0
Views: 457
Reputation: 3147
A couple of options have been given for removing the for loops.
Assuming this is due to concerns about efficiency, I've provided some benchmarks.
Profiling this sort of thing is very dependent on what the function being called does and how large the array is.
Timing several of the methods given here (using np.dot as the similarity function) gives pretty similar results, with the for loop being surprisingly competitive.
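For reference, the scaffolding being timed is roughly the following (a sketch: test_using_for_loop wraps the question's double loop, and dist is scipy.spatial.distance):

import numpy as np
from numpy import array, sqrt, sum  # used bare in the snippets below
from itertools import product, starmap
from scipy.spatial import distance as dist

def my_similarity_func(a, b):
    # np.dot serves as the similarity function for this first round of timings
    return np.dot(a, b)

def test_using_for_loop(x):
    # the question's double loop, wrapped in a function for timing
    n = len(x)
    similarity_matrix = np.zeros((n, n))
    for i, row1 in enumerate(x):
        for j, row2 in enumerate(x):
            similarity_matrix[i, j] = my_similarity_func(row1, row2)
    return similarity_matrix

x = np.array([[1, 2, 3], [4, 5, 6]])

With that setup, the timings are: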
%timeit tmp=test_using_for_loop(x)
5.88 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit array([[my_similarity_func(r1, r2) for r1 in x] for r2 in x])
6.54 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit it = starmap(my_similarity_func, product(x, x)); similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
5.34 µs ± 364 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit dist.cdist(x,x,metric=my_similarity_func)
15 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
On the other hand, the data given is rather small. In many applications it is typical to compute a similarity metric on hundreds or thousands of samples. And after all, why optimize for a 2 by 3 matrix? Using larger data:
x = np.random.randn(3000, 150)
The results are:
%timeit tmp=test_using_for_loop(x)
5.69 s ± 54.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit array([[my_similarity_func(r1, r2) for r1 in x] for r2 in x])
5.17 s ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit it = starmap(my_similarity_func, product(x, x)); similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
3.74 s ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist.cdist(x,x,metric=my_similarity_func)
8.08 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There's also the question brought up by coldspeed and several other commenters: would it be better to optimize the similarity function rather than how it is called?
A custom similarity function will not be as optimized as np.dot.
So, using a purposefully worst-case (and absolutely useless) similarity function:
def my_similarity_func(a, b):
    calc1 = a.dot(b)
    calc2 = sqrt(abs(sum(a) + sum(b)))  # sqrt and sum here are numpy's (e.g. from numpy import sqrt, sum)
    calc3 = calc1**2 / calc2 + 1
    return calc3
What was a fairly substantial difference in performance almost disappears: the percent difference between the itertools method and basic looping is around 5 or 6% (still larger than expected, but not by much).
%timeit tmp=test_using_for_loop(x)
1min 11s ± 2.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit array([[my_similarity_func(r1, r2) for r1 in x] for r2 in x])
1min 7s ± 468 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit it = starmap(my_similarity_func, product(x, x)); similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
1min 7s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist.cdist(x,x,metric=my_similarity_func)
1min 8s ± 1.31 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In summary, there are several ways to remove the for loop, but performance-wise they will probably all be similar. If performance matters, it would be best to rewrite the similarity function in a way that can take advantage of broadcasting or other optimizations. Doing that to the worst-case similarity function here reduces the running time to a few hundred milliseconds:
%timeit x.dot(x.T)**2 / sqrt(abs(sum(x, 1)[:,None] + sum(x.T, 0))) + 1
128 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
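For readability, here is the same broadcast computation written out with explicit numpy names (the one-liner above assumes numpy's sum and sqrt are in scope):

import numpy as np

def vectorized_similarity(x):
    dots = x.dot(x.T)         # pairwise dot products of all rows: (n, n)
    row_sums = x.sum(axis=1)  # per-row sums: (n,)
    # broadcast (n, 1) + (n,) into the (n, n) matrix of pairwise row-sum totals
    denom = np.sqrt(np.abs(row_sums[:, None] + row_sums[None, :]))
    return dots**2 / denom + 1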
Upvotes: 1
Reputation: 14480
You can replace the for loop using itertools, which may be more efficient (I'm assuming efficiency is your actual goal):
from itertools import product, starmap
it = starmap(my_similarity_func, product(x, x))
similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
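A quick sanity check with the 2 x 3 input from the question, using np.dot as a stand-in similarity function:

import numpy as np
from itertools import product, starmap

def my_similarity_func(a, b):
    return np.dot(a, b)  # stand-in; any function returning a scalar works

x = np.array([[1, 2, 3], [4, 5, 6]])
it = starmap(my_similarity_func, product(x, x))
similarity_matrix = np.fromiter(it, float).reshape((len(x), len(x)))
print(similarity_matrix)
# [[14. 32.]
#  [32. 77.]]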
Upvotes: 1