PyRsquared

Reputation: 7338

Vectorize hellinger for sparse matrix - NumPy / Python

I am trying to find the Hellinger distance between a single distribution p and every row of a sparse matrix dist_mat. I want to return a vector of dimension 1*N where N is the number of rows in dist_mat.

import numpy as np
from scipy.sparse import csr_matrix

def hellinger(p, dist_mat):
    return np.sqrt(1/2) * np.sqrt(np.sum((np.sqrt(p) - np.sqrt(dist_mat))**2))

Using the function above, if we try out a test case:

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csr_matrix((data, (row, col)), shape=(3, 3)).toarray()  # dense view of the test matrix
test = np.array([0,21,0])
hellinger(test,csr_matrix((data, (row, col)), shape=(3, 3)))
>>> 4.3633103660024926

which returns a scalar, not a vector. So for the example above I want a list of results containing the hellinger distances. Something like:

hellinger(test,csr_matrix((data, (row, col)), shape=(3, 3)))
>>> [3.46,3.46,2.78] # hellinger distance between test and each row of the csr sparse matrix

Is there some way I can return the desired vector of distances using numpy notation, perhaps using the np.apply_along_axis method? I have seen this done before, but can't seem to get it here. Thanks in advance.

NOTE: I want to avoid explicit for loops as these would be inefficient. I am looking for the most optimized / fastest way to do this.
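
For concreteness, a dense, broadcasted version produces the vector I'm after (fine for this toy example, but it densifies the matrix, which is exactly what I want to avoid at scale):

dense = csr_matrix((data, (row, col)), shape=(3, 3)).toarray()
np.sqrt(1/2) * np.sqrt(((np.sqrt(test) - np.sqrt(dense))**2).sum(axis=1))
# roughly [3.46, 3.46, 2.78], matching the desired output above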

Upvotes: 3

Views: 716

Answers (2)

PyRsquared

Reputation: 7338

Basing my answer on @Divakar's...

If the distributions $P$ and $Q$ fed into the Hellinger function are normalized (so $\sum_i P_i = \sum_i Q_i = 1$, which they are in my case), then the function simplifies to

$\sqrt{1 - \sum_{i=1}^{k} \sqrt{P_i Q_i}}$

This works even if $Q$ is a matrix of row-wise distribution vectors.
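
For reference, this follows directly from the original definition once both distributions sum to one:

$\sqrt{\frac{1}{2}\sum_i (\sqrt{P_i} - \sqrt{Q_i})^2} = \sqrt{\frac{1}{2}\left(\sum_i P_i + \sum_i Q_i - 2\sum_i \sqrt{P_i Q_i}\right)} = \sqrt{1 - \sum_i \sqrt{P_i Q_i}}$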

So we can write

def hellinger(p,dist_mat): 
    out = np.sqrt(1 - np.sqrt(dist_mat*p.T).toarray())
    return out.T[0]

Some points

  • I use .toarray() because we cannot subtract a sparse matrix from a scalar - we would get the error NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
  • I need to take the dot product between dist_mat and p.T, the transpose of p, so dimensions match
  • I return out transposed and take its first element, to output the desired vector
  • This works for sparse inputs p and dist_mat (a minimal usage sketch follows below)
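
A minimal usage sketch for the function above, with made-up, row-normalized values (the simplified formula assumes normalized distributions):

import numpy as np
from scipy.sparse import csr_matrix

Q = csr_matrix(np.array([[0.5, 0.0, 0.5],
                         [0.0, 0.0, 1.0],
                         [0.2, 0.3, 0.5]]))   # each row sums to 1
p = csr_matrix(np.array([[0.1, 0.8, 0.1]]))   # sparse 1xN distribution

hellinger(p, Q)   # one value per row of Q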

Upvotes: 0

Divakar

Reputation: 221634

Vectorized solution

Here's the final vectorized solution that I arrived at through a few optimizations and one crucial trick, assuming s is the input sparse matrix of type csr_matrix.

k1 = np.sqrt(1/2)
k2s = np.sqrt(test.dot(test))
out = k1*np.sqrt(k2s + s.sum(1).A1 -2*np.sqrt(s*test))
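
For reference, the setup assumed by that snippet, taken from the question:

import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
s = csr_matrix((data, (row, col)), shape=(3, 3))   # input sparse matrix
test = np.array([0, 21, 0])                        # query distribution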

Playing back the history

The final vectorized solution was reached after a series of optimizations that I will try to play back here, for my own and others' reference. I apologize for being verbose, but I feel it's needed.

Stage #1

Starting off by in-lining the function definition in a loop :

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] = np.sqrt(1/2) * np.sqrt(  np.sum((np.sqrt(test) - np.sqrt(ai))**2)  )

Stage #2

Get the constants out and perform the square root outside the loop :

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] = np.sum((k2 - np.sqrt(ai))**2)

out = np.sqrt(out)
out *= k1

Stage #3 (Crucial trick)

The crucial trick here is to use the algebraic identity :

(A-B)**2 = A**2 + B**2 - 2*A*B

Thus,

sum((A-B)**2) = sum(A**2) + sum(B**2) - 2*sum(A*B)

The last part, sum(A*B), is simply a matrix multiplication, and that's the major performance booster here.
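
As a quick, illustrative sanity check of that identity on one dense row (relying on the s and test set up earlier; not part of the derivation itself):

A = np.sqrt(test)                      # dense query vector
B = np.sqrt(s[0].toarray()).ravel()    # first row of s, densified
lhs = ((A - B)**2).sum()
rhs = (A**2).sum() + (B**2).sum() - 2*A.dot(B)
np.isclose(lhs, rhs)   # True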

With that, the Stage #2 loop simplifies to :

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] = (k2**2).sum() + ((np.sqrt(ai))**2).sum() - 2*np.sqrt(ai).dot(k2)

out = np.sqrt(out)
out *= k1

Further simplifies to :

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] = (k2**2).sum() + ai.sum() -2*np.sqrt(ai).dot(k2)

out = np.sqrt(out)
out *= k1

Stage #4

Get the constant (k2**2).sum() out of the loop, and move the row-wise summation of the sparse matrix out too :

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)
k2s = (k2**2).sum()

N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
    ai = s[i].toarray()
    out[i] =  -2*np.sqrt(ai).dot(k2)

out += k2s + s.sum(1).A1 # row-wise summation of sparse matrix added here
out = np.sqrt(out)
out *= k1

Stage #5

The final trick is to remove the loop altogether. In the loop, each output element is computed with np.sqrt(s[i]).dot(k2). That matrix multiplication can be done across all rows at once with simply : np.sqrt(s)*k2. That's all!

What remains is :

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)
k2s = (k2**2).sum()

out = -2*np.sqrt(s)*k2 # Loop gone here
out += k2s + s.sum(1).A1
out = np.sqrt(out)
out *= k1

Using the inner dot product to get k2s, that simplifies to -

k1 = np.sqrt(1/2)
k2 = np.sqrt(test)
k2s = k2.dot(k2)
out = k1*np.sqrt(k2s + s.sum(1).A1 -2*np.sqrt(s)*k2)

We could avoid the square-root computation for test to get k2 and thus further simplify things like so (note that this shortcut is only exact when test has a single nonzero entry, as in the example; for a general test, stick with the previous version) -

k1 = np.sqrt(1/2)
k2s = np.sqrt(test.dot(test))
out = k1*np.sqrt(k2s + s.sum(1).A1 -2*np.sqrt(s*test))
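
As an illustrative check, the vectorized out can be compared against the question's original per-row definition (the loop here is only for verification, using s and test as set up earlier):

expected = np.array([np.sqrt(1/2) * np.sqrt(np.sum((np.sqrt(test) - np.sqrt(s[i].toarray()))**2))
                     for i in range(s.shape[0])])
np.allclose(out, expected)   # True for this example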

Upvotes: 2
