Reputation: 7338
I am trying to find the Hellinger distance between a single distribution p
and every row of a sparse matrix dist_mat
. I want to return a vector of dimension 1*N where N is the number of rows in dist_mat
.
def hellinger(p, dist_mat):
return np.sqrt(1/2) * np.sqrt( np.sum((np.sqrt(p) - np.sqrt(dist_mat))**2) )
Using the function above, if we try out a test case:
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csr_matrix((data, (row, col)), shape=(3, 3)).toarray()
test = np.array([0,21,0])
hellinger(test,csr_matrix((data, (row, col)), shape=(3, 3)))
>>> 4.3633103660024926
which returns a scalar, not a vector. So for the example above I want a list of results containing the hellinger distances. Something like:
hellinger(test,csr_matrix((data, (row, col)), shape=(3, 3)))
>>> [3.46,3.46,2.78] # hellinger distance between test and each row of the csr sparse matrix
Is there some way I can return the desired vector of distances using numpy notation, perhaps using the np.apply_along_axis method? I have seen this done before, but can't seem to get it here. Thanks in advance.
NOTE: I want to avoid explicit for loops as these would be inefficient. I am looking for the most optimized / fastest way to do this.
Upvotes: 3
Views: 716
Reputation: 7338
Basing my answer on @Divakar's...
If the distributions $P$ and $Q$ that we are inputting into the Hellinger function are normalized (so $\sum_i P_i = 1$) (which they are in my case), then the function simplifies to
$\sqrt{1 - \sum_i^k \sqrt{P_iQ_i}}$
This works even if $Q$ is a matrix of row wise distribution vectors.
So we can write
def hellinger(p,dist_mat):
out = np.sqrt(1 - np.sqrt(dist_mat*p.T).toarray())
return out.T[0]
Some points
.toarray()
because I cannot subtract a scalar from a sparse vector - we would get the error NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
dist_mat
and p.T
, the transpose of p
, so dimensions matchout
transposed and it's first element, to output the desired vectorp
and dist_mat
Upvotes: 0
Reputation: 221634
Here's the final vectorized solution that I arrived at through few optimizations and one crucial trick, assuming s
as the input sparse matrix of type csr_matrix
.
k1 = np.sqrt(1/2)
k2s = np.sqrt(test.dot(test))
out = k1*np.sqrt(k2s + s.sum(1).A1 -2*np.sqrt(s*test))
The final vectorized solution was reached after a series of optimizations that I would try to playback for my and others reference and I apologize for being verbose here, but I feel that's needed.
Stage #1
Starting off with in-ling the func defintiion in a loop :
N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
ai = s[i].toarray()
out[i] = np.sqrt(1/2) * np.sqrt( np.sum((np.sqrt(test) - np.sqrt(ai))**2) )
Stage #2
Get the constants out and perform squared root outside :
k1 = np.sqrt(1/2)
k2 - np.sqrt(test)
N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
ai = s[i].toarray()
out[i] = np.sum((k2 - np.sqrt(ai))**2)
out = np.sqrt(out)
out *= k1
Stage #3 (Crucial trick)
Crucial trick here as we would use the math formula :
(A-B)**2 = A**2) + B**2 - 2*A*B
Thus,
sum((A-B)**2) = sum(A**2) + sum(B**2) - 2*sum(A*B)
The last part sum(A*B)
is simply matrix multiplication and that's major performance booster here.
Simplifies to :
k1 = np.sqrt(1/2)
k2 - np.sqrt(test)
N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
ai = s[i].toarray()
out[i] = (k2**2).sum() + (np.sqrt(ai))**2).sum() -2*np.sqrt(ai).dot(k2)
out = np.sqrt(out)
out *= k1
Further simplifies to :
k1 = np.sqrt(1/2)
k2 - np.sqrt(test)
N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
ai = s[i].toarray()
out[i] = (k2**2).sum() + ai.sum() -2*np.sqrt(ai).dot(k2)
out = np.sqrt(out)
out *= k1
Stage #4
Get the constant (k2**2).sum()
out and get the row-wise summation of sparse matrix out too :
k1 = np.sqrt(1/2)
k2 - np.sqrt(test)
k2s = (k2**2).sum()
N = s.shape[0]
out = np.zeros(N)
for i in range(s.shape[0]):
ai = s[i].toarray()
out[i] = -2*np.sqrt(ai).dot(k2)
out += k2s + s.sum(1).A1 # row-wise summation of sparse matrix added here
out = np.sqrt(out)
out *= k1
Stage #5
The final trick is to remove the loop altogether. So, in the loop we have each output element being computed with np.sqrt(s[i]).dot(k2)
. That matrix-multiplcation could be done across all rows with simply : np.sqrt(s)*k2
. That's all!
The remains would be :
k1 = np.sqrt(1/2)
k2 - np.sqrt(test)
k2s = (k2**2).sum()
out = -2*np.sqrt(s)*k2 # Loop gone here
out += k2s + s.sum(1).A1
out = np.sqrt(out)
out *= k1
That simplifies to after using inner
dot product to get k2s
-
k1 = np.sqrt(1/2)
k2 = np.sqrt(test)
k2s = k2.dot(k2)
out = k1*np.sqrt(k2s + s.sum(1).A1 -2*np.sqrt(s)*k2)
We could avoid the square-root computation for test
to get k2
and thus further simplify things like so -
k1 = np.sqrt(1/2)
k2s = np.sqrt(test.dot(test))
out = k1*np.sqrt(k2s + s.sum(1).A1 -2*np.sqrt(s*test))
Upvotes: 2