JRun
JRun

Reputation: 679

How to get the condensed form of pairwise distances directly?

I have a very large scipy sparse csr matrix. It is a 100,000x2,000,000 dimensional matrix. Let's call it X. Each row is a sample vector in a 2,000,000 dimensional space.

I need to calculate the cosine distances between each pair of samples very efficiently. I have been using sklearn pairwise_distances function with a subset of vectors in X which gives me a dense matrix D: the square form of the pairwise distances which contains redundant entries. How can I use sklearn pairwise_distances to get the condensed form directly? Please refer to http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html to see what the condensed form is. It is the output of scipy pdist function.

I have memory limitations and I can't calculate the square form and then get the condensed form. Due to memory limitations, I also cannot use scipy pdist as it requires a dense matrix X which does not again fit in memory. I thought about looping through different chunks of X and calculate the condensed form for each chunk and join them together to get the complete condensed form, but this is relatively cumbersome. Any better ideas?

Any help is much much appreciated. Thanks in advance.

Below is a reproducible example (of course for demonstration purposes X is much smaller):

from scipy.sparse import rand
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances
X = rand(1000, 10000, density=0.01, format='csr')
dist1 = pairwise_distances(X, metric='cosine')
dist2 = pdist(X.A, 'cosine')

As you see dist2 is in the condensed form and is a 499500 dimensional vector. But dist1 is in the symmetric squareform and is a 1000x1000 matrix.

Upvotes: 1

Views: 3135

Answers (1)

hpaulj
hpaulj

Reputation: 231665

I dug into the code for both versions, and think I understand what both are doing.

Start with a small simple X (dense):

X = np.arange(9.).reshape(3,3)

pdist cosine does:

norms = _row_norms(X)
_distance_wrap.pdist_cosine_wrap(_convert_to_double(X), dm, norms)

where _row_norms is a row dot - using einsum:

norms = np.sqrt(np.einsum('ij,ij->i', X,X)

So this is the first place where X has to be an array.

I haven't dug into the cosine_wrap, but it appears to do (probably in cython)

xy = np.dot(X, X.T)
# or xy = np.einsum('ij,kj',X,X)

d = np.zeros((3,3),float)   # square receiver
d2 = []                     # condensed receiver
for i in range(3):
    for j in range(i+1,3):
         val=1-xy[i,j]/(norms[i]*norms[j])
         d2.append(val)
         d[j,i]=d[i,j]=val

print('array')
print(d)
print('condensed',np.array(d2))

from scipy.spatial import distance
d1=distance.pdist(X,'cosine')
print('    pdist',d1)

producing:

array
[[ 0.          0.11456226  0.1573452 ]
 [ 0.11456226  0.          0.00363075]
 [ 0.1573452   0.00363075  0.        ]]

condensed [ 0.11456226  0.1573452   0.00363075]
    pdist [ 0.11456226  0.1573452   0.00363075]

distance.squareform(d1) produces the same thing as my d array.

I can produce the same square array by dividing the xy dot product with the appropriate norm outer product:

dd=1-xy/(norms[:,None]*norms)
dd[range(dd.shape[0]),range(dd.shape[1])]=0 # clean up 0s

Or by normalizing X before taking dot product. This appears to be what the scikit version does.

Xnorm = X/norms[:,None]
1-np.einsum('ij,kj',Xnorm,Xnorm)

scikit has added some cython code to do faster sparse calculations (beyond those provided by sparse.sparse, but using the same csr format):

from scipy import sparse
Xc=sparse.csr_matrix(X)

# csr_row_norm - pyx of following
cnorm = Xc.multiply(Xc).sum(axis=1)
cnorm = np.sqrt(cnorm)
X1 = Xc.multiply(1/cnorm)  # dense matrix
dd = 1-X1*X1.T

To get a fast condensed form with sparse matrices I think you need to implement a fast condensed version of X1*X1.T. That means you need to understand how the sparse matrix multiplication is implemented - in c code. The scikit cython 'fast sparse' code might also give ideas.

numpy has some tri... functions which are straight forward Python code. It does not attempt to save time or space by implementing tri calculations directly. It's easier to iterate over the rectangular layout of a nd array (with shape and strides) than to do the more complex variable length steps of a triangular array. The condensed form only cuts the space and calculation steps by half.

============

Here's the main part of the c function pdist_cosine, which iterates over i and the upper j, calculating dot(x[i],y[j])/(norm[i]*norm[j]).

for (i = 0; i < m; i++) {
    for (j = i + 1; j < m; j++, dm++) {
        u = X + (n * i);
        v = X + (n * j);
        cosine = dot_product(u, v, n) / (norms[i] * norms[j]);
        if (fabs(cosine) > 1.) {
            /* Clip to correct rounding error. */
            cosine = npy_copysign(1, cosine);
        }
        *dm = 1. - cosine;
    }
}

https://github.com/scipy/scipy/blob/master/scipy/spatial/src/distance_impl.h

Upvotes: 4

Related Questions