Reputation: 679
I have a very large scipy sparse csr matrix. It is a 100,000x2,000,000 dimensional matrix. Let's call it X
. Each row is a sample vector in a 2,000,000 dimensional space.
I need to calculate the cosine distances between each pair of samples very efficiently. I have been using sklearn pairwise_distances
function with a subset of vectors in X
which gives me a dense matrix D: the square form of the pairwise distances which contains redundant entries. How can I use sklearn pairwise_distances
to get the condensed form directly? Please refer to http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html to see what the condensed form is. It is the output of scipy pdist
function.
I have memory limitations and I can't calculate the square form and then get the condensed form. Due to memory limitations, I also cannot use scipy pdist
as it requires a dense matrix X
which does not again fit in memory. I thought about looping through different chunks of X
and calculate the condensed form for each chunk and join them together to get the complete condensed form, but this is relatively cumbersome. Any better ideas?
Any help is much much appreciated. Thanks in advance.
Below is a reproducible example (of course for demonstration purposes X
is much smaller):
from scipy.sparse import rand
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances
X = rand(1000, 10000, density=0.01, format='csr')
dist1 = pairwise_distances(X, metric='cosine')
dist2 = pdist(X.A, 'cosine')
As you see dist2
is in the condensed form and is a 499500 dimensional vector. But dist1
is in the symmetric squareform and is a 1000x1000 matrix.
Upvotes: 1
Views: 3135
Reputation: 231665
I dug into the code for both versions, and think I understand what both are doing.
Start with a small simple X
(dense):
X = np.arange(9.).reshape(3,3)
pdist
cosine does:
norms = _row_norms(X)
_distance_wrap.pdist_cosine_wrap(_convert_to_double(X), dm, norms)
where _row_norms
is a row dot - using einsum
:
norms = np.sqrt(np.einsum('ij,ij->i', X,X)
So this is the first place where X
has to be an array.
I haven't dug into the cosine_wrap, but it appears to do (probably in cython)
xy = np.dot(X, X.T)
# or xy = np.einsum('ij,kj',X,X)
d = np.zeros((3,3),float) # square receiver
d2 = [] # condensed receiver
for i in range(3):
for j in range(i+1,3):
val=1-xy[i,j]/(norms[i]*norms[j])
d2.append(val)
d[j,i]=d[i,j]=val
print('array')
print(d)
print('condensed',np.array(d2))
from scipy.spatial import distance
d1=distance.pdist(X,'cosine')
print(' pdist',d1)
producing:
array
[[ 0. 0.11456226 0.1573452 ]
[ 0.11456226 0. 0.00363075]
[ 0.1573452 0.00363075 0. ]]
condensed [ 0.11456226 0.1573452 0.00363075]
pdist [ 0.11456226 0.1573452 0.00363075]
distance.squareform(d1)
produces the same thing as my d
array.
I can produce the same square array by dividing the xy
dot product with the appropriate norm
outer product:
dd=1-xy/(norms[:,None]*norms)
dd[range(dd.shape[0]),range(dd.shape[1])]=0 # clean up 0s
Or by normalizing X
before taking dot product. This appears to be what the scikit
version does.
Xnorm = X/norms[:,None]
1-np.einsum('ij,kj',Xnorm,Xnorm)
scikit
has added some cython code to do faster sparse calculations (beyond those provided by sparse.sparse
, but using the same csr
format):
from scipy import sparse
Xc=sparse.csr_matrix(X)
# csr_row_norm - pyx of following
cnorm = Xc.multiply(Xc).sum(axis=1)
cnorm = np.sqrt(cnorm)
X1 = Xc.multiply(1/cnorm) # dense matrix
dd = 1-X1*X1.T
To get a fast condensed form with sparse matrices I think you need to implement a fast condensed version of X1*X1.T
. That means you need to understand how the sparse matrix multiplication is implemented - in c
code. The scikit
cython 'fast sparse' code might also give ideas.
numpy
has some tri...
functions which are straight forward Python code. It does not attempt to save time or space by implementing tri calculations directly. It's easier to iterate over the rectangular layout of a nd array (with shape and strides) than to do the more complex variable length steps of a triangular array. The condensed form only cuts the space and calculation steps by half.
============
Here's the main part of the c
function pdist_cosine
, which iterates over i
and the upper j
, calculating dot(x[i],y[j])/(norm[i]*norm[j])
.
for (i = 0; i < m; i++) {
for (j = i + 1; j < m; j++, dm++) {
u = X + (n * i);
v = X + (n * j);
cosine = dot_product(u, v, n) / (norms[i] * norms[j]);
if (fabs(cosine) > 1.) {
/* Clip to correct rounding error. */
cosine = npy_copysign(1, cosine);
}
*dm = 1. - cosine;
}
}
https://github.com/scipy/scipy/blob/master/scipy/spatial/src/distance_impl.h
Upvotes: 4