Reputation: 11787
Is it possible to effectively obtain the norm of a sparse vector in python?
I tried the following:
from scipy import sparse
from numpy.linalg import norm
vector1 = sparse.csr_matrix([ 0 for i in xrange(4000000) ], dtype = float64)
#just to test I set a few points to a value higher than 0
vector1[ (0, 10) ] = 5
vector1[ (0, 1500) ] = 80
vector1[ (0, 2000000) ] = 6
n = norm(t1)
but then I get the error:
ValueError: dimension mismatch
The norm function only works with arrays so probably that's why the csr_matrix is not working, but then I didn't find another way of computing the norm effectively. One possible solution would be to compute:
norm(asarray(vector1.todense()))
but then it kills the purpose of using sparse vectors at first. And as the last approach I could iterate through each element of the vector and compute manually the norm, but since efficiency is really important I was searching for something faster and easier to implement.
Thanks in advance for any help!
EDIT: I tried everything that was suggested and the best solution is:
(vector1.data ** 2).sum()
from Dougal. But the Cython solution is also very good and works better as the vector grows in number of elements different of zero. Thanks everyone for the help!
Upvotes: 1
Views: 3362
Reputation: 547
I just had the same problem here, I implemented a function in cython to increase the speed of this simple operation. I tested it with a 4M sparse vector of doubles with 100k non-zero elements. The method using sqrt(vector.multiply(vector).sum()) used 874us and my function 205us.
# sparseLib.pyx
#cython: boundscheck=False
from cython.parallel cimport prange
from cython.view cimport array as cvarray
import numpy as np
from libc.math cimport sqrt
cpdef double sparseNorm2(double [:] data) nogil:
cdef long i
cdef double value = 0.0
for i in xrange(data.shape[0]):
value += data[i]*data[i]
return sqrt(value)
Upvotes: 1
Reputation: 8975
vector1.data
directly. You can also use things like vector1.multiply(vector1)
plus .sum
or vector1.dot(vector1.T)
but as Dougal pointed out, that can be much slower for this simple case.Upvotes: 1
Reputation: 2661
I don't think your initialization is doing what you think it is.
For norm to work, you need to have a square array. If you are trying to make a square array with 4million elements, you want to do
csr_matrix( (2000,2000), dtype=float64)
full documentation for initialization at scipy
Upvotes: 0