canzar

Reputation: 340

Error converting large sparse matrix to COO

I ran into the following issue trying to vstack two large CSR matrices:

    /usr/lib/python2.7/dist-packages/scipy/sparse/coo.pyc in _check(self)
    229                 raise ValueError('negative row index found')
    230             if self.col.min() < 0:
--> 231                 raise ValueError('negative column index found')
    232
    233     def transpose(self, copy=False):

ValueError: negative column index found

I can reproduce this error very simply by trying to convert a large lil matrix to a coo matrix. The following code works for N=10**9 but fails for N=10**10.

    from scipy import sparse
    from numpy import random
    N=10**10
    x = sparse.lil_matrix( (1,N) )
    for _ in xrange(1000):
        x[0,random.randint(0,N-1)]=random.randint(1,100)

    y = sparse.coo_matrix(x)

Is there a size limit I am hitting for coo matrices? Is there a way around this?

Upvotes: 11

Views: 6260

Answers (2)

DrV

Reputation: 23480

Interestingly, your second example runs well with my installation.

The error message `negative column index found` sounds like an overflow somewhere. I checked the newest source with the following results:

  • The actual indexing datatype is calculated in scipy.sparse.sputils.get_index_dtype
  • The error message comes from the module scipy.sparse.coo

The exception comes from this kind of code:

    idx_dtype = get_index_dtype(maxval=max(self.shape))
    self.row = np.asarray(self.row, dtype=idx_dtype)
    self.col = np.asarray(self.col, dtype=idx_dtype)
    self.data = to_native(self.data)

    if nnz > 0:
        if self.row.max() >= self.shape[0]:
            raise ValueError('row index exceeds matrix dimensions')
        if self.col.max() >= self.shape[1]:
            raise ValueError('column index exceeds matrix dimensions')
        if self.row.min() < 0:
            raise ValueError('negative row index found')
        if self.col.min() < 0:
            raise ValueError('negative column index found')

This is clearly an integer overflow, most likely at 2**31, the limit of a signed 32-bit index.

If you want to debug it, try:

    import scipy.sparse.sputils
    import numpy as np

    scipy.sparse.sputils.get_index_dtype((np.array(10**10),))

It should return int64. If it doesn't, the problem is there.

Which version of SciPy are you using?

Upvotes: 6

perimosocordiae

Reputation: 17787

Looks like you're hitting the limits of 32-bit integers. Here's a quick test:

    In [14]: np.array([10**9, 10**10], dtype=np.int64)
    Out[14]: array([ 1000000000, 10000000000])

    In [15]: np.array([10**9, 10**10], dtype=np.int32)
    Out[15]: array([1000000000, 1410065408], dtype=int32)

For now, most sparse matrix representations assume 32-bit integer indices, so they simply cannot support matrices that large.
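To see why the symptom is a negative index rather than just a wrong one, here is a minimal sketch of signed 32-bit wraparound in plain Python. The helper `wrap_int32` is hypothetical and written only for illustration; it is not a NumPy or SciPy function.

```python
def wrap_int32(value):
    """Return `value` as it would read after truncation to a signed 32-bit integer."""
    # Shift into the unsigned range, wrap modulo 2**32, then shift back.
    return (value + 2**31) % 2**32 - 2**31

small = wrap_int32(10**9)         # below 2**31, so it is unchanged
wrapped = wrap_int32(10**10)      # wraps to a positive but wrong value
negative = wrap_int32(3 * 10**9)  # lands in [2**31, 2**32), so it wraps negative

print(small, wrapped, negative)
```

Any column index between 2**31 and 2**32 wraps to a negative number. With 1000 random columns drawn from a range of 10**10, some are very likely to fall in that band, which would explain the `negative column index found` check firing.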

EDIT: As of version 0.14, SciPy supports 64-bit indexing. If you can upgrade, this problem will go away.
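After upgrading, one way to confirm that wide indices are in use is to build a matrix of the same shape and inspect the index dtype. This is a sketch that assumes a recent SciPy and NumPy on Python 3; the variable names and the fixed seed are illustrative choices, not taken from the question.

```python
import numpy as np
from scipy import sparse

N = 10**10
rng = np.random.default_rng(0)

# 1000 random column positions, explicitly 64-bit so nothing wraps on the way in.
cols = rng.integers(0, N, size=1000, dtype=np.int64)
rows = np.zeros(1000, dtype=np.int64)
vals = rng.integers(1, 100, size=1000)

y = sparse.coo_matrix((vals, (rows, cols)), shape=(1, N))

# With 64-bit index support, the column indices should be int64 and non-negative.
print(y.col.dtype, y.col.min() >= 0)
```

If the printed dtype is still int32, the installed SciPy predates 64-bit index support and matrices with a dimension of 2**31 or more will keep failing.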

Upvotes: 8
