shogen

Reputation: 11

sparse matrix python segmentation fault

I am getting a segmentation fault when I multiply a scipy sparse matrix by its transpose. I've searched all over the Internet but could not find an answer. Any help is appreciated.

>>> import cPickle
>>> fs = open('vec.pickle', 'rb')
>>> vec = cPickle.load(fs)
>>> vec
<3020x512 sparse matrix of type '<type 'numpy.float64'>' with 26008 stored elements in Compressed Sparse Column format>
>>> vec.max()
10.0
>>> vec.min()
0.0
>>> vec * vec.T
Segmentation fault: 11

I do not think this is a memory issue, since the dimensions are small. The vec object was created by gensim, if that information helps.

I also do not think this is an overflow issue, since the range of the elements is [0.0, 10.0].
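For comparison, a well-formed random matrix of the same shape and a similar density multiplies by its transpose without trouble (a minimal sketch, assuming scipy.sparse.random is available):

from scipy import sparse

# Random 3020x512 CSC matrix with roughly the same number of nonzeros
# (26008 / (3020 * 512) is about 0.017 density).
test = sparse.random(3020, 512, density=0.017, format='csc')

# A well-formed matrix of this size multiplies by its transpose fine,
# so the shape alone should not exhaust memory.
prod = test * test.T
print(prod.shape)  # (3020, 3020)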

The pickle object is here: https://drive.google.com/open?id=0B3DJbsn85XMvdmFYT0MzZVFjOVU

Upvotes: 1

Views: 938

Answers (1)

hpaulj

Reputation: 231475

When I load this vec and try to convert it to COO format:

In [13]: vec.tocoo()

ValueError                                Traceback (most recent call last)
...
    226             if self.col.max() >= self.shape[1]:
    227                 raise ValueError('column index exceedes matrix dimensions')

ValueError: row index exceedes matrix dimensions

So something is faulty in the pickled object.

In [38]: vec
Out[38]: 
<3020x512 sparse matrix of type '<type 'numpy.float64'>'
    with 26008 stored elements in Compressed Sparse Column format>

In [37]: vec.indices.max()
Out[37]: 3255

By the shape, it's supposed to have 3020 rows and 512 columns, but the indices attribute goes up to 3255, larger than the number of rows.

So one question is: can we recover a valid matrix from this data? And another: was it valid when it was originally pickled? This looks more like a fault in gensim than in scipy.sparse.

Until the matrix passes simple validity tests like this, I wouldn't jump to any conclusions about the vec * vec.T calculation.
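One such simple test (a minimal sketch; check_format is a method of the compressed sparse classes, and the exact error message may differ between scipy versions):

# With full_check=True, scipy verifies among other things that every entry
# of vec.indices lies inside the declared shape, so a malformed matrix
# fails loudly here instead of crashing later in compiled code.
try:
    vec.check_format(full_check=True)
except ValueError as err:
    print('matrix is malformed:', err)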


I can create a new, valid sparse matrix with:

In [44]: newvec = sparse.csc_matrix((vec.data,vec.indices,vec.indptr))

In [45]: newvec.shape
Out[45]: (3256, 512)

In [46]: newvec * newvec.T
Out[46]: 
<3256x3256 sparse matrix of type '<type 'numpy.float64'>'
    with 314081 stored elements in Compressed Sparse Column format>

In [47]: newvec.tocoo()
Out[47]: 
<3256x512 sparse matrix of type '<type 'numpy.float64'>'
    with 26008 stored elements in COOrdinate format>
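When no shape is passed, csc_matrix infers the number of rows from indices.max() + 1, which is why newvec comes out as 3256x512. If the entries beyond row 3019 are simply spurious, one way to recover a matrix with the originally declared shape is to drop them (a sketch only; whether discarding those entries is the right repair depends on how gensim built the vectors):

from scipy import sparse

# Rebuild with the inferred 3256x512 shape, then keep only the entries
# whose row index fits the shape the pickle claimed (3020 rows).
coo = sparse.csc_matrix((vec.data, vec.indices, vec.indptr)).tocoo()
keep = coo.row < 3020
fixed = sparse.csc_matrix(
    (coo.data[keep], (coo.row[keep], coo.col[keep])),
    shape=(3020, 512))

# The product is now a well-formed 3020x3020 sparse matrix.
gram = fixed * fixed.T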

My guess is that the segmentation fault occurs in the compiled matrix multiplication. At some point vec.indices references a value beyond the space allocated for the C arrays. For the sake of speed, the C code does not check bounds as thoroughly as normal Python and numpy code does; in effect, the matrix multiplication assumes its inputs are well formed.

Upvotes: 2
