StatsSorceress

Reputation: 3099

Unexpected behaviour from scipy.sparse.csr_matrix data

Something's odd with the data here.

If I create a scipy.sparse.csr_matrix with the data property containing only 0s and 1s, and then ask it to print the data property, sometimes there are 2s in the output (other times not).

You can see this behaviour here:

from scipy.sparse import csr_matrix
import numpy as np
from collections import OrderedDict

#Generate some fake data
#This makes an OrderedDict of 10 scipy.sparse.csr_matrix objects, 
#with 3 rows and 3 columns and binary (0/1) values

od = OrderedDict()
for i in range(10):
    row = np.random.randint(3, size=3)
    col = np.random.randint(3, size=3)
    data = np.random.randint(2, size=3)
    print('data is: ', data)
    sp_matrix = csr_matrix((data, (row, col)), shape=(3, 3))
    od[i] = sp_matrix

# Print the data in each scipy sparse matrix
for i in range(10):
    print('data stored in sparse matrix: ', od[i].data)

It'll print something like this:

data is:  [1 0 1]
data is:  [0 0 1]
data is:  [0 0 0]
data is:  [0 0 0]
data is:  [1 1 1]
data is:  [0 0 0]
data is:  [1 1 0]
data is:  [1 0 1]
data is:  [0 0 0]
data is:  [0 0 1]
data stored in sparse matrix:  [1 1 0]
data stored in sparse matrix:  [0 0 1]
data stored in sparse matrix:  [0 0]
data stored in sparse matrix:  [0 0 0]
data stored in sparse matrix:  [2 1]
data stored in sparse matrix:  [0 0 0]
data stored in sparse matrix:  [1 1 0]
data stored in sparse matrix:  [1 1 0]
data stored in sparse matrix:  [0 0 0]
data stored in sparse matrix:  [1 0 0]

Why does the data stored in the sparse matrix not reflect the data originally put there (there were no 2s in the original data)?

Upvotes: 1

Views: 431

Answers (1)

sascha

Reputation: 33532

I'm assuming that your way of creating the matrix:

sp_matrix = csr_matrix((data, (row, col)), shape=(3, 3))

uses coo_matrix under the hood (I had not found the relevant source at first; see the bottom of this answer).

In this case, the docs say (for COO):

By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)
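A small deterministic illustration of that summing behaviour, using hand-picked indices instead of random ones (the specific values are made up for this sketch, not taken from the question):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two of the three entries target the same cell (0, 1).
row = np.array([0, 0, 2])
col = np.array([1, 1, 2])
data = np.array([1, 1, 1])

m = csr_matrix((data, (row, col)), shape=(3, 3))

# The two 1s at (0, 1) are merged into one stored value,
# so only two entries remain in m.data.
print(m.data)
print(m.toarray())
```

Here three input values end up as two stored values, one of which is a 2, which matches the "[2 1]" line in the question's output.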

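Your random-matrix routine does not check for duplicate entries, so whenever the same (row, col) pair is drawn more than once, the corresponding values are added together. Two 1s landing on the same cell become a single stored 2, which is also why some of the printed data arrays have fewer than three elements.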
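If the goal is strictly binary stored values, one option is to draw the cell positions without replacement so that no (row, col) pair can repeat. This is only a sketch of one possible fix; the names n_rows, n_cols, n_entries and flat are introduced here for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

n_rows, n_cols, n_entries = 3, 3, 3

# Draw distinct flat cell indices, then convert them to (row, col) pairs.
flat = np.random.choice(n_rows * n_cols, size=n_entries, replace=False)
row, col = np.unravel_index(flat, (n_rows, n_cols))
data = np.random.randint(2, size=n_entries)

sp_matrix = csr_matrix((data, (row, col)), shape=(n_rows, n_cols))

# With unique positions nothing is summed, so the stored values are
# the original 0/1 values (possibly in a different order).
print('data is: ', data)
print('data stored in sparse matrix: ', sp_matrix.data)
```

The stored order can still differ from the input order, because CSR stores entries row by row.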

Edit: OK, I think I found the code.

csr_matrix has no constructor code of its own; it inherits from _cs_matrix.

compressed.py: _cs_matrix

and there:

        else:
            if len(arg1) == 2:
                # (data, ij) format
                from .coo import coo_matrix
                other = self.__class__(coo_matrix(arg1, shape=shape))
                self._set_self(other)
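To see the two stages separately, the COO matrix can be built by hand and then converted. This sketch assumes the constructor path quoted above and reuses made-up indices with one duplicated cell:

```python
import numpy as np
from scipy.sparse import coo_matrix

row = np.array([0, 0, 2])
col = np.array([1, 1, 2])
data = np.array([1, 1, 1])

# The COO matrix keeps every entry exactly as given, duplicates included.
coo = coo_matrix((data, (row, col)), shape=(3, 3))
print(coo.data, coo.nnz)

# Converting to CSR is the step where duplicate (i, j) entries are summed.
csr = coo.tocsr()
print(csr.data, csr.nnz)
```

So the 2s appear at conversion time, not when the (data, (row, col)) triplets are first stored.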

Upvotes: 2
