Storing large amount of boolean data in python

I need to store sparse matrix data. Size of data is 10^6 10^4 columns. In each column I store a vector of 0's except of few values where it's true.

Then I need to sum over columns in each matrix, and multiply each row with a scalar. I tried dictionaries, but they fail when I need to sum, and multiply.

What would You use?

PS. numpy.zeros is too small

Upvotes: 4

Views: 1478

Answers (3)

dawg
dawg

Reputation: 103844

Depending on your needs, there are literally hundreds of methods to do this. The Sparse Matrix entry on Wikipedia is a good start to figure a method that applies specifically to your needs.

As an extremely simple example, you could use a Dictionary of Keys class like this:

class SparseDOK(dict):

    def __init__(self):
        pass

    def __setitem__(self,key,value):
        if value in[0,0.0,False,None]:
            dict.__setitem__(self,key,False)
            dict.__delitem__(self,key)
        else:
            dict.__setitem__(self,key,True)

    def __getitem__(self, key):    
        try: 
            return dict.__getitem__(self, key)

        except KeyError: 
            return False


>>> dok=SparseDOK()
>>> dok[10,20]=55
>>> print dok
{(10, 20): True}
>>> print dok[10,20]
True
>>> print dok[55,300]      
False
>>> dok[10,20]=False
>>> print dok[10,20]
False

Every entry in an arbitrary 'matrix' is assumed to be False, unless specifically set to True. You would need to add error checking, but this will be very compact and fast.

The advantage of constructing a Dictionary of Key is very efficient construction of the data structure. You only need to go through the original data once and you can easily add or delete data. The disadvantage is less interactive processing of the matrix once you have constructed it.

Since the dictionary keys are tuples, it is trivial to add the indices by row or column. Since the entire matrix would need to be processed after construction to do this, we can just construct a dict with whatever sum or product is desired once and then refer to that dict of processed data.

>>> dok[10,20]=True
>>> dok[10,2000]=True
>>> dok[11,2000]=True
>>> dok[35000,2000]=True
>>> dok[10,35000]=True
>>> print dok
{(11, 2000): True, (10, 2000): True, (35000, 2000): True, (10, 20): True, (10, 35000): True}
cols={}
for tup in dok.keys():
    if tup[1] not in cols:
        cols[tup[1]]=1
    else:
        cols[tup[1]]+=1    

>>> print cols
{2000: 3, 35000: 1, 20: 1}

Now you can refer to the col key in cols for the sum of the rows by col. It is trivial to add product, etc. Just remember you need to recalculate the sums / products if the original DOK is edited or changed. You could keep a running total if you anticipate that the DOK would change often after it was first created.

If your needs are more complex, consider using SciPy or Pysparse. As you can see, there are 7 different sparse matrix format in SciPy. Don't reinvent something the others have already done better...

Upvotes: 1

Tim Pietzcker
Tim Pietzcker

Reputation: 336158

How about two dicts? Assuming this is the matrix (x for True):

   0  1  2  3  4  5  6  7
0  x     x        x 
1     x
2                       x
3              x
4
5
6        x        x
7

You'd only need to store

rows = {0: [0, 2, 5], 1: [1], 2: [7], 3: [4], 6: [2, 5]}

You could easily transform this into

columns = {0: [0], 1: [1], 2: [0, 6], 4: [3], 5: [0, 6], 7: [2]}

using something like

columns = {}
for row in rows:
    for column in rows[row]:
        columns.setdefault(column, []).append(row)

and then sum over columns (sum(1 for x in column[2])) or over rows and multiply the result with whatever you want.

Upvotes: 4

JoshAdel
JoshAdel

Reputation: 68682

As others have mentioned, you should look at scipy.sparse:

http://docs.scipy.org/doc/scipy/reference/sparse.html

There are a bunch of different formats that are optimized for various sparse operations including scalar multiplication and summation.

For example:

import scipy.sparse
import numpy as np

rows = np.array([1,100,1000])
cols = np.array([100,99,1474])
vals = np.ones_like(rows)

A = scipy.sparse.coo_matrix((vals,(rows,cols)),shape=(int(1E6),int(1E6)),dtype=np.bool)

Then to multiply by a scalar and take sum:

B = 3*A
B.sum() # 9

Upvotes: 2

Related Questions