tastyminerals

Reputation: 6538

How to efficiently increase the size of a sparse vector in python?

I have a dictionary of keys where each value should be a sparse vector of huge size (~700,000 elements, maybe more). How do I efficiently grow / build this data structure? Right now my implementation works only for smaller sizes.

from collections import defaultdict

myvec = defaultdict(list)
for id in id_data:
    for item in item_data:
        if item in item_data[id]:
            myvec[id].append(item * 0.5)
        else:
            myvec[id].append(0)   # an explicit 0 for every missing item

When used with huge files, the above code quickly eats up all the available memory. I tried removing the myvec[id].append(0) branch and storing only the non-zero values, since the length of each myvec[id] list is constant anyway. That worked on my huge test file with decent memory consumption, but I'd rather find a better way to do it.

I know that there are different types of sparse arrays/matrices for this purpose, but I have no intuition about which one is better. I tried using lil_matrix from the scipy.sparse package instead of the myvec dict, but it turned out to be much slower than the above code.

So the problem basically boils down to the following two questions:

  1. Is it possible to create a sparse data structure on the fly in Python?

  2. How can one create such a sparse data structure with decent speed?

Upvotes: 3

Views: 15402

Answers (1)

hpaulj

Reputation: 231325

Appending to a list (or lists) will always be faster than appending to a numpy array or to a sparse matrix (which stores its data in several numpy arrays). lil is supposed to be the fastest format when you have to grow a matrix incrementally, but it will still be slower than working directly with lists.

Numpy arrays have a fixed size, so the np.append function actually creates a new array by concatenating the old data with the new.
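A quick way to see that (a minimal sketch):

import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, [4, 5])   # allocates and returns a brand-new array
print(a)                   # [1 2 3]  -- a is unchanged
print(b)                   # [1 2 3 4 5]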

Your example code would be more useful if you gave us some data, so we could cut, paste, and run it.

For simplicity let's define:

from scipy import sparse

data_dict = dict(one=[1, 0, 2, 3, 0, 0, 4, 5, 0, 0, 6])

Sparse matrices can be created directly from this with:

sparse.coo_matrix(data_dict['one'])

whose attributes are:

data:  array([1, 2, 3, 4, 5, 6])
row:   array([0, 0, 0, 0, 0, 0], dtype=int32)
col:   array([ 0,  2,  3,  6,  7, 10], dtype=int32)

or

sparse.lil_matrix(data_dict['one'])
data: array([[1, 2, 3, 4, 5, 6]], dtype=object)
rows: array([[0, 2, 3, 6, 7, 10]], dtype=object)

The coo version is a lot faster when timed.
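For example, one rough way to compare the two constructors (a sketch using timeit; exact numbers will vary by machine):

import timeit
from scipy import sparse

data_dict = dict(one=[1, 0, 2, 3, 0, 0, 4, 5, 0, 0, 6])

t_coo = timeit.timeit(lambda: sparse.coo_matrix(data_dict['one']), number=1000)
t_lil = timeit.timeit(lambda: sparse.lil_matrix(data_dict['one']), number=1000)
print(t_coo, t_lil)   # coo should come out well ahead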

The sparse matrix saves only the nonzero data, but it also has to save the indices. There is also a dictionary format, which uses a (row, col) tuple as the key.
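That dictionary format is dok_matrix; a minimal sketch of filling one element by element, reusing data_dict from above:

dok = sparse.dok_matrix((1, 11), dtype=int)
for i, v in enumerate(data_dict['one']):
    if v != 0:
        dok[0, i] = v   # stored under the (row, col) tuple key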

An example of incremental construction is:

llm = sparse.lil_matrix((1, 11), dtype=int)
for i in range(11):
    llm[0, i] = data_dict['one'][i]

For this small case, the incremental approach is faster.

I get even better speed by only adding the nonzero terms to the sparse matrix:

llm = sparse.lil_matrix((1, 11), dtype=int)
for i in range(11):
    if data_dict['one'][i] != 0:
        llm[0, i] = data_dict['one'][i]

I can imagine adapting this to your defaultdict example. Instead of myvec[id].append(0), you keep a record of where you appended the item * 0.5 values (whether in a separate list, or via a lil_matrix). It would take some experimenting to adapt this idea to a default dictionary.
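A sketch of that idea (all_items and items_per_id here are hypothetical stand-ins, since the question doesn't show the exact structure of item_data):

from collections import defaultdict

# Hypothetical stand-ins for the question's data:
all_items = [3, 7, 9, 12]                      # the full item vocabulary, in order
items_per_id = {'doc1': {7, 12}, 'doc2': {3}}  # which items each id has

values = defaultdict(list)   # the nonzero entries per id
cols = defaultdict(list)     # the positions those entries occupy
for id, owned in items_per_id.items():
    for col, item in enumerate(all_items):
        if item in owned:
            values[id].append(item * 0.5)
            cols[id].append(col)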

So basically the goal is to create 2 lists:

data = [1, 2, 3, 4, 5, 6]
cols = [0, 2, 3, 6, 7, 10]

Whether you create a sparse matrix from these or not depends on what else you need to do with the data.
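If you do want the matrix, a coo_matrix can be built straight from those two lists (a sketch, reusing data and cols from above):

rows = [0] * len(data)   # all in row 0: a single 1 x 11 row vector

vec = sparse.coo_matrix((data, (rows, cols)), shape=(1, 11))
print(vec.toarray())     # [[1 0 2 3 0 0 4 5 0 0 6]]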

Upvotes: 4
