Art

Reputation: 13

numpy array of histograms

I am currently working with a 2D numpy object array filled with collections.Counter objects. Each Counter is basically a histogram.

This all works fine for my needs with smaller datasets, but with a dataset around the 500 million cell mark the memory use is around 120 GB, which is a little high.

Interestingly, numpy.save writes it out to a 4 GB file, which makes me think there is something better I can be doing.

Any suggestions on how I can reduce my memory usage?

I considered a 3D array, but because of the number of empty counts it would have to hold, it required even more memory.

I make lots of use of Counter.update in constructing the array, so any method needs a quick/neat way of getting similar functionality.

Access after the data is created isn't a big issue, as long as for each cell I can get the value for each key - there is no need for a dictionary's indexing.

Below is a very simplified example that produces a small dataset roughly analogous to what I've described above. My real data would skew further towards fewer keys per Counter and higher counts per key:

import collections
import numpy as np

def counterArray_init(v):
    return collections.Counter([v])

def build_counterArray(e):
    # each cell starts as a Counter of its own value, then accumulates
    # the counts of the cell above it in the same column
    row_len, col_len = e.shape
    counterArray = np.zeros([row_len, col_len], dtype=object)
    vinit = np.vectorize(counterArray_init)
    counterArray[:, :] = vinit(e)
    for row in xrange(1, row_len):
        for col in xrange(0, col_len):
            counterArray[row, col].update(counterArray[row - 1, col])
    return counterArray

e = np.random.random_integers(0, 1500, [10, 10])
counterArray = build_counterArray(e)

Thanks

Edit: I have realised that in my smaller Counters the keys used fall within a small range. The random example code above is not a good example of this behaviour. As a result I am investigating using an object array filled with different-length int arrays, plus a separate array that stores the minimum key value for each of those int arrays. It seems like an odd solution, but initial testing suggests it uses only about 20% of the memory of the Counter method.
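For illustration, this is roughly the shape of that idea (an untested sketch; compact and value_for_key are just placeholder names I made up, not anything from a library):

import collections
import numpy as np

def compact(counter):
    # Turn one Counter into (min_key, counts), where counts[i] holds the
    # count for key min_key + i.  Assumes the keys are ints in a narrow range.
    min_key = min(counter)
    counts = np.zeros(max(counter) - min_key + 1, dtype=np.int32)
    for key, count in counter.items():
        counts[key - min_key] = count
    return min_key, counts

def value_for_key(min_key, counts, key):
    # Count stored for `key`; 0 if the key never occurred in this cell.
    idx = key - min_key
    if 0 <= idx < len(counts):
        return counts[idx]
    return 0

c = collections.Counter([3, 3, 4, 7])
min_key, counts = compact(c)              # min_key == 3, counts == [2, 1, 0, 0, 1]
print(value_for_key(min_key, counts, 7))  # -> 1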

Upvotes: 1

Views: 830

Answers (3)

Paul

Reputation: 43620

A quick test shows that storing tuples instead of Counter objects would save you only about 20% (might be worth double-checking in your use case). And if storing a plain int array is even less efficient, then there are few other choices.
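Something like this is what I mean by the tuple test (a rough sketch; the function names are mine):

import numpy as np

def counters_to_tuples(counter_array):
    # Replace each Counter with a sorted tuple of (key, count) pairs;
    # tuples carry less per-object overhead than dict-based Counters.
    out = np.empty(counter_array.shape, dtype=object)
    for idx, counter in np.ndenumerate(counter_array):
        out[idx] = tuple(sorted(counter.items()))
    return out

def tuple_value(pairs, key):
    # Linear scan is fine given the small number of keys per cell.
    for k, count in pairs:
        if k == key:
            return count
    return 0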

Sparse arrays are a good space-saver, but they don't offer the same broadcasting as a normal array; they are usually just used to create or store data, then converted to normal arrays for computations. If you walk your indices with regular Python loops, then sparse arrays might be a good solution.

numpy.save must be doing some kind of compression - compression that is not likely to be useful on data in active use in memory. Are you using PyTables or h5py? How are you currently managing 120 GB of data - virtual memory?

Upvotes: 0

Charles Brunet

Reputation: 23120

I would definitely use a 3D array, since the keys are integers. What is the maximum count for a specific item? If it is 255 or below, you could also change the dtype of the array to 8-bit.
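A rough sketch of that, using toy data like the question's (uint8 assumes no count ever exceeds 255):

import numpy as np

e = np.random.randint(0, 1501, (10, 10))   # same kind of toy data as the question
num_keys = 1501                            # keys 0..1500

# counts[r, c, k] == count of key k in cell (r, c)
counts = np.zeros(e.shape + (num_keys,), dtype=np.uint8)
rows, cols = np.indices(e.shape)
counts[rows, cols, e] = 1

# the question's column-wise Counter.update is just a cumulative sum down the rows
counts = counts.cumsum(axis=0, dtype=np.uint8)

print(counts[9, 0, e[0, 0]])   # the key from row 0 has accumulated into row 9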

Upvotes: 1

jeff7

Reputation: 2182

If there are lots of empty counts, a sparse matrix representation may be a good fit: memory use is proportional to the number of non-empty elements in the array. SciPy has decent support for what it sounds like you're looking at: scipy.sparse
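For example, one way to map the question's layout onto scipy.sparse (a sketch, not the only way: one sparse row per grid cell, one sparse column per key):

import numpy as np
import scipy.sparse as sp

e = np.random.randint(0, 1501, (10, 10))   # toy data as in the question
row_len, col_len = e.shape
num_keys = 1501

# The question's column-wise Counter.update means cell (r, c) counts every
# value e[0..r, c], so emit one (cell, key) entry per such pair; coo_matrix
# sums duplicate entries on conversion, and only non-zero counts are stored.
cells, keys = [], []
for r in range(row_len):
    for c in range(col_len):
        for r2 in range(r + 1):
            cells.append(r * col_len + c)
            keys.append(e[r2, c])
counts = sp.coo_matrix(
    (np.ones(len(cells), dtype=np.int32), (cells, keys)),
    shape=(row_len * col_len, num_keys),
).tocsr()

def cell_count(row, col, key):
    # count of `key` in grid cell (row, col); 0 if the key never occurred
    return counts[row * col_len + col, key]

print(cell_count(9, 0, e[0, 0]))   # the key from row 0 has accumulated into row 9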

Upvotes: 1
