NUMPY Operations: Memory Efficiency: PYTHON

Question

I have a memory constraint of 4 GB of RAM, and I need to keep about 2.5 GB of data in RAM for further processing.

import numpy
a = numpy.random.rand(1000000, 100)  # This needs to be in memory
b = numpy.random.rand(1, 100)
c = a - b  # This is needed in order to perform the next operation
d = numpy.linalg.norm(numpy.asarray(c, dtype=numpy.float32), axis=1)

While creating c, memory usage explodes and Python gets killed. Is there a way to speed up this process? I am running this on an EC2 Ubuntu instance with 4 GB of RAM and a single core. When I run the same calculation on my Mac (OS X), it completes without any memory problem and also takes less time. Why is this happening?

One solution that I can think of is

d = [numpy.sqrt(numpy.dot(i - b[0], i - b[0])) for i in a]  # b[0] is the single row of b as a 1-D array

which I don't think will be good for speed.

Warren Weckesser · Accepted Answer

If the creation of a doesn't cause a memory problem, and you don't need to preserve the values in a, you could compute c by modifying a in place:

a -= b  # Now use `a` instead of `c`.
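As a minimal sketch of the in-place approach (using a small `n` as a stand-in for the 1,000,000 rows in the question, and assuming `a` really can be overwritten), the following checks that `a -= b` reuses the existing buffer of `a` instead of allocating a second full-size array, and that the norms match the ones computed from a separate `c`. The `buffer_before` and `buffer_after` names are only for illustration.

```python
import numpy as np

n = 1000  # small stand-in for the 1,000,000 rows in the question
a = np.random.rand(n, 100)
b = np.random.rand(1, 100)

# Reference result computed the memory-hungry way, for comparison only.
expected = np.linalg.norm(a - b, axis=1)

buffer_before = a.ctypes.data  # address of the data buffer of a
a -= b                         # b is broadcast across the rows; a is overwritten
buffer_after = a.ctypes.data   # same address: no new full-size array was created

# a now holds what c held in the question.
d = np.linalg.norm(a, axis=1)
```

Keep in mind that the original values of `a` are gone after this; if they are needed later, `a += b` restores them only up to floating-point rounding.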

Otherwise, the idea of working in smaller chunks or batches is a good one. With your list comprehension solution, you are, in effect, computing d from a and b in batch sizes of one row of a. You can improve the efficiency by using a larger batch size. Here's an example; it includes your code (with some cosmetic changes) and a version that computes the result (called d2) in batches of batch_size rows of a.

import numpy as np

#n = 1000000
n = 1000
a = np.random.rand(n, 100)  # This needs to be in memory
b = np.random.rand(1, 100)
c = a - b  # This is needed in order to perform the next operation
d = np.linalg.norm(c, axis=1)

batch_size = 300
# Preallocate the result.
d2 = np.empty(n)
for start in range(0, n, batch_size):
    end = min(start + batch_size, n)
    c2 = a[start:end] - b
    d2[start:end] = np.linalg.norm(c2, axis=1)
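The batch loop above could also be wrapped in a function that reuses one scratch buffer for every batch, so the only extra memory beyond `a` and the result is roughly `batch_size * 100 * 8` bytes for float64 data. This is a sketch under assumptions, not part of the original answer: the name `batched_row_norms` and the default `batch_size` are hypothetical, and both inputs are assumed to be floating-point arrays.

```python
import numpy as np

def batched_row_norms(a, b, batch_size=10_000):
    """Euclidean norm of each row of (a - b), without building a - b in full.

    Hypothetical helper: assumes a is 2-D, b broadcasts against the rows
    of a, and both hold floating-point values.
    """
    n = a.shape[0]
    dtype = np.result_type(a, b)
    out = np.empty(n, dtype=dtype)
    # One scratch buffer, allocated once and reused by every batch.
    scratch = np.empty((min(batch_size, n), a.shape[1]), dtype=dtype)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        diff = scratch[:end - start]            # view; the last batch may be shorter
        np.subtract(a[start:end], b, out=diff)  # write a - b into the scratch buffer
        out[start:end] = np.linalg.norm(diff, axis=1)
    return out

# Small demonstration; with n = 1_000_000, a alone would occupy
# 1_000_000 * 100 * 8 bytes = 800 MB as float64.
n = 1000
a = np.random.rand(n, 100)
b = np.random.rand(1, 100)
d2 = batched_row_norms(a, b, batch_size=300)
```

A batch size in the thousands usually keeps the Python loop overhead small while the scratch buffer stays at a few megabytes; the best value depends on the machine, so it is worth timing a few sizes.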
