Reputation: 3263
I want to create a list of numpy arrays in Python. The arrays are mostly zeros, with a few flags set to one.
When running the following code, I run into memory issues: it uses roughly double the memory I expect it to.
Python loop to fill the list:
import os
import numpy as np

vectorized_data = []
os.system("free -m")
for dat in data:  # data has length 200000
    one_hot_vector = np.zeros(6000)
    for d in dat:  # dat holds the indices to flag
        one_hot_vector[d] = 1
    vectorized_data.append(one_hot_vector)
os.system("free -m")  # memory usage goes up by ~7.5 GB
Amount of memory I expect this code to use (vector dimension: 6000, number of samples: 200000, numpy float size: 4 bytes):
(6000 * 200000 * 4) / (2**30) ≈ 4.47 GB
Amount of memory actually used:
~7.5 GB
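For reference, the per-array footprint can be checked directly (np.zeros defaults to float64, so each element takes 8 bytes rather than 4):

import numpy as np

one_hot_vector = np.zeros(6000)
print(one_hot_vector.dtype)   # float64 by default
print(one_hot_vector.nbytes)  # 48000 bytes, i.e. 8 bytes per element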
Is there any more memory-efficient way of achieving this?
Upvotes: 1
Views: 129
Reputation: 24109
You could use a generator together with the row id, something like:
def yield_row(data):
    for r_id, dat in enumerate(data):
        tmp = np.zeros(6000)
        for d in dat:
            tmp[d] = 1
        yield r_id, tmp

# is_hot_vector and do_stuff stand in for your own logic
for r_id, tmp in yield_row(data):
    if is_hot_vector(tmp):
        do_stuff()
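Since the rows only ever hold 0 or 1, passing a smaller dtype also shrinks each yielded row. A minimal variant of the generator above, assuming your downstream code can work with integer rows:

def yield_row_small(data, dim=6000):
    for r_id, dat in enumerate(data):
        # np.uint8 uses 1 byte per element, enough to hold 0/1 flags,
        # versus 8 bytes per element for the float64 default
        tmp = np.zeros(dim, dtype=np.uint8)
        for d in dat:
            tmp[d] = 1
        yield r_id, tmp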
This approach has the downside that you only have access to the row id and the current tmp row; however, it reduces the memory needed to that of data plus a single row.
Another approach is to store only the row id in a list instead of the entire row, index into data when a row is needed, and apply the translation/transform on demand.
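A minimal sketch of that idea, assuming each element of data is the sequence of indices to flag (as the question's inner loop suggests): keep only the indices and materialize a dense row on demand:

def one_hot(indices, dim=6000, dtype=np.uint8):
    # build a single dense row from its stored indices only when needed
    row = np.zeros(dim, dtype=dtype)
    row[indices] = 1  # fancy indexing sets every flagged position at once
    return row

# store only (row id, indices) pairs, tiny compared to 200000 dense rows
rows = list(enumerate(data))

r_id, indices = rows[0]
dense_row = one_hot(indices)  # translate/transform on demand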
Upvotes: 2