Reputation: 3263
I want to create a list of numpy arrays in Python. The arrays are mostly zeros, with a few flags set to one.
When running the following code, I run into memory issues: it uses roughly double the memory I expect it to.
Python loop to fill the list:
import os
import numpy as np

vectorized_data = []
os.system("free -m")
for dat in data:  # data has length 200000
    one_hot_vector = np.zeros(6000)
    for d in dat:  # dat holds the indices to flag
        one_hot_vector[d] = 1
    vectorized_data.append(one_hot_vector)
os.system("free -m")  # memory usage goes up by ~7.5 GB
Amount of memory I expect this code to use (vector dimension: 6000, number of samples: 200000, numpy float size: 4 bytes):
(6000 * 200000 * 4) / (2**30) ≈ 4.47 GB
Amount of memory actually used:
~7.5 GB
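For reference, the per-array footprint can be checked directly (np.zeros defaults to float64, so each element takes 8 bytes rather than 4):

import numpy as np

one_hot_vector = np.zeros(6000)
print(one_hot_vector.dtype)   # float64 by default
print(one_hot_vector.nbytes)  # 48000 bytes, i.e. 8 bytes per element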
Is there any more memory-efficient way of achieving this?
Upvotes: 1
Views: 129
Reputation: 24109
You could use a generator together with the row id, something like:
def yield_row(data):
    for r_id, dat in enumerate(data):
        tmp = np.zeros(6000)
        for d in dat:
            tmp[d] = 1
        yield r_id, tmp

# is_hot_vector and do_stuff stand in for your own logic
for r_id, tmp in yield_row(data):
    if is_hot_vector(tmp):
        do_stuff()
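Since the rows only ever hold 0 or 1, passing a smaller dtype also shrinks each yielded row. A minimal variant of the generator above, assuming your downstream code can work with integer rows:

def yield_row_small(data, dim=6000):
    for r_id, dat in enumerate(data):
        # np.uint8 uses 1 byte per element, enough to hold 0/1 flags,
        # versus 8 bytes per element for the float64 default
        tmp = np.zeros(dim, dtype=np.uint8)
        for d in dat:
            tmp[d] = 1
        yield r_id, tmp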
This approach has the downside that you only have access to the row id and the current tmp row; however, it reduces the memory needed to that of data plus a single row.
Another approach is to store only the row id in a list instead of the entire row, index into data when a row is needed, and apply the translation/transform on demand.
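A minimal sketch of that idea, assuming each element of data is the sequence of indices to flag (as the question's inner loop suggests): keep only the indices and materialize a dense row on demand:

def one_hot(indices, dim=6000, dtype=np.uint8):
    # build a single dense row from its stored indices only when needed
    row = np.zeros(dim, dtype=dtype)
    row[indices] = 1  # fancy indexing sets every flagged position at once
    return row

# store only (row id, indices) pairs, tiny compared to 200000 dense rows
rows = list(enumerate(data))

r_id, indices = rows[0]
dense_row = one_hot(indices)  # translate/transform on demand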
Upvotes: 2