Reputation: 2605
I'm trying to do some calculations on over 1000 arrays of shape (100, 100, 1000). But, as I might have expected, it only takes about 150-200 arrays before my memory is used up and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np

# splitlines() avoids the trailing empty string left by a final newline
with open("data/toxicity.txt", "r") as toxicity_file:
    toxicity_data = np.array(toxicity_file.read().splitlines(), dtype=int)
patients = range(1, 1000)
The above file is just a list of 1's and 0's (indicating toxicity or not), one value per array; in this case one array holds the data for one patient, so there are roughly 1000 patients.
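As an aside, assuming toxicity.txt really does hold one 0 or 1 per line, the same array could also be loaded with np.loadtxt, which handles the type conversion and any trailing newline in one step:

toxicity_data = np.loadtxt("data/toxicity.txt", dtype=int)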
From this I then create two lists of indices, so I have one list of patients with toxicity and one of patients without.
patients_no_tox = [i for i, e in enumerate(toxicity_data) if e == 0]
patients_with_tox = [i for i, e in enumerate(toxicity_data) if e == 1]
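Since toxicity_data is already an integer array, the same indices can also be obtained without a Python loop (a small alternative sketch using the same variable names):

patients_no_tox = np.flatnonzero(toxicity_data == 0)
patients_with_tox = np.flatnonzero(toxicity_data == 1)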
I then write this function, which loads an already saved-to-disk (100, 100, 1000) array for each patient and removes some indices (also loaded from a saved file) that will not work later on or simply need to be removed, so this step is essential. The result is a final list of all patients and their (flattened) array data. This is where things start to eat memory, when the function is used in the list comprehension below.
def log_likely_list(patient, remove_index_list):
    # Load the patient's (100, 100, 1000) array, flatten it and drop the unwanted indices
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)
remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
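For scale: a single (100, 100, 1000) array of float64 values (assuming 8 bytes per value) is already about 80 MB, so holding 150-200 of them at once is 12-16 GB before any later step runs:

bytes_per_patient = 100 * 100 * 1000 * 8   # 80,000,000 bytes, about 80 MB per patient
print(150 * bytes_per_patient / 1e9)       # about 12 GB for 150 patients
print(1000 * bytes_per_patient / 1e9)      # about 80 GB for the full cohort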
The next step is to create the two arrays I need for my calculations: starting from the final list with all patients, I remove either the patients with toxicity or the patients without, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
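For reference, np.column_stack puts one patient per column, so the axis=1 sums below run across patients; a toy sketch with two made-up patient arrays a and b:

a = np.array([0.2, 0.5, 0.9])        # hypothetical values for patient 1
b = np.array([0.1, 0.4, 0.8])        # hypothetical values for patient 2
stacked = np.column_stack([a, b])    # shape (3, 2): rows are indices, columns are patients
np.sum(np.log(stacked), axis=1)      # one summed value per index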
The last piece of the puzzle is to use these two arrays in the following equation, with the non-tox array on the right-hand side and the tox array on the left-hand side. For each individual index of the data it sums over all 1000 patients (i.e. the same index in every patient's array), so I end up with one large list of values.
log_likely = (np.sum(np.log(patients_with_tox_list), axis=1) +
              np.sum(np.log(1 - patients_no_tox_list), axis=1))
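Written out as a formula, with $p_{ij}$ standing for patient $i$'s value at index $j$ (notation introduced here only for clarity), this computes for every index $j$:

$$\text{log\_likely}_j = \sum_{i \in \text{tox}} \log p_{ij} + \sum_{i \in \text{no tox}} \log\left(1 - p_{ij}\right)$$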
My problem, as stated, is that at around 150-200 patients (in the patients range) my memory is used up and everything shuts down.
I have obviously tried saving things to disk and loading them back (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could feed one array at a time into the log_likely_list function, but before summing I would probably still end up with just as large an array, and the computation might be a lot slower if I can't use NumPy's sum feature and such. So is there any way I could optimize/improve on this, or is the only way out to buy a hell of a lot more RAM?
Upvotes: 1
Views: 220
Reputation: 845
A list comprehension builds its entire result list in memory at once. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to use generator expressions, which process items one at a time. To form a generator, surround your for ... in ... expression with parentheses instead of square brackets. It might look something like this:
import itertools

with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)

no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)

final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. A simple way to aggregate all the results in this case is functools.reduce, which pulls the arrays from the chained generator one at a time and keeps only the running total in memory:

import functools

log_likely = functools.reduce(np.add, final_data)
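Equivalently, if reduce feels opaque, an explicit running-total loop over the same generator does the same thing, again holding only one patient's array (plus the total) at a time:

log_likely = None
for data in final_data:
    log_likely = data if log_likely is None else log_likely + data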
Upvotes: 1