Maxwell A. Bertolero

Reputation: 33

np.nanmean of 500 large numpy matrices

I am trying to get the average (ignoring NaN values) of very large numpy matrices. I know I can load them without taking up too much memory by doing something like:

X=np.load('my_matrix_1.npy', mmap_mode='r')

And then I can read some lines from it. I was thinking of reading 1000 lines at a time from each matrix and storing the NaN-mean of those chunks in a result matrix.

so something like this:

for chunk in chunks:
     chunk_to_mean = []
     for matrix in matrices:
          X=np.load(matrix, mmap_mode='r')
          chuck_to_mean.append(X)
          del X
     matrix[chunk] = np.nanmean(chunk_to_mean)

However, I get a memory allocation error the second time I try to load something with memory mapping, even though I delete it. Does anyone know how to solve this, or perhaps have a better idea of how to do it?
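For reference, a minimal runnable sketch of the chunked approach described above (the filenames and the dummy data here are made up for illustration; it assumes all matrices share the same shape):

```python
import numpy as np

# Hypothetical filenames; dummy data so the sketch is self-contained.
files = ['my_matrix_1.npy', 'my_matrix_2.npy']
rng = np.random.default_rng(0)
for i, f in enumerate(files):
    a = rng.random((10, 4))
    a[i, 0] = np.nan  # sprinkle in a NaN per file
    np.save(f, a)

n_rows = np.load(files[0], mmap_mode='r').shape[0]
chunk_size = 5
result = np.empty((10, 4))

for start in range(0, n_rows, chunk_size):
    stop = min(start + chunk_size, n_rows)
    # Copy each chunk out of its memmap, then stack into one array
    # so np.nanmean can average across files (axis 0).
    block = np.stack([np.array(np.load(f, mmap_mode='r')[start:stop])
                      for f in files])
    result[start:stop] = np.nanmean(block, axis=0)
```

Only one chunk per file is ever held in memory at a time, plus the output matrix.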

Upvotes: 1

Views: 735

Answers (2)

Maxwell A. Bertolero

Reputation: 33

This is tested and works. It should work on any very large matrix, as it only reads in one row of the matrix at a time. As long as you can hold the average matrix in memory, it will run.

def average_subject_real(subject):
    matrix = np.load('graph_matrix_%s_1.npy' % subject, mmap_mode='r')
    matrix_size = matrix.shape[0]
    del matrix
    average_matrix = np.zeros((matrix_size, matrix_size))
    for line in range(matrix_size):
        temp_array = []
        for i in range(1, 5):
            matrix = np.load('graph_matrix_%s_%s.npy' % (subject, i), mmap_mode='r')
            matrix = np.array(matrix[line])  # the copying happens here
            temp_array.append(matrix)
            del matrix
        average_matrix[line] = np.nanmean(temp_array, axis=0)
    np.save('graph_matrix_%s_average.npy' % subject, arr=average_matrix)
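If reopening every file for every row turns out to be slow, a variant of the same idea (sketched below with hypothetical filenames and dummy data) opens each memmap once and still only copies one row per file per iteration:

```python
import numpy as np

# Dummy data under hypothetical names graph_matrix_s1_1.npy .. _4.npy
rng = np.random.default_rng(1)
for i in range(1, 5):
    np.save('graph_matrix_s1_%s.npy' % i, rng.random((6, 6)))

def average_subject(subject):
    # Open each memmap once, up front.
    mats = [np.load('graph_matrix_%s_%s.npy' % (subject, i), mmap_mode='r')
            for i in range(1, 5)]
    n = mats[0].shape[0]
    average_matrix = np.zeros((n, n))
    for line in range(n):
        rows = [np.array(m[line]) for m in mats]  # copy one row per file
        average_matrix[line] = np.nanmean(rows, axis=0)
    np.save('graph_matrix_%s_average.npy' % subject, average_matrix)

average_subject('s1')
```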

Upvotes: 0

cleros

Reputation: 4343

This code has several problems. The first is that chunk_to_mean = [] creates a Python list, and you append the numpy arrays as elements to it - so it becomes a list of numpy arrays. np.nanmean can only handle that if it can stack the list into a single np.array, which requires every element to have the same shape.

The second is that you either have a dictionary, which you can index with a string, or you have a list, which you have to index with an int. So your matrix[chunk] does not make sense.

So you have two options. If all the chunks have the same size, you can take the np.nanmean for each matrix, and then the mean of the resulting list.

results = []
for chunk in chunks:
    chunk_to_mean = []
    for matrix in matrices:
        X = np.load(matrix, mmap_mode='r')
        chunk_to_mean.append(np.nanmean(X))
        del X  # why do you need this?? This is python - it has garbage collection!
    results.append(np.nanmean(np.array(chunk_to_mean)))

The other option, if you want a weighted mean, is to concatenate the matrices, and then take the np.nanmean of that:

results = []
for chunk in chunks:
    for i in range(len(matrices)):
        X = np.load(matrices[i], mmap_mode='r')
        if i == 0:
            all_matrices = np.array(X)
        else:
            all_matrices = np.concatenate((all_matrices, np.array(X)), axis=0)  # check the concat axis!!
        del X
    results.append(np.nanmean(all_matrices))
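A tiny numeric check (toy arrays, not the asker's data) of why these two options differ when the arrays have different sizes: the mean of per-array means weights each array equally, while the mean of the concatenation weights each element equally.

```python
import numpy as np

a = np.array([1.0, 1.0])              # two elements, mean 1.0
b = np.array([4.0, 4.0, 4.0, 4.0])    # four elements, mean 4.0

mean_of_means = np.nanmean([np.nanmean(a), np.nanmean(b)])  # (1 + 4) / 2 = 2.5
pooled_mean = np.nanmean(np.concatenate((a, b)))            # 18 / 6 = 3.0
```

The two agree only when every array contributes the same number of (non-NaN) elements.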

Upvotes: 0
