Memory error when using += one to large matrix

Question

I'm attempting to write a collapsed Gibbs sampler in Python and am running into memory issues when creating initial values for one of my matrices. I am rather new to Python, so below is an outline of what I am doing, with explanation. At step 4 I receive a MemoryError.

My goal is to:

Create a T x M matrix of zeros (plus an alpha value), where T is some small number such as 2 to 6 and M can be very large.

import numpy as np
import pandas as pd
M = 500
N = 10000
T = 6
alpha = .3
NZM = np.zeros((T,M), dtype = np.float64) + alpha

Create an M x N matrix of topic assignments drawn from a multinomial distribution over the T topics, which would look like the following.

# Draw one of the T topics uniformly for each of the M*N entries
draws = np.random.multinomial(1, [1./T]*T, size=M*N)
Z = np.where(draws == 1)[1].reshape(M, N)
Z

array([[1, 3, 0, ..., 5, 3, 1],
       [3, 5, 0, ..., 5, 1, 2],
       [4, 5, 4, ..., 1, 3, 5],
       ..., 
       [1, 2, 1, ..., 0, 3, 4],
       [0, 5, 2, ..., 2, 5, 0],
       [2, 3, 2, ..., 4, 1, 5]])

Flatten these into a one-dimensional index using .reshape(M*N)

Z_index = Z.reshape(M*N) 

array([1, 3, 0, ..., 4, 1, 5])

This step is where I receive my error. I use Z_index to add one to each row of NZM every time that row shows up as a value in Z. However, option 1 below is very slow, while option 2 raises a memory error.

# Option 1: loop over the columns (very slow)
for m in range(M):
    NZM[Z_index,m] += 1

# Option 2: index all columns at once (raises MemoryError)
NZM[Z_index,:] += 1



---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
 in ()
      2 # a memory error
      3 
----> 4 NZM[Z_index,:] += 1


MemoryError:

I want to add one to a row of this array each time it shows up in the Z_index. Is there a way to do this quickly and efficiently that I am unaware of? Thank you for taking the time to read this.

Steve Bronder · Accepted Answer

My question is a duplicate of the question here; however, it arises from an inquiry that I think is unique and will be found more easily by people searching for an error caused by large duplicate indices.

So a simple sanity check shows that this is not doing what I thought it was doing. I assumed that, given an index with multiples of the same row, += would add one more to those rows for each time that row was present in the index.

import numpy as np
import pandas as pd

NWZ = np.zeros((5,5), dtype=np.float64) + 1

index = np.repeat([0,3], [1, 3], axis=0)

index

array([0, 3, 3, 3])

NWZ[index,:] += 1

NWZ

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.]])

We can see this is not the case: giving += multiple instances of the same row adds one to that row only once. Because += performs 'in place' operations, I assumed that this operation would return

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 4.,  4.,  4.,  4.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])

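However, by calling .__iadd__(1) explicitly on the indexed rows, we see that the addition is not performed cumulatively as it iterates through the index. Each entry of the index yields its own copy of the row, and each copy is incremented once.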

NWZ[index,:].__iadd__(1)

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.]])

You can go here for an intuitive explanation as to why this doesn't (and, the user asserts, shouldn't) happen.
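As a rough sketch of what is going on: an in-place add through an integer-array index behaves approximately like a get, an add on a temporary copy, and a set. The names tmp, tmp_bytes and tmp_gb below are only for illustration. The size estimate assumes the shapes from the question, where NZM has M columns and Z_index has M*N entries, so the temporary would need one full row of M floats per index entry.

```python
import numpy as np

NWZ = np.ones((5, 5), dtype=np.float64)
index = np.array([0, 3, 3, 3])

# Approximately what `NWZ[index, :] += 1` does:
tmp = NWZ[index, :]   # integer-array indexing returns a copy, one row per index entry
tmp += 1              # each copied row goes from 1 to 2, duplicates included
NWZ[index, :] = tmp   # write back; row 3 is assigned three times with the same values

# Size of that temporary for the shapes in the question (M = 500, N = 10000)
M, N = 500, 10000
tmp_bytes = (M * N) * M * np.dtype(np.float64).itemsize
tmp_gb = tmp_bytes / 1e9

print(tmp.shape)   # shape of the temporary for the small example
print(tmp_gb)      # approximate size in GB for the large case
```

The temporary copy would explain both symptoms: duplicates do not accumulate because every copy starts from the same original row, and the large case runs out of memory because the copy has M*N rows rather than T.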

An alternative solution to my problem is to first create a frequency table of the number of times row n shows up in my duplicate index. Then, since I'm only doing addition, add those frequencies to their corresponding rows.

from scipy.stats import itemfreq

index_counts = itemfreq(index)

N = len(index_counts[:,1])
NWZ[index_counts[:,0].astype(int),:] += index_counts[:,1].reshape(N,1)
NWZ

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 4.,  4.,  4.,  4.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])
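Note that scipy.stats.itemfreq has since been deprecated and is not available in newer SciPy releases. A minimal sketch of the same idea using only NumPy follows. The three variants (A, B, C) are illustrative alternatives, not part of the original answer: np.add.at performs an unbuffered add so repeated indices accumulate, while np.unique with return_counts=True and np.bincount both build the frequency table directly.

```python
import numpy as np

NWZ = np.ones((5, 5), dtype=np.float64)
index = np.array([0, 3, 3, 3])

# Variant A: unbuffered in-place add; repeated indices accumulate.
A = NWZ.copy()
np.add.at(A, index, 1)

# Variant B: frequency table via np.unique, then one add per distinct row.
B = NWZ.copy()
rows, counts = np.unique(index, return_counts=True)
B[rows, :] += counts[:, np.newaxis]

# Variant C: np.bincount gives a count for every row (zero if absent),
# so the counts can be broadcast across all columns in a single add.
C = NWZ.copy()
C += np.bincount(index, minlength=C.shape[0])[:, np.newaxis]

print(A)
```

For the matrices in the question, the bincount variant would only ever allocate an array of length T for the counts, so something like NZM += np.bincount(Z_index, minlength=T)[:, np.newaxis] should avoid the large temporary entirely.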
