Steve Bronder

Reputation: 936

MemoryError when using += 1 on a large matrix

I'm attempting to write a collapsed Gibbs sampler in Python and am running into memory issues when creating initial values for one of my matrices. I am rather new to Python, so below is an outline of what I am doing, with explanation. At step 4 I receive a MemoryError.

My goal is to:

  1. Create a T,M matrix of zeros (plus an alpha value), where T is a small number such as 2 to 6 and M can be very large

    import numpy as np
    import pandas as pd
    M = 500
    N = 10000
    T = 6
    alpha = .3
    NZM = np.zeros((T,M), dtype = np.float64) + alpha
    
  2. Create an M,N matrix of numbers generated by a multinomial distribution over T topics, which would look like the following.

    # One draw per cell: take the column index of the single 1 in each multinomial sample
    Z = np.where(np.random.multinomial(1, [1./T]*T, size=M*N) == 1)[1].reshape(M, N)
    Z
    
    array([[1, 3, 0, ..., 5, 3, 1],
           [3, 5, 0, ..., 5, 1, 2],
           [4, 5, 4, ..., 1, 3, 5],
           ..., 
           [1, 2, 1, ..., 0, 3, 4],
           [0, 5, 2, ..., 2, 5, 0],
           [2, 3, 2, ..., 4, 1, 5]])
    
  3. Create an index out of these using .reshape(M*N)

    Z_index = Z.reshape(M*N) 
    
    array([1, 3, 0, ..., 4, 1, 5])
    
  4. This step is where I receive my error. I use Z_index to add one to each row of NZM that shows up as a value in Z. However, option 1 below is very slow, while option 2 raises a memory error.

    # Option 1
    for m in range(M):
        NZM[Z_index,m] += 1
    
    # Option 2
    NZM[Z_index,:] += 1  
    
    
    
    ---------------------------------------------------------------------------
    MemoryError                               Traceback (most recent call last)
    <ipython-input-88-087ab1ede05d> in <module>()
          2 # a memory error
          3 
    ----> 4 NZM[Z_index,:] += 1
    
    
    MemoryError: 
    

I want to add one to a row of this array each time it shows up in the Z_index. Is there a way to do this quickly and efficiently that I am unaware of? Thank you for taking the time to read this.

Upvotes: 0

Views: 207

Answers (1)

Steve Bronder

Reputation: 936

My question is a duplicate of the question here; however, it arises from an inquiry which I think is unique and will be found more easily by people searching for an error caused by large duplicate indices.

So a simple sanity check shows that this is not doing what I thought it was doing. I assumed that, given an index with multiples of the same row, += would add one more to those rows for each time that row was present in the index.

    import numpy as np
    import pandas as pd

    # 5x5 matrix of ones, matching the output shown below
    NWZ = np.zeros((5,5), dtype=np.float64) + 1

    # Row 0 appears once, row 3 appears three times
    index = np.repeat([0,3], [1, 3], axis=0)
    index

    array([0, 3, 3, 3])

    NWZ[index,:] += 1
    NWZ

    array([[ 2.,  2.,  2.,  2.,  2.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 2.,  2.,  2.,  2.,  2.],
           [ 1.,  1.,  1.,  1.,  1.]])

This is not the case: when the index contains several instances of the same row, += adds one to that row only once. Because += operates in place, I had assumed the operation would return

    array([[ 2.,  2.,  2.,  2.,  2.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 4.,  4.,  4.,  4.,  4.],
           [ 1.,  1.,  1.,  1.,  1.]])

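However, calling .__iadd__(1) explicitly on the indexed selection shows that the addition is not performed cumulatively as it iterates through the index. The selection holds one row per entry of the index, and each of those rows is incremented exactly once.

    # Start again from a matrix of ones so the output below is reproducible
    NWZ = np.zeros((5,5), dtype=np.float64) + 1
    NWZ[index,:].__iadd__(1)

    array([[ 2.,  2.,  2.,  2.,  2.],
           [ 2.,  2.,  2.,  2.,  2.],
           [ 2.,  2.,  2.,  2.,  2.],
           [ 2.,  2.,  2.,  2.,  2.]])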

You can go here for an intuitive explanation of why this doesn't (and, the user asserts, shouldn't) happen.
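As far as I understand it, `a[index] += 1` is evaluated as `a[index] = a[index] + 1`: the right-hand side is computed once from a copy, and every duplicate of a row is then assigned the same value. The sketch below contrasts that with `np.add.at`, NumPy's unbuffered form of the same operation, which gives the cumulative behaviour I originally expected. The names `buffered` and `unbuffered` are only illustrative.

```python
import numpy as np

index = np.array([0, 3, 3, 3])

# Buffered: behaves like a[index] = a[index] + 1, so duplicate rows collapse.
buffered = np.ones((5, 5))
buffered[index, :] += 1

# Unbuffered: np.add.at adds once per occurrence of a row in the index.
unbuffered = np.ones((5, 5))
np.add.at(unbuffered, index, 1)

print(buffered[:, 0])    # rows 0 and 3 each gained 1
print(unbuffered[:, 0])  # row 0 gained 1, row 3 gained 3
```

`np.add.at` still visits every entry of the index, so for a very long index with few distinct rows, counting first is likely the cheaper route.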

An alternative solution to my problem is to first create a frequency table of the number of times row n shows up in my duplicate index. Then, since I'm only doing addition, add those frequencies to their corresponding rows.

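    # Start again from a matrix of ones
    NWZ = np.zeros((5,5), dtype=np.float64) + 1

    # Frequency table: the unique row labels and how often each appears in the index
    rows, counts = np.unique(index, return_counts=True)

    # counts as a column vector broadcasts across every column of the selected rows
    NWZ[rows,:] += counts.reshape(-1,1)
    NWZ

    array([[ 2.,  2.,  2.,  2.,  2.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 4.,  4.,  4.,  4.,  4.],
           [ 1.,  1.,  1.,  1.,  1.]])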

Upvotes: 1
