I'm attempting to write a collapsed Gibbs sampler in Python and am running into memory issues when creating initial values for one of my matrices. I am rather new to Python, so below is an outline of what I am doing, with explanations. At the fourth step (the row update) I receive a MemoryError.
My goal is to:
Create a (T, M) matrix of zeros (plus an alpha value), where T is a small number, such as 2 to 6, and M can be very large
import numpy as np
import pandas as pd
M = 500
N = 10000
T = 6
alpha = .3
NZM = np.zeros((T,M), dtype = np.float64) + alpha
Create an (M, N) matrix of topic indices drawn from a multinomial distribution over T topics, which looks like the following.
# one draw per entry; np.where(...)[1] recovers the chosen topic index
Z = np.where(np.random.multinomial(1, [1./T]*T, size=M*N) == 1)[1].reshape(M, N)
Z
array([[1, 3, 0, ..., 5, 3, 1],
[3, 5, 0, ..., 5, 1, 2],
[4, 5, 4, ..., 1, 3, 5],
...,
[1, 2, 1, ..., 0, 3, 4],
[0, 5, 2, ..., 2, 5, 0],
[2, 3, 2, ..., 4, 1, 5]])
Create an index out of these using .reshape(M*N)
Z_index = Z.reshape(M*N)
array([1, 3, 0, ..., 4, 1, 5])
This step is where I receive my error. I use Z_index to add one to a row of NZM each time that row's index shows up as a value in Z. However, option 1 below is very slow, while option 2 raises a memory error.
# Option 1
for m in range(M):
    NZM[Z_index, m] += 1
# Option 2
NZM[Z_index,:] += 1
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-88-087ab1ede05d> in <module>()
2 # a memory error
3
----> 4 NZM[Z_index,:] += 1
MemoryError:
I want to add one to a row of this array each time it shows up in the Z_index. Is there a way to do this quickly and efficiently that I am unaware of? Thank you for taking the time to read this.
Upvotes: 0
My question is a duplicate of the question here. However, it arises from an inquiry which I think is unique, and it will be found more easily by people searching for an error caused by large duplicate indices.
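As for the MemoryError itself, here is a rough back-of-the-envelope estimate. It assumes the sizes from the question (M = 500, N = 10000, float64 entries) and assumes NumPy has to materialize the whole indexed selection as a temporary array before adding to it. The names temp_shape and temp_bytes are only for illustration.

```python
import numpy as np

M, N, T = 500, 10000, 6                    # sizes from the question

n_index = M * N                            # length of Z_index
itemsize = np.dtype(np.float64).itemsize   # 8 bytes per element

# NZM[Z_index, :] picks one full row of length M per index entry,
# so the temporary array has shape (M*N, M).
temp_shape = (n_index, M)
temp_bytes = n_index * M * itemsize

print(temp_shape)                # (5000000, 500)
print(temp_bytes / 1e9, "GB")    # 20.0 GB
```

So even though NZM itself is tiny (6 x 500), the duplicate index asks for a temporary on the order of 20 GB under these assumptions.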
A simple sanity check shows that += is not doing what I thought it was doing. I assumed that, given an index containing the same row several times, += would add one to that row once for each time the row appears in the index.
import numpy as np
import pandas as pd
NWZ = np.zeros((5,5), dtype=np.float64) + 1
index = np.repeat([0,3], [1, 3], axis=0)
index
array([0, 3, 3, 3])
NWZ[index,:] += 1
NWZ
array([[ 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1.]])
We can see this is not the case: giving += multiple instances of the same row only adds one to that row, once. Because += performs 'in place' operations, I assumed that this operation would return
array([[ 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 4., 4., 4., 4., 4.],
[ 1., 1., 1., 1., 1.]])
However, by calling .__iadd__(1) explicitly we see that addition is not performed cumulatively as it iterates through the index.
NWZ[index,:].__iadd__(1)
array([[ 2., 2., 2., 2., 2.],
[ 2., 2., 2., 2., 2.],
[ 2., 2., 2., 2., 2.],
[ 2., 2., 2., 2., 2.]])
You can go here for an intuitive explanation as to why this doesn't (and, the user asserts, shouldn't) happen.
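A small sketch of the mechanism, as I understand it: indexing with an integer array returns a copy, and the += form then writes that copy back row by row. The variable names picked and tmp are mine, and the "read, add, write back" expansion is a simplified description rather than NumPy's exact internals.

```python
import numpy as np

NWZ = np.ones((5, 5), dtype=np.float64)
index = np.array([0, 3, 3, 3])

# Indexing with an integer array returns a new array, not a view.
picked = NWZ[index, :]
print(np.shares_memory(picked, NWZ))   # False

# In-place addition on that copy leaves NWZ untouched.
picked += 1
print(NWZ[3])                          # [1. 1. 1. 1. 1.]

# NWZ[index, :] += 1 behaves roughly like: read a copy, add 1, write it back.
# Row 3 is written three times, each time with the same value (1 + 1).
tmp = NWZ[index, :] + 1
NWZ[index, :] = tmp
print(NWZ[3])                          # [2. 2. 2. 2. 2.]
```

Since every duplicate entry reads the original value before anything is written back, the repeated writes cannot accumulate.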
An alternative solution to my problem is to first create a frequency table of the number of times row n shows up in my duplicate index. Then, since I'm only doing addition, add those frequencies to their corresponding rows.
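# itemfreq was removed in newer SciPy; np.unique(index, return_counts=True) gives the same counts
from scipy.stats import itemfreq
NWZ = np.zeros((5,5), dtype=np.float64) + 1  # start again from a fresh matrix of ones
index_counts = itemfreq(index)  # column 0: unique row indices, column 1: their counts
N = len(index_counts[:,1])
# add each row's count to that row; reshape to (N,1) so it broadcasts across columns
NWZ[index_counts[:,0].astype(int),:] += index_counts[:,1].reshape(N,1)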
NWZ
array([[ 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 4., 4., 4., 4., 4.],
[ 1., 1., 1., 1., 1.]])
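For anyone who would rather stay within NumPy, here is a minimal sketch of two other ways to get the same cumulative result. It assumes the same toy index as above and that every index value is a valid row number; NWZ_a, NWZ_b and counts are names I made up for the example. np.add.at performs an unbuffered in-place add, so repeated indices accumulate, and np.bincount builds the frequency table directly.

```python
import numpy as np

index = np.array([0, 3, 3, 3])

# Option A: unbuffered in-place add, one increment per occurrence in index.
NWZ_a = np.ones((5, 5), dtype=np.float64)
np.add.at(NWZ_a, index, 1)

# Option B: count how often each row occurs, then add the counts by broadcasting.
# minlength pads the counts so there is one entry per row of the matrix.
NWZ_b = np.ones((5, 5), dtype=np.float64)
counts = np.bincount(index, minlength=NWZ_b.shape[0])
NWZ_b += counts[:, np.newaxis]

print(NWZ_a[[0, 3], 0])                # [2. 4.]
print(np.array_equal(NWZ_a, NWZ_b))    # True
```

Neither option builds a temporary with one row per index entry, so the size of the duplicate index should matter far less; for a very long index I would expect the bincount route to be the faster of the two, though I have not benchmarked it here.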
Upvotes: 1