Reputation: 83
I am facing a problem where I have to generate large DataFrames in a loop (50 iterations computing every time two 2000 x 800 pandas DataFrames). I would like to keep the results in memory in a bigger DataFrame, or in a dictionary like structure. When using pandas.concat, I get a memory error at some point in the loop. The same happens when using numpy.append to store the results in a dictionary of numpy arrays rather than in a DataFrame. In both cases, I still have a lot of available memory (several GB). Is this too much data for pandas or numpy to process? Are there more memory-efficient ways to store my data without saving it on disk?
As an example, the following script fails as soon as nbIds is greater than 376:
import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
dataCollection2 = []
for bs in range(50):
    # each iteration produces two 2000 x nbIds DataFrames and stores them
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection1.append(newData1)
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection2.append(newData2)
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
dataCollection2 = pd.concat(dataCollection2).reset_index(drop=True)
The code below fails when nbIds is 665 or higher:
import pandas as pd
import numpy as np

nbIds = 665
dataids = range(nbIds)
dataCollection1 = dict((i, np.array([])) for i in dataids)
dataCollection2 = dict((i, np.array([])) for i in dataids)
for bs in range(50):
    newData1 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData1 = pd.DataFrame(newData1)
    newData2 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData2 = pd.DataFrame(newData2)
    # append each column to the corresponding per-id numpy array
    for i in dataids:
        dataCollection1[i] = np.append(dataCollection1[i],
                                       np.array(newData1[i]))
        dataCollection2[i] = np.append(dataCollection2[i],
                                       np.array(newData2[i]))
I do need to compute both DataFrames every time, and for each element i of dataids I need to obtain a pandas Series or a numpy array containing the 50 * 2000 numbers generated for i. Ideally, I need to be able to run this with nbIds equal to 800 or more.
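To make the required access pattern concrete, here is a minimal sketch (with a deliberately small nbIds so it runs anywhere; the variable names are mine): after concatenation, column i holds exactly those 50 * 2000 values.

import pandas as pd
import numpy as np

nbIds = 10  # deliberately small, for illustration only
chunks = [pd.DataFrame(np.random.uniform(size=(2000, nbIds))) for bs in range(50)]
combined = pd.concat(chunks).reset_index(drop=True)
# column i of the combined frame is a pandas Series of the 50 * 2000 values for id i
assert len(combined[0]) == 50 * 2000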
Is there a straightforward way of doing this?
I am using a 32-bit build of Python 2.7.5, with pandas 0.12.0 and numpy 1.7.1.
Thank you very much for your help!
Upvotes: 5
Views: 18939
Reputation: 83
As suggested by usethedeathstar, Boud and Jeff in the comments, switching to a 64-bit Python does the trick.
If losing precision is not an issue, using the float32 data type as suggested by Jeff also increases the amount of data that can be processed in a 32-bit environment.
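For anyone wondering why "several GB free" was not enough, here is a rough back-of-the-envelope estimate (my own numbers, assuming the default float64, i.e. 8 bytes per value):

# 50 iterations x 2000 rows x 800 columns of float64 per collection
bytes_per_collection = 50 * 2000 * 800 * 8   # 640,000,000 bytes, ~610 MiB
total_bytes = 2 * bytes_per_collection       # ~1.2 GiB for both collections
# pd.concat also needs a temporary copy of the data it is combining, pushing the
# peak beyond the ~2 GB of (often fragmented) address space a 32-bit process gets.
# With float32 every figure above is halved.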
Upvotes: 2
Reputation: 129048
This is essentially what you are doing. Note that it doesn't make much difference from a memory perspective whether you do the conversion to DataFrames before or after concatenating.
But you can specify dtype='float32' to effectively halve your memory usage.
In [45]: np.concatenate([ np.random.uniform(size=2000 * 1000).astype('float32').reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[45]: 400000000
In [46]: np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[46]: 800000000
In [47]: DataFrame(np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]))
Out[47]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Columns: 1000 entries, 0 to 999
dtypes: float64(1000)
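If it helps, the same dtype trick can be folded into the loop from the question; a minimal sketch (variable names are mine, and this is not the only way to write it):

import pandas as pd
import numpy as np

nbIds = 800
chunks1 = []
for bs in range(50):
    # cast to float32 before wrapping in a DataFrame to halve the footprint
    data1 = np.random.uniform(size=(2000, nbIds)).astype('float32')
    chunks1.append(pd.DataFrame(data1))
dataCollection1 = pd.concat(chunks1).reset_index(drop=True)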
Upvotes: 6
Reputation: 626
A straightforward way (albeit one that uses the hard drive) would be to simply use shelve, which is essentially an on-disk dict: http://docs.python.org/2/library/shelve.html
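A minimal sketch of how that could look for the DataFrames above (the filename and keys here are made up):

import shelve
import pandas as pd
import numpy as np

db = shelve.open('results.shelf')           # hypothetical filename
for bs in range(50):
    chunk = pd.DataFrame(np.random.uniform(size=(2000, 800)))
    db['chunk1_%d' % bs] = chunk            # pickled to disk, not kept in RAM
db.close()

# later: read back any stored chunk
db = shelve.open('results.shelf')
first_chunk = db['chunk1_0']
db.close()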
Upvotes: 2