Reputation: 83
I am facing a problem where I have to generate large DataFrames in a loop (50 iterations computing every time two 2000 x 800 pandas DataFrames). I would like to keep the results in memory in a bigger DataFrame, or in a dictionary like structure. When using pandas.concat, I get a memory error at some point in the loop. The same happens when using numpy.append to store the results in a dictionary of numpy arrays rather than in a DataFrame. In both cases, I still have a lot of available memory (several GB). Is this too much data for pandas or numpy to process? Are there more memory-efficient ways to store my data without saving it on disk?
As an example, the following script fails as soon as nbIds is greater than 376:
import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
dataCollection2 = []
for bs in range(50):
    # each iteration produces two 2000 x nbIds DataFrames and stores them
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection1.append(newData1)
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection2.append(newData2)
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
dataCollection2 = pd.concat(dataCollection2).reset_index(drop=True)
The code below fails when nbIds is 665 or higher:
import pandas as pd
import numpy as np

nbIds = 665
dataids = range(nbIds)
dataCollection1 = dict((i, np.array([])) for i in dataids)
dataCollection2 = dict((i, np.array([])) for i in dataids)
for bs in range(50):
    newData1 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData1 = pd.DataFrame(newData1)
    newData2 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData2 = pd.DataFrame(newData2)
    # append each column to the corresponding per-id numpy array
    for i in dataids:
        dataCollection1[i] = np.append(dataCollection1[i],
                                       np.array(newData1[i]))
        dataCollection2[i] = np.append(dataCollection2[i],
                                       np.array(newData2[i]))
I do need to compute both DataFrames every time, and for each element i of dataids I need to obtain a pandas Series or a numpy array containing the 50 * 2000 numbers generated for i. Ideally, I need to be able to run this with nbIds equal to 800 or more.
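To make the required access pattern concrete, here is a minimal sketch (with a deliberately small nbIds so it runs anywhere; the variable names are mine): after concatenation, column i holds exactly those 50 * 2000 values.

import pandas as pd
import numpy as np

nbIds = 10  # deliberately small, for illustration only
chunks = [pd.DataFrame(np.random.uniform(size=(2000, nbIds))) for bs in range(50)]
combined = pd.concat(chunks).reset_index(drop=True)
# column i of the combined frame is a pandas Series of the 50 * 2000 values for id i
assert len(combined[0]) == 50 * 2000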
Is there a straightforward way of doing this?
I am using a 32-bit build of Python 2.7.5, with pandas 0.12.0 and numpy 1.7.1.
Thank you very much for your help!
Upvotes: 5
Views: 18939
Reputation: 83
As suggested by usethedeathstar, Boud and Jeff in the comments, switching to a 64-bit Python does the trick.
If losing precision is not an issue, using the float32 data type as suggested by Jeff also increases the amount of data that can be processed in a 32-bit environment.
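For anyone wondering why "several GB free" was not enough, here is a rough back-of-the-envelope estimate (my own numbers, assuming the default float64, i.e. 8 bytes per value):

# 50 iterations x 2000 rows x 800 columns of float64 per collection
bytes_per_collection = 50 * 2000 * 800 * 8   # 640,000,000 bytes, ~610 MiB
total_bytes = 2 * bytes_per_collection       # ~1.2 GiB for both collections
# pd.concat also needs a temporary copy of the data it is combining, pushing the
# peak beyond the ~2 GB of (often fragmented) address space a 32-bit process gets.
# With float32 every figure above is halved.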
Upvotes: 2
Reputation: 129048
This is essentially what you are doing. Note that it doesn't make much difference from a memory perspective whether you do the conversion to DataFrames before or after concatenating.
But you can specify dtype='float32' to effectively halve your memory usage.
In [45]: np.concatenate([ np.random.uniform(size=2000 * 1000).astype('float32').reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[45]: 400000000
In [46]: np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[46]: 800000000
In [47]: DataFrame(np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]))
Out[47]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Columns: 1000 entries, 0 to 999
dtypes: float64(1000)
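If it helps, the same dtype trick can be folded into the loop from the question; a minimal sketch (variable names are mine, and this is not the only way to write it):

import pandas as pd
import numpy as np

nbIds = 800
chunks1 = []
for bs in range(50):
    # cast to float32 before wrapping in a DataFrame to halve the footprint
    data1 = np.random.uniform(size=(2000, nbIds)).astype('float32')
    chunks1.append(pd.DataFrame(data1))
dataCollection1 = pd.concat(chunks1).reset_index(drop=True)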
Upvotes: 6
Reputation: 626
A straightforward way (albeit one that uses the hard drive) would be to simply use shelve, which is essentially an on-disk dict: http://docs.python.org/2/library/shelve.html
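A minimal sketch of how that could look for the DataFrames above (the filename and keys here are made up):

import shelve
import pandas as pd
import numpy as np

db = shelve.open('results.shelf')           # hypothetical filename
for bs in range(50):
    chunk = pd.DataFrame(np.random.uniform(size=(2000, 800)))
    db['chunk1_%d' % bs] = chunk            # pickled to disk, not kept in RAM
db.close()

# later: read back any stored chunk
db = shelve.open('results.shelf')
first_chunk = db['chunk1_0']
db.close()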
Upvotes: 2