Henry

Reputation: 23

How do I avoid a memory error on EC2 when loading a huge table into a Pandas dataframe?

I tried to connect to Redshift and load my huge fact table into a pandas dataframe as below, and I always encounter a memory error when I execute the script. I am thinking that either the chunked-loading part is incorrect, or I shouldn't load the whole fact table into a dataframe at all. Can someone point me in the right direction?

import pandas as pd
import psycopg2

# Connect to Redshift (credentials omitted)
conn = psycopg2.connect(dbname='', user='', host='', port='',
                        password='')
df = pd.DataFrame()

# Read the fact table in chunks of 1000 rows, but append every
# chunk back onto a single dataframe
for chunk in pd.read_sql(
        "select * from MyFactTable ",
        con=conn, chunksize=1000):
    df = df.append(chunk)

Upvotes: 0

Views: 278

Answers (1)

Jiří Baum

Reputation: 6930

Yeah, the df = df.append(chunk) part means that you're loading the whole table into memory at once, so you might as well not chunk.

If possible, process each chunk separately; depending on your calculation, that might be easy or difficult. You should also push as much of the processing as possible into the SQL query (eg. if you only need some rows, use a WHERE clause) — see the sketch below.
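For example, a minimal sketch of per-chunk processing, assuming all you need is a simple aggregate (the column names amount and sale_date, the date filter and the chunk size are made up for illustration) — each chunk is reduced and then discarded instead of being appended to an ever-growing dataframe:

import pandas as pd
import psycopg2

conn = psycopg2.connect(dbname='', user='', host='', port='',
                        password='')

# Hypothetical: push the row filter into SQL and only keep a running total.
running_total = 0
for chunk in pd.read_sql(
        "select amount from MyFactTable where sale_date >= '2021-01-01'",
        con=conn, chunksize=10000):
    # Only the current chunk is held in memory; aggregate it, then move on.
    running_total += chunk['amount'].sum()

print(running_total)

This way peak memory use is bounded by the chunk size rather than by the size of the whole fact table.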

PS: Depending on how large the table is and what you need to do with it, you could potentially also choose an EC2 instance with more memory; not an elegant solution, but sometimes throwing resources at a problem works...

Upvotes: 1
