RC0706

Reputation: 25

Out of Memory: Transferring Large Data from Amazon Redshift to Pandas

I have a large chunk of data (about 10M rows) in Amazon Redshift that I want to load into a Pandas DataFrame and store in a pickle file. However, it raises an "Out of Memory" exception, for the obvious reason that the data is too large. I have tried other approaches such as SQLAlchemy, but have not been able to crack the problem. Can anyone suggest a better way or code to work around it?

My current (simple) code snippet goes as below:

import psycopg2
import pandas as pd

# Connect to the Redshift cluster
cnxn = psycopg2.connect(dbname=<mydatabase>, host='my_redshift_Server_Name',
                        port='5439', user=<username>, password=<pwd>)

sql = "select * from mydatabase.mytable"

# This pulls the entire result set into memory at once
df = pd.read_sql(sql, cnxn)

pd.to_pickle(df, 'Base_Data.pkl')
print(df.head(50))

cnxn.close()

Upvotes: 0

Views: 2603

Answers (2)

Eugene Pakhomov

Reputation: 10632

If you're using pickle to just transfer the data somewhere else, I'd repeat the suggestion from AlexYes's answer - just use S3.

But if you want to be able to work with the data locally, you have to limit yourself to algorithms that do not need all of the data at once. In that case, I would suggest something like HDF5 or Parquet for data storage and Dask for data processing, since it doesn't require all of the data to reside in memory - it can work in chunks and in parallel. You can migrate your data from Redshift using this code:

from dask import dataframe as dd

# my_table, my_db_url and my_table_index_col are placeholders for your
# table name, SQLAlchemy-style connection URL and index column
d = dd.read_sql_table(my_table, my_db_url, index_col=my_table_index_col)

# Write out to HDF5; Dask writes partition by partition rather than
# materializing the whole table in memory
d.to_hdf('Base_Data.hd5', key='data')
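
Once the data is in HDF5, a rough sketch of working with it out of core with Dask (the column name below is just a placeholder):

from dask import dataframe as dd

# Load the HDF5 file lazily - nothing is read into memory yet
d = dd.read_hdf('Base_Data.hd5', key='data')

# Example out-of-core computation on a placeholder column;
# Dask reads and aggregates the data chunk by chunk
print(d['some_column'].mean().compute())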

Upvotes: 0

AlexYes

Reputation: 4208

1) Find the row count of the table and the largest chunk you can pull by adding order by [column] limit [number] offset 0 and increasing the limit to a reasonable maximum.

2) Add a loop that produces the SQL with the limit you found and an increasing offset, i.e. if you can pull 10k rows at a time your statements would be:

... limit 10000 offset 0;
... limit 10000 offset 10000;
... limit 10000 offset 20000;

until you reach the table row count

3) In the same loop, append every newly obtained set of rows to your DataFrame (a sketch of such a loop is shown below).
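
A rough sketch of that loop, using psycopg2 and pandas as in the question; the connection parameters, chunk size and the ordering column row_id are placeholders you would adapt:

import psycopg2
import pandas as pd

# Placeholder connection parameters - adapt to your cluster
cnxn = psycopg2.connect(dbname='mydatabase', host='my_redshift_Server_Name',
                        port='5439', user='username', password='pwd')

chunk_size = 10000   # the largest chunk you found in step 1
offset = 0
chunks = []

while True:
    sql = ("select * from mydatabase.mytable "
           "order by row_id "                    # placeholder ordering column
           "limit %d offset %d" % (chunk_size, offset))
    chunk = pd.read_sql(sql, cnxn)
    if chunk.empty:          # past the last row - stop
        break
    chunks.append(chunk)     # or chunk.to_pickle(...) to write one file per iteration
    offset += chunk_size

cnxn.close()

df = pd.concat(chunks, ignore_index=True)
pd.to_pickle(df, 'Base_Data.pkl')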

P.S. This will work assuming you don't run into memory/disk issues on the client end, which I can't guarantee, since you already hit such an issue on a cluster that is likely higher-grade hardware. To avoid that problem, you would simply write a new file on every iteration instead of appending.

Also, the whole approach is probably not right. You would be better off unloading the table to S3, which is quite fast because the data is copied from every node independently, and then doing whatever is needed against the flat files on S3 to transform them into the final format you need.
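
For reference, a rough sketch of such an unload run from Python; the S3 path and IAM role ARN are placeholders, and the role (or credentials) needs write access to the bucket:

import psycopg2

# Placeholder connection parameters - adapt to your cluster
cnxn = psycopg2.connect(dbname='mydatabase', host='my_redshift_Server_Name',
                        port='5439', user='username', password='pwd')
cnxn.autocommit = True

# UNLOAD writes the query result from every node in parallel to S3
unload_sql = """
    unload ('select * from mydatabase.mytable')
    to 's3://my-bucket/mytable_'
    iam_role 'arn:aws:iam::123456789012:role/myRedshiftRole'
    delimiter '|' gzip allowoverwrite;
"""

with cnxn.cursor() as cur:
    cur.execute(unload_sql)

cnxn.close()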

Upvotes: 1
