EngineJanwaar

Reputation: 450

How to Query and Export Large Data Set in Python Pandas

I have a SQL Server database table in Amazon RDS and I am running a Python script on an 8 GB server in AWS EC2. The Python code simply selects all the data from a large table and tries to convert it into a CSV. The EC2 instance quickly runs out of memory even though I am trying to extract the data year by year. I would like all the data to be extracted into a CSV (I don't necessarily need to use Pandas).

As of now, the Pandas dataframe code is very simple:

query = 'select * from table_name'
df = pd.read_sql(query, cnxn)
df.to_csv(target_name, index=False)

The error I am seeing is

Traceback (most recent call last):
  df = pd.read_sql(query,cnxn)
MemoryError

Upvotes: 1

Views: 5144

Answers (2)

Serge Ballesta

Reputation: 148910

If you can use read_sql with pandas, you certainly have a driver that lets you query the database directly through a DB-API 2.0 interface, and you can then write the result with the csv module one record at a time:

import csv

con = ...   # it depends on your current driver
curs = con.cursor()
curs.execute('select * from table_name')
with open(target_name, 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow([d[0] for d in curs.description])    # write the header line
    while True:                                      # loop on the cursor
        row = curs.fetchone()
        if not row: break                            # until the end of rows
        wr.writerow(row)                             # and write the row
curs.close()
con.close()
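
If fetching one row at a time turns out to be slow over the network, the same DB-API 2.0 cursor also offers fetchmany(). A minimal sketch of that variant, assuming con is the same connection object as above and target_name is the output path from the question:

import csv

con = ...   # same DB-API 2.0 connection as above
curs = con.cursor()
curs.execute('select * from table_name')
with open(target_name, 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow([d[0] for d in curs.description])    # write the header line
    while True:
        rows = curs.fetchmany(10000)                 # fetch up to 10k rows per round trip
        if not rows:                                 # empty list means the cursor is exhausted
            break
        wr.writerows(rows)                           # write the whole batch at once
curs.close()
con.close()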

Upvotes: 0

AKX

Reputation: 168986

You'll want to use your SQL database's native management tools instead of Python/Pandas here.

  • If it's a MySQL database,
    mysql ... --batch --execute='select * from table_name' > my-file.csv
  • If it's a PostgreSQL database, within psql do something like
    \copy (select * from table_name) To './my-file.csv' With CSV
  • If it's SQL Server, (via here)
    sqlcmd -S MyServer -d myDB -E -Q "select * from table_name" -o "my-file.csv" -h-1 -s"," -w 700
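
If the export still has to be driven from the Python script already running on the EC2 instance, any of these commands can be launched through the standard library's subprocess module. A minimal sketch, reusing the sqlcmd invocation above (the server, database, and file names are the placeholder values from that example):

import subprocess

# Placeholder server/database/output names taken from the sqlcmd example above.
subprocess.run(
    [
        'sqlcmd', '-S', 'MyServer', '-d', 'myDB', '-E',
        '-Q', 'select * from table_name',
        '-o', 'my-file.csv', '-h-1', '-s,', '-w', '700',
    ],
    check=True,   # raise CalledProcessError if sqlcmd exits with a non-zero status
)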

If you really do want to use Pandas though, you might be able to get away with the chunksize parameter (lower the chunk size if you're still running out of memory):

with open('my_csv.csv', 'w') as f:
    for i, partial_df in enumerate(pd.read_sql(query, cnxn, chunksize=100000)):
        print('Writing chunk %s' % i)
        partial_df.to_csv(f, index=False, header=(i == 0))

Upvotes: 4
