EngineJanwaar

Reputation: 450

How to Query and Export Large Data Set in Python Pandas

I have a SQL Server database table in Amazon RDS and I am running a Python script on an 8 GB server in AWS EC2. The Python code simply selects all the data from a large table and tries to convert it into a CSV. The EC2 instance quickly runs out of memory even though I am trying to extract the data year by year. I would like all the data to be extracted into a CSV (I don't necessarily need to use Pandas).

As of now, the Pandas dataframe code is very simple:

query = 'select * from table_name'
df = pd.read_sql(query, cnxn)
df.to_csv(target_name, index=False)

The error I am seeing is

Traceback (most recent call last):
  df = pd.read_sql(query,cnxn)
MemoryError

Upvotes: 1

Views: 5144

Answers (2)

Serge Ballesta

Reputation: 148910

If you can use read_sql with pandas, you certainly have a driver that lets you query the database directly through a DB-API 2.0 interface, and you can then write the result with the csv module one record at a time:

import csv

con = ...   # it depends on your current driver
curs = con.cursor()
curs.execute('select * from table_name')
with open(target_name, 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow([d[0] for d in curs.description])    # write the header line
    while True:                                      # loop on the cursor
        row = curs.fetchone()
        if not row: break                            # until the end of rows
        wr.writerow(row)                             # and write the row
curs.close()
con.close()
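
If fetching one row at a time turns out to be slow over the network, the same DB-API 2.0 cursor also offers fetchmany(). A minimal sketch of that variant, assuming con is the same connection object as above and target_name is the output path from the question:

import csv

con = ...   # same DB-API 2.0 connection as above
curs = con.cursor()
curs.execute('select * from table_name')
with open(target_name, 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow([d[0] for d in curs.description])    # write the header line
    while True:
        rows = curs.fetchmany(10000)                 # fetch up to 10k rows per round trip
        if not rows:                                 # empty list means the cursor is exhausted
            break
        wr.writerows(rows)                           # write the whole batch at once
curs.close()
con.close()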

Upvotes: 0

AKX

Reputation: 168986

You'll want to use your SQL database's native management tools instead of Python/Pandas here.

  • If it's a MySQL database,
    mysql ... --batch --execute='select * from table_name' > my-file.csv
  • If it's a PostgreSQL database, within psql do something like
    \copy (select * from table_name) To './my-file.csv' With CSV
  • If it's SQL Server, (via here)
    sqlcmd -S MyServer -d myDB -E -Q "select * from table_name" -o "my-file.csv" -h-1 -s"," -w 700
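
If the export still has to be driven from the Python script already running on the EC2 instance, any of these commands can be launched through the standard library's subprocess module. A minimal sketch, reusing the sqlcmd invocation above (the server, database, and file names are the placeholder values from that example):

import subprocess

# Placeholder server/database/output names taken from the sqlcmd example above.
subprocess.run(
    [
        'sqlcmd', '-S', 'MyServer', '-d', 'myDB', '-E',
        '-Q', 'select * from table_name',
        '-o', 'my-file.csv', '-h-1', '-s,', '-w', '700',
    ],
    check=True,   # raise CalledProcessError if sqlcmd exits with a non-zero status
)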

If you really do want to use Pandas though, you might be able to get away with the chunksize parameter (lower the chunk size if you're still running out of memory):

with open('my_csv.csv', 'w') as f:
    for i, partial_df in enumerate(pd.read_sql(query, cnxn, chunksize=100000)):
        print('Writing chunk %s' % i)
        partial_df.to_csv(f, index=False, header=(i == 0))

Upvotes: 4
