Reputation: 45
I am trying to retrieve a large amount of data (more than 7 million rows) from a database and save it as a flat file. The data is retrieved with Python code (Python calls a stored procedure). But I am having a problem here: the process eats up so much memory that the Unix machine kills it automatically. I am using read_sql_query to read the data and to_csv to write the flat file. So I wanted to ask if there is a way to solve this problem, maybe by reading only a few thousand rows at a time, saving them, and moving on to the next chunk. I even tried the chunksize parameter, but it does not seem to resolve the issue.
Any help or suggestion will be greatly appreciated.
Upvotes: 3
Views: 11946
Reputation: 4017
Pandas DataFrames are awesome, and if the data is a time series and/or needs to be modified, I would use read_sql_query() as suggested by @PaSTE.
However, if you simply want to read from the database and immediately process the data into another format, and you are happy using some of the Python primitives, I would just take advantage of the 'raw' built-in DB-API (the same API pandas uses) and read row by row, like this:
import MySQLdb
db = MySQLdb.connect(host='hostname', user='john', passwd='doe', db='penguins')
cursor = db.cursor()
cursor.execute(f"SELECT * FROM your_table;")
for row in cursor:
print(row)
Or use chunks like this:
import MySQLdb
db = MySQLdb.connect(host='hostname', user='john', passwd='doe', db='penguins')
cursor = db.cursor()
cursor.execute("SELECT COUNT(*) FROM your_table")
row_count = cursor.fetchone()[0]
chunk_size = 1000
for offset in range(0, row_count, chunk_size):
    cursor.execute(f"SELECT * FROM your_table LIMIT {chunk_size} OFFSET {offset};")
    for row in cursor:
        print(row)
This is not driver specific; it works with every driver I know of, so use whichever one you prefer.
Enjoy!
Upvotes: 3
Reputation: 4548
When you use chunksize in read_sql_query, you can iterate over the result to avoid loading everything into memory at once. However, you also have to write out to the CSV file in chunks to make sure you aren't just copying the results of the query into a new, gigantic DataFrame chunk by chunk. Be careful to only write the column headers once. Here is an example using pandas:
import pandas as pd
dbcon = ...  # whatever
with open("out.csv", "w") as fh:
    chunks = pd.read_sql_query("SELECT * FROM table_name", dbcon, chunksize=10000)
    next(chunks).to_csv(fh, index=False)  # write the first chunk with the column names,
                                          # but ignore the index (which will be screwed up anyway due to the chunking)
    for chunk in chunks:
        chunk.to_csv(fh, index=False, header=False)  # skip the column names from now on
You don't have to ignore the index when writing the CSV if you explicitly set index_col in the call to read_sql_query.
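For instance, a minimal variant of the loop above (inside the same with open(...) as fh: block), assuming the table has an id column you want to use as the index; the column name is just an illustration:
chunks = pd.read_sql_query("SELECT * FROM table_name", dbcon,
                           index_col="id", chunksize=10000)
next(chunks).to_csv(fh)             # the index is now the real id column, so keep it
for chunk in chunks:
    chunk.to_csv(fh, header=False)  # the index is written by default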
Upvotes: 9
Reputation: 2459
Rather than using the pandas library, make a database connection directly (using psycopg2, pymysql, pyodbc, or another connector library as appropriate) and use Python's DB-API to read and write rows concurrently, either one by one or in whatever size chunks you can handle.
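For example, a minimal sketch of that idea with pymysql and the standard csv module (the connection details and table name are placeholders; tune the fetchmany size to whatever fits comfortably in memory):
import csv
import pymysql  # or psycopg2 / pyodbc, depending on your database

db = pymysql.connect(host='hostname', user='john', password='doe', database='penguins')
cursor = db.cursor()
cursor.execute("SELECT * FROM your_table")

with open("out.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(col[0] for col in cursor.description)  # header row from the cursor metadata
    while True:
        rows = cursor.fetchmany(10000)  # pull a bounded batch of rows at a time
        if not rows:
            break
        writer.writerows(rows)          # append this batch to the file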
Upvotes: 1