Reputation: 45
I am trying to retrieve a large amount of data (more than 7 million rows) from a database and save it as a flat file. The data is retrieved with Python code (Python calls a stored procedure). But I am having a problem here: the process eats up so much memory that the Unix machine kills it automatically. I am using read_sql_query to read the data and to_csv to write the flat file. So I wanted to ask if there is a way to solve this problem, maybe by reading only a few thousand rows at a time, saving them, and moving on to the next chunk. I even tried the chunksize parameter, but it does not seem to resolve the issue.
Any help or suggestion will be greatly appreciated.
Upvotes: 3
Views: 11946
Reputation: 4017
Pandas DataFrames are awesome, and if the data is a time series and/or needs to be modified, I would use read_sql_query() as suggested by @PaSTE.
However, if you simply want to read from the database and immediately process the data into another format, and you are happy using some of the Python primitives, I would just take advantage of the 'raw' built-in DB-API (the same API pandas uses) and read row by row, like this:
import MySQLdb
db = MySQLdb.connect(host='hostname', user='john', passwd='doe', db='penguins')
cursor = db.cursor()
cursor.execute(f"SELECT * FROM your_table;")
for row in cursor:
print(row)
Or use chunks like this:
import MySQLdb
db = MySQLdb.connect(host='hostname', user='john', passwd='doe', db='penguins')
cursor = db.cursor()
cursor.execute("SELECT COUNT(*) FROM your_table")
row_count = cursor.fetchone()[0]
chunk_size = 1000
for offset in range(0, row_count, chunk_size):
    cursor.execute(f"SELECT * FROM your_table LIMIT {chunk_size} OFFSET {offset};")
    for row in cursor:
        print(row)
This is not driver specific; it works with every driver I know of, so use whichever one you prefer.
Enjoy!
Upvotes: 3
Reputation: 4548
When you use chunksize in read_sql_query, you can iterate over the result to avoid loading everything into memory at once. However, you also have to write out to the CSV file in chunks to make sure you aren't just copying the results of the query into a new, gigantic DataFrame chunk by chunk. Be careful to only write the column headers once. Here is an example using pandas:
import pandas as pd
dbcon = ...  # whatever
with open("out.csv", "w") as fh:
    chunks = pd.read_sql_query("SELECT * FROM table_name", dbcon, chunksize=10000)
    next(chunks).to_csv(fh, index=False)  # write the first chunk with the column names,
                                          # but ignore the index (which will be screwed up anyway due to the chunking)
    for chunk in chunks:
        chunk.to_csv(fh, index=False, header=False)  # skip the column names from now on
You don't have to ignore the index when writing the CSV if you explicitly set index_col in the call to read_sql_query.
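For instance, a minimal variant of the loop above (inside the same with open(...) as fh: block), assuming the table has an id column you want to use as the index; the column name is just an illustration:
chunks = pd.read_sql_query("SELECT * FROM table_name", dbcon,
                           index_col="id", chunksize=10000)
next(chunks).to_csv(fh)             # the index is now the real id column, so keep it
for chunk in chunks:
    chunk.to_csv(fh, header=False)  # the index is written by default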
Upvotes: 9
Reputation: 2459
Rather than using the pandas library, make a database connection directly (using psycopg2, pymysql, pyodbc, or another connector library as appropriate) and use Python's DB-API to read and write rows concurrently, either one by one or in whatever size chunks you can handle.
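For example, a minimal sketch of that idea with pymysql and the standard csv module (the connection details and table name are placeholders; tune the fetchmany size to whatever fits comfortably in memory):
import csv
import pymysql  # or psycopg2 / pyodbc, depending on your database

db = pymysql.connect(host='hostname', user='john', password='doe', database='penguins')
cursor = db.cursor()
cursor.execute("SELECT * FROM your_table")

with open("out.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(col[0] for col in cursor.description)  # header row from the cursor metadata
    while True:
        rows = cursor.fetchmany(10000)  # pull a bounded batch of rows at a time
        if not rows:
            break
        writer.writerows(rows)          # append this batch to the file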
Upvotes: 1