Reputation: 29
I want to retrieve about 100 million rows and 30 columns of data from an SQL database into a dataframe where I can sort and filter based on certain requirements. I only have 2 GB of memory. Everything comes to a standstill even though I am using chunksize. Here is my code.
import pymysql
import pymysql.cursors
import pandas as pd
from urllib import parse
from sqlalchemy import create_engine

chunksize = 100

sqlEngine = create_engine('mysql+pymysql://username:%s@localhost/db' % parse.unquote_plus('password'))
dbConnection = sqlEngine.connect()

for chunk in pd.read_sql("select * from db.db_table", dbConnection, chunksize=chunksize):
    print(chunk)
    # Do something with chunk (a dataframe holding chunksize rows of the 100 million)
I have reduced my chunksize but I am still not getting anything.
Upvotes: 1
Views: 2083
Reputation: 169012
To elaborate on my comment, something like this.
I foresee you're going to have a bad time trying to fit 100 million rows x 30 columns in 2 gigabytes of memory, though.
import itertools

df = None
for offset in itertools.count(step=chunksize):
    print("Reading chunk %d..." % offset)
    # Page through the table with LIMIT/OFFSET; ORDER BY keeps the paging stable
    query = "select * from db.db_table order by id limit %d offset %d" % (chunksize, offset)
    chunk_df = pd.read_sql(query, dbConnection)
    if chunk_df.empty:
        # No data in the new chunk, so we have read the whole table
        break
    if df is None:
        df = chunk_df
    else:
        df = pd.concat([df, chunk_df], copy=False)
# do things with df
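If the filtering criteria are known up front, a variation that keeps memory bounded is to filter each chunk as it arrives (or push the condition into the query's WHERE clause) and only accumulate the matching rows. A minimal sketch of that idea; some_column and the threshold are placeholders I made up, not anything from your schema:

import itertools
import pandas as pd

filtered_parts = []
for offset in itertools.count(step=chunksize):
    query = "select * from db.db_table order by id limit %d offset %d" % (chunksize, offset)
    chunk_df = pd.read_sql(query, dbConnection)
    if chunk_df.empty:
        break
    # Keep only the rows that satisfy the filter, so the full table never sits in memory at once
    filtered_parts.append(chunk_df[chunk_df["some_column"] > 42])  # placeholder condition

df = pd.concat(filtered_parts, ignore_index=True)
df = df.sort_values("some_column")  # sort the (much smaller) filtered result

Pushing the same condition into the SQL itself (select * from db.db_table where some_column > 42 ...) would be cheaper still, since MySQL then filters the rows before any data reaches pandas.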
Upvotes: 1