Getting Large Dataset Out of MySQL into Pandas DataFrame Keeps Failing, Even With Chunksize

I am trying to pull ~700k rows out of MySQL into a pandas DataFrame.

I kept getting the same error over and over again:

Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\mysql\connector\network.py", line 245, in recv_plain
    read = self.sock.recv_into(packet_view, rest)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Searching on Stack Overflow, I found a great suggestion from ThePhysicist in another thread, so I modified my code as below. It won't run at all if the chunk size is over 200, and even at 200 the process eventually throws the "forcibly closed" error.

If anybody has any idea how to fix this, it would be very much appreciated. It is extremely frustrating, as I never have similar issues when using R.

Thanks

My current code (apologies for the formatting - can't figure out how to do it here):

import mysql.connector
import pandas as pd
import pandas.io.sql as psql

chunk_size = 300
offset = 0

cnx = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
dfs = []

while True:
    print(offset)
    # fetch one chunk at a time via LIMIT/OFFSET
    sql = "select a,b,c,d,e,f,g,h from table where a='xyz' order by b,c,d limit %d offset %d" % (chunk_size, offset)
    dfs.append(psql.read_sql(sql, cnx))
    offset += chunk_size
    # a short chunk means the last page of results has been reached
    if len(dfs[-1]) < chunk_size:
        print("break")
        break

full_df = pd.concat(dfs)
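
A minimal sketch of the single-query alternative using pandas' built-in chunksize argument (same placeholder credentials and table as above), in case that is relevant -- it runs the query once and fetches rows in batches instead of re-running LIMIT/OFFSET for every chunk:

import mysql.connector
import pandas as pd

cnx = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
sql = "select a,b,c,d,e,f,g,h from table where a='xyz' order by b,c,d"

# read_sql with chunksize executes the query once and yields DataFrames
# of up to 10,000 rows each as the rows are fetched
chunks = pd.read_sql(sql, cnx, chunksize=10000)
full_df = pd.concat(chunks, ignore_index=True)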

EXPLAIN EXTENDED returns:

select_type:   SIMPLE
table:         table
type:          ref
possible_keys: idx_clientid,idx_char
key:           idx_char
key_len:       48
ref:           const
rows:          1173586
filtered:      100
Extra:         Using index condition

When I move the code to the AWS server where the database resides, it runs fine with no issues. The problem only appears when I run the code from a machine that is not on AWS...

Upvotes: 1

Views: 1362

Answers (1)

Gordon Linoff

Reputation: 1270021

It sounds like you have a short timeout period and are probably lacking appropriate indexes. I would suggest creating an index on (a, b, c, d):

create index idx_table_a_b_c_d on table(a, b, c, d);

This needs to be executed only once in the database (and that can be done through Python).
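
For example, a minimal sketch of running that DDL through mysql.connector (credentials are placeholders, as in the question); building the index on a large table can take a while:

import mysql.connector

cnx = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cur = cnx.cursor()

# one-time DDL; MySQL commits DDL statements implicitly
cur.execute("create index idx_table_a_b_c_d on table(a, b, c, d)")

cur.close()
cnx.close()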

If this is not possible, then we might guess that the order by is the time-consuming part. To handle that, remove the order by, load the data in Python, and then sort it in Python. This isn't my usual advice -- databases are better for data operations -- but it might be necessary under some circumstances.
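
A sketch of that client-side sort, assuming the chunks have already been collected into dfs as in the question's loop (with the order by removed from the query):

import pandas as pd

# concatenate the chunks, then sort in pandas instead of in MySQL
full_df = pd.concat(dfs, ignore_index=True)
full_df = full_df.sort_values(by=['b', 'c', 'd']).reset_index(drop=True)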

Upvotes: 1
