Reputation: 1960
I am trying to pull ~700k rows out of MySQL into a pandas DataFrame.
I kept getting the same error over and over again:
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\mysql\connector\network.py", line 245, in recv_plain
    read = self.sock.recv_into(packet_view, rest)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
Searching on StackOverflow, I found a great suggestion from ThePhysicist in another thread, so I modified my code as below. It won't run at all if the chunk size is over 200, and even at 200 the process eventually throws the "forcibly closed" error.
If anybody has any idea how to fix this, it would be very much appreciated. It is extremely frustrating, as I never have similar issues when using R.
Thanks
My current code (apologies for the formatting - can't figure out how to do it here):
from pandas import DataFrame
import time
import mysql.connector
import pandas as pd
import pandas.io.sql as psql

chunk_size = 300
offset = 0
cnx = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')

dfs = []
while True:
    print(offset)
    sql = "select a,b,c,d,e,f,g,h from table where a='xyz' order by b,c,d limit %d offset %d " % (chunk_size, offset)
    dfs.append(psql.read_sql(sql, cnx))
    offset += chunk_size
    # Stop once a chunk comes back shorter than chunk_size (no rows left)
    if len(dfs[-1]) < chunk_size:
        print("break")
        break

full_df = pd.concat(dfs)
EXPLAIN EXTENDED returns:
  select_type:   SIMPLE
  table:         table
  type:          ref
  possible_keys: idx_clientid,idx_char
  key:           idx_char
  key_len:       48
  ref:           const
  rows:          1173586
  filtered:      100
  Extra:         Using index condition
When I move the code to the AWS server where the database resides, it runs fine with no issues. The problem only seems to occur when I run the code from a machine that is not on AWS...
Upvotes: 1
Views: 1362
Reputation: 1270021
It sounds like you have a short timeout period and are probably lacking appropriate indexes. I would suggest creating an index on (a, b, c, d):
create index idx_table_a_b_c_d on table(a, b, c, d);
This needs to be executed only once in the database (and that can be done through Python).
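Since the question already uses mysql.connector, a minimal sketch of issuing that statement once from Python might look like the following (the credentials, table name, and index name are placeholders carried over from the question):

import mysql.connector

# Placeholder connection details, as in the question
cnx = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cur = cnx.cursor()
# Run once; skip (or wrap in try/except) if the index already exists
cur.execute("create index idx_table_a_b_c_d on table(a, b, c, d)")
cur.close()
cnx.close()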
If this is not possible, then we might guess that the order by is the time-consuming part. To handle that, remove the order by, load the data into Python, and then sort it there. This isn't my usual advice -- databases are better for data operations, but it might be necessary under some circumstances.
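Applied to the chunked loop from the question, that idea might look roughly like this (a sketch only, reusing the placeholder names from the question; the ordering is done client-side with pandas after all chunks are loaded):

import mysql.connector
import pandas as pd
import pandas.io.sql as psql

chunk_size = 300
offset = 0
cnx = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')

dfs = []
while True:
    # Same query as in the question, but without the order by
    sql = ("select a,b,c,d,e,f,g,h from table where a='xyz' "
           "limit %d offset %d" % (chunk_size, offset))
    dfs.append(psql.read_sql(sql, cnx))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break

# Sort in pandas instead of in MySQL
full_df = pd.concat(dfs).sort_values(['b', 'c', 'd'])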
Upvotes: 1