Roberto
Roberto

Reputation: 41

read_sql() from MySQL is extremely slow

I am using MySQL with pandas and sqlalchemy. However, it is extremely slow. A simple query as this one takes more than 11 minutes to complete on a table with 11 milion rows. What actions could improve this performance? The table mentioned does not have a primary key and was indexed only by one column.

from sqlalchemy import create_engine
import pandas as pd
sql_engine_access = 'mysql+pymysql://root:[password]@localhost')
sql_engine = create_engine(sql_engine_access, echo=False)
script = 'select * from my_database.my_table'
df = pd.read_sql(script, con=self.sql_engine)

Upvotes: 4

Views: 4003

Answers (2)

Xiaoying Wang
Xiaoying Wang

Reputation: 349

You can try out our tool connectorx (pip install -U connectorx). It is implemented in Rust and targeting on improving the performance of pandas.read_sql. The API is basically the same with pandas. For example in your case the code would look like:

import connectorx as cx
conn_url = "mysql://root:[password]@localhost:port/my_database"
query = "select * from my_table"
df = cx.read_sql(conn_url, query)

If there is a numerical column that is evenly distributed like ID in your query result, you can also further speed up the process by leveraging multiple cores like this:

df = cx.read_sql(conn_url, query, partition_on="ID", partition_num=4)

This would split the entire query to four small ones by filtering on the ID column and connectorx will run them in parallel. You can check out here for more usage and examples.

Here is the benchmark result loading 60M rows x 16 columns from MySQL to pandas DataFrame using 4 cores: mysql time mysql memory

Upvotes: 4

Gord Thompson
Gord Thompson

Reputation: 123739

While perhaps not the entire cause of the slow performance, one contributing factor would be that PyMySQL (mysql+pymysql://) can be significantly slower than mysqlclient (mysql+mysqldb://) under heavy loads. In a very informal test (no multiple runs, no averaging, no server restarts) I saw the following results using df.read_sql_query() against a local MySQL database:

rows retrieved mysql+mysqldb (seconds) mysql+pymysql (seconds)
1_000_000 13.6 54.0
2_000_000 25.9 114.1
3_000_000 38.9 171.5
4_000_000 62.8 217.0
5_000_000 78.3 277.4

chart.png

Upvotes: 2

Related Questions