user2461769
user2461769

Reputation: 33

Why is pandas.read_sql with params so much slower than with inline params

I have cgi page in python that interacts with data in SQL Server using pandas.

Summary

A query is run based on user interaction with some other data from another query. These are both loaded using the pandas.read_sql() function. For some reason, the second query was running much slower than it should have been when comparing it in python to when it's directly run on the database (in SQL Server Management Studio). After some testing, I found out that the query is much slower when I pass the param using params=[p], which I was doing initially and would prefer to do, rather than inline in the query (code below). I'm not sure why this is so and thought someone out there might have an idea.

Code

#Method 1: using param=[] 
query = "select * from FloorPlans where hydroid = ? order by plan_date desc"
t1 = datetime.datetime.now()
df2 = pd.read_sql(query, conn, params=[row["HydroID"]])
t2 = datetime.datetime.now()
print(t2-t1)

#Method 2: inline
query = "select * from FloorPlans where hydroid = '" + row["HydroID"] + "' order by plan_date desc"
t3 = datetime.datetime.now()
df2 = pd.read_sql(query, conn)
t4 = datetime.datetime.now()
print(t4-t3)

Times

Method 1: ~210.0 seconds

Method 2: ~0.05 seconds

In SSMS: ~0.04 seconds


Does anyone have any idea why this might be happening? I've checked to make sure the param method is sending a string as expected (by wrapping it in str()) and I've checked with a variety of values. I have a clustered index on the hydrioid column, but that shouldn't matter since it's the same value in all three cases. I also have two other queries doing pretty much the same thing (select * on varchar column with a clustered index) on other tables and they don't have this problem.

The only thing I have been able to think of so far is that in the FloorPlans table, hydrioid is always a sequence of numbers at the moment (this is likely to change in the future as other tables which contain the same identifier have instances with alphanumeric hydroids) and despite ensuring the variable is a string, something in pandas is converting it back to an int prior to sending to SQL, which causes issues with the query.

Upvotes: 3

Views: 3480

Answers (1)

Gord Thompson
Gord Thompson

Reputation: 123739

If you are using Python_3 then all strings are Unicode. The Python code ...

sql = "SELECT * FROM MillionRows WHERE varchar_column = ?"
crsr.execute(sql, 'record012345')

... gets processed on the SQL Server as

declare @p1 int
set @p1=1
exec sp_prepexec @p1 output,N'@P1 nvarchar(24)',N'SELECT * FROM MillionRows WHERE varchar_column = @P1',N'record012345'
select @p1

Notice that the parameter value is Unicode: nvarchar(24)

Now if we check the actual execution plan of the equivalent query in SSMS ...

SELECT * FROM MillionRows WHERE varchar_column = N'record012345'

... we see

   Physical operation: Index Scan
Actual Number of Rows: 1
  Number of Rows Read: 1000000

On the other hand, if we run a query that uses a varchar value ...

SELECT * FROM MillionRows WHERE varchar_column = 'record012345'

... the execution plan shows us

   Physical operation: Index Seek
Actual Number of Rows: 1
  Number of Rows Read: 1

The difference is due to the fact that the first query must do a Scan for the (implicitly converted) nvarchar value against the varchar index, while the second query is able to do a Seek instead of a Scan.

The fix for the raw Python code is to use setinputsizes to specify that the query parameter should be a varchar ...

sql = "SELECT * FROM MillionRows WHERE varchar_column = ?"
crsr.setinputsizes([(pyodbc.SQL_VARCHAR, 25)])
crsr.execute(sql, 'record012345')

.. which gets processed on the SQL Server as

declare @p1 int
set @p1=1
exec sp_prepexec @p1 output,N'@P1 varchar(25)',N'SELECT * FROM MillionRows WHERE varchar_column = @P1','record012345'
select @p1

A workaround for pandas read_sql_query is to CAST the parameter value in the SQL query itself

sql = "SELECT * FROM MillionRows WHERE varchar_column = CAST(? AS varchar(25))"
df = pd.read_sql_query(sql, engine, params=['record012345'])

Upvotes: 5

Related Questions