Reputation: 33
I have a CGI page in Python that interacts with data in SQL Server using pandas.
Summary
A query is run based on user interaction with data from another query; both are loaded using the pandas.read_sql() function. For some reason, the second query runs much more slowly in Python than it does when run directly on the database (in SQL Server Management Studio). After some testing, I found that the query is much slower when I pass the parameter using params=[p], which I was doing initially and would prefer to keep doing, than when I inline the value in the query string (code below). I'm not sure why this is, and thought someone out there might have an idea.
Code
# Method 1: passing the value via params=[]
query = "select * from FloorPlans where hydroid = ? order by plan_date desc"
t1 = datetime.datetime.now()
df2 = pd.read_sql(query, conn, params=[row["HydroID"]])
t2 = datetime.datetime.now()
print(t2-t1)
# Method 2: inlining the value in the query string
query = "select * from FloorPlans where hydroid = '" + row["HydroID"] + "' order by plan_date desc"
t3 = datetime.datetime.now()
df2 = pd.read_sql(query, conn)
t4 = datetime.datetime.now()
print(t4-t3)
Times
Method 1: ~210.0 seconds
Method 2: ~0.05 seconds
In SSMS: ~0.04 seconds
Does anyone have any idea why this might be happening? I've checked that the params method is sending a string as expected (by wrapping the value in str()), and I've tested with a variety of values. I have a clustered index on the hydroid column, but that shouldn't matter since it's the same value in all three cases. I also have two other queries doing pretty much the same thing (select * on a varchar column with a clustered index) against other tables, and they don't have this problem.
The only thing I've been able to think of so far is that in the FloorPlans table, hydroid is currently always a sequence of digits (this is likely to change in the future, since other tables containing the same identifier have alphanumeric hydroids), and that despite my ensuring the variable is a string, something in pandas is converting it back to an int before sending it to SQL Server, which causes issues with the query.
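For what it's worth, a quick sanity check along these lines (reusing the variables from the code above) shows the value really is a str just before the call:
# Confirm the parameter is a Python str before read_sql() sends it
p = str(row["HydroID"])
print(type(p), repr(p))  # e.g. <class 'str'> '12345'
df2 = pd.read_sql(query, conn, params=[p])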
Upvotes: 3
Views: 3480
Reputation: 123739
If you are using Python 3 then all strings are Unicode. The Python code ...
sql = "SELECT * FROM MillionRows WHERE varchar_column = ?"
crsr.execute(sql, 'record012345')
... gets processed on the SQL Server as
declare @p1 int
set @p1=1
exec sp_prepexec @p1 output,N'@P1 nvarchar(24)',N'SELECT * FROM MillionRows WHERE varchar_column = @P1',N'record012345'
select @p1
Notice that the parameter value is Unicode: nvarchar(24)
Now if we check the actual execution plan of the equivalent query in SSMS ...
SELECT * FROM MillionRows WHERE varchar_column = N'record012345'
... we see
Physical operation: Index Scan
Actual Number of Rows: 1
Number of Rows Read: 1000000
On the other hand, if we run a query that uses a varchar value ...
SELECT * FROM MillionRows WHERE varchar_column = 'record012345'
... the execution plan shows us
Physical operation: Index Seek
Actual Number of Rows: 1
Number of Rows Read: 1
The difference is due to the fact that the first query must do a Scan for the (implicitly converted) nvarchar value against the varchar index, while the second query is able to do a Seek instead of a Scan.
The fix for the raw Python code is to use setinputsizes to specify that the query parameter should be a varchar ...
sql = "SELECT * FROM MillionRows WHERE varchar_column = ?"
crsr.setinputsizes([(pyodbc.SQL_VARCHAR, 25)])
crsr.execute(sql, 'record012345')
... which gets processed on the SQL Server as
declare @p1 int
set @p1=1
exec sp_prepexec @p1 output,N'@P1 varchar(25)',N'SELECT * FROM MillionRows WHERE varchar_column = @P1','record012345'
select @p1
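Notice that the parameter value is now varchar(25), so no implicit conversion is needed and SQL Server can Seek on the varchar index.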
A workaround for pandas read_sql_query is to CAST the parameter value in the SQL query itself
sql = "SELECT * FROM MillionRows WHERE varchar_column = CAST(? AS varchar(25))"
df = pd.read_sql_query(sql, engine, params=['record012345'])
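If you want the setinputsizes fix with pandas (which does not expose the underlying cursor), another option is to run the query through the pyodbc cursor yourself and build the DataFrame from the results. A minimal sketch, assuming conn is a plain pyodbc connection and reusing the table and column names from the example above:
import pandas as pd
import pyodbc

sql = "SELECT * FROM MillionRows WHERE varchar_column = ?"
crsr = conn.cursor()
crsr.setinputsizes([(pyodbc.SQL_VARCHAR, 25)])  # send varchar, not nvarchar
crsr.execute(sql, 'record012345')
columns = [col[0] for col in crsr.description]  # column names from the cursor
df = pd.DataFrame.from_records(crsr.fetchall(), columns=columns)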
Upvotes: 5