Reputation: 3781
I'm trying to migrate R code to Python 2.7 in order to compare the two. The first problem I hit is with the ODBC connection: R is much faster than Python, but since I'm a newbie in Python I'm not sure I'm using the right package.
In R I write:
ptm <- proc.time()
require(RODBC)
dbXX <- odbcDriverConnect('driver={SQL Server};
server=s001111;database=XX;trusted_connection=true')
rech <- sqlQuery(dbXX, "select top 10000 * from XX.dbo.table", as.is=T)
proc.time() - ptm
and I get:
> proc.time() - ptm
user system elapsed
2.47 0.11 2.87
I have downloaded Anaconda for Python 2.7 on Windows 7 64-bit. So in Spyder I write:
import time
import pyodbc
import pandas as pd

sql = 'select top 10000 * from XX.dbo.table'
cnn = pyodbc.connect('DRIVER={SQL Server};SERVER=s001111;DATABASE=XX;Trusted_Connection=yes')
start = time.time()
data = pd.read_sql(sql, cnn)
end = time.time()
print(end - start)
This takes 6.35 secs
I've also tried with pymssql:
import pymssql
conn = pymssql.connect(server='s001111', database='XX')
start = time.time()
data = pd.read_sql(sql, conn)
end = time.time()
print(end - start)
This takes 38.3 secs!
The real query needs to read a table whose dimensions are 220,000 rows by 353 columns, and apply a filter (with WHERE).
I only need to extract data from the db.
Is there a way to do it faster in Python 2.7?
I've found pyodbc-memory usage relation caused by SQL Server, but I guess that if it were an SQL Server problem it would behave the same way in R, wouldn't it?
I've also found IOPro, but it's not free!
At this point I was wondering if the problem was the ODBC connection or pandas itself, so I tried:
cur = conn.cursor()
start = time.time()
cur.execute(sql)
tabla = cur.fetchall()
end = time.time()
print(end - start)
But it took 29.29 secs.
So, still: how is it possible that R is so much faster than Python at retrieving data from a Microsoft SQL Server database?
Upvotes: 2
Views: 5431
Reputation: 5950
Both RODBC and pyodbc will probably spend most of their time transferring data from the database server to the local computer where you run your query. A very important parameter here is the number of rows moved from the database server to the local computer on each fetch loop. Both RODBC and pyodbc let you configure this parameter.
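For example, a minimal sketch (using the same placeholder DSN as the code below): in pyodbc the batch size is whatever you pass to fetchmany, or the cursor's standard DB-API arraysize attribute; in RODBC it is the rows_at_time argument.
import pyodbc

ch = pyodbc.connect('DSN=xyz;UID=xyz;PWD=xyz')
curs = ch.cursor()
curs.arraysize = 1024  # batch size used by fetchmany() when called without an argument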
Now, to compare apples with apples: I have used the approach below to compare RODBC and pyodbc against three different databases (not including SQL Server), and I have always found their performance to be comparable.
Here is the R code:
library(RODBC)
ch <- odbcConnect("DWN", uid="xyz", pwd="xyz", rows_at_time=1024)
ds <- sqlQuery(ch, "select * from large_table limit 100000")
And here you have the python code:
>>> import pyodbc
>>> ch = pyodbc.connect('DSN=xyz;UID=xyz;PWD=xyz')
>>> curs = ch.cursor()
>>> curs.execute("select * from large_table limit 100000")
>>> while True:
...     rows = curs.fetchmany(1024)
...     if not rows:
...         break
...
>>>
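If you want the result as a DataFrame, as in the question, here is a rough sketch of the same batched fetch (assuming pandas is installed; it reuses the placeholder DSN above and takes the column names from the cursor metadata), timed so the number is comparable to the read_sql timings in the question:
import time
import pandas as pd
import pyodbc

ch = pyodbc.connect('DSN=xyz;UID=xyz;PWD=xyz')
curs = ch.cursor()

start = time.time()
curs.execute("select * from large_table limit 100000")
cols = [d[0] for d in curs.description]  # column names from the cursor metadata

chunks = []
while True:
    rows = curs.fetchmany(1024)  # move 1024 rows per fetch loop
    if not rows:
        break
    chunks.append(pd.DataFrame.from_records([tuple(r) for r in rows], columns=cols))

data = pd.concat(chunks, ignore_index=True)
print(time.time() - start)
If the R and Python loops are still far apart with the same batch size, the difference lies in the driver rather than in pandas.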
Upvotes: 3