Reputation: 3781
I'm trying to migrate R code to Python 2.7 in order to compare the two. The first problem I hit is with the ODBC connection: R is much faster than Python, but since I'm a newbie in Python I'm not sure I'm using the right package.
In R I write:
ptm <- proc.time()
require(RODBC)
dbXX <- odbcDriverConnect('driver={SQL Server};
server=s001111;database=XX;trusted_connection=true')
rech <- sqlQuery(dbXX, "select top 10000 * from XX.dbo.table", as.is=T)
proc.time() - ptm
and I get:
> proc.time() - ptm
user system elapsed
2.47 0.11 2.87
I have downloaded Anaconda for Python 2.7 on Windows 7 64-bit. So in Spyder I write:
import time
import pyodbc
import pandas as pd

sql = 'select top 10000 * from XX.dbo.table'
cnn = pyodbc.connect('DRIVER={SQL Server};SERVER=s001111;DATABASE=XX;Trusted_Connection=yes')
start = time.time()
data = pd.read_sql(sql, cnn)
end = time.time()
print(end - start)
This takes 6.35 secs
I've also tried with pymssql:
import pymssql
conn = pymssql.connect(server='s001111', database='XX')
start = time.time()
data = pd.read_sql(sql, conn)
end = time.time()
print(end - start)
This takes 38.3 secs!
The real query needs to read a table whose dimensions are 220,000 rows by 353 columns, and apply a filter (with WHERE).
I only need to extract data from the db.
Is there a way to do it faster in Python 2.7?
I've found pyodbc-memory usage relation caused by SQL Server, but I guess that if it were an SQL Server problem it would behave the same way in R, wouldn't it?
I've also found IOPro, but it's not free!
At this point I was wondering if the problem was the ODBC connection or pandas itself, so I tried:
cur = conn.cursor()
start = time.time()
cur.execute(sql)
tabla = cur.fetchall()
end = time.time()
print(end - start)
But it took 29.29 secs.
So, still: how is it possible that R is so much faster than Python at retrieving data from a Microsoft SQL Server database?
Upvotes: 2
Views: 5431
Reputation: 5950
Both RODBC and pyodbc will probably spend most of their time transferring data from the database server to the local computer where you run your query. A very important parameter here is the number of rows moved from the database server to the local computer on each fetch loop. Both RODBC and pyodbc let you configure this parameter.
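For example, a minimal sketch (using the same placeholder DSN as the code below): in pyodbc the batch size is whatever you pass to fetchmany, or the cursor's standard DB-API arraysize attribute; in RODBC it is the rows_at_time argument.
import pyodbc

ch = pyodbc.connect('DSN=xyz;UID=xyz;PWD=xyz')
curs = ch.cursor()
curs.arraysize = 1024  # batch size used by fetchmany() when called without an argument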
Now, to compare apples with apples: I have used the approach below to compare RODBC and pyodbc against three different databases (not including SQL Server), and I have always found their performance to be comparable.
Here is the R code:
library(RODBC)
ch <- odbcConnect("DWN", uid="xyz", pwd="xyz", rows_at_time=1024)
ds <- sqlQuery(ch, "select * from large_table limit 100000")
And here you have the python code:
>>> import pyodbc
>>> ch = pyodbc.connect('DSN=xyz;UID=xyz;PWD=xyz')
>>> curs = ch.cursor()
>>> curs.execute("select * from large_table limit 100000")
>>> while True:
...     rows = curs.fetchmany(1024)
...     if not rows:
...         break
...
>>>
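If you want the result as a DataFrame, as in the question, here is a rough sketch of the same batched fetch (assuming pandas is installed; it reuses the placeholder DSN above and takes the column names from the cursor metadata), timed so the number is comparable to the read_sql timings in the question:
import time
import pandas as pd
import pyodbc

ch = pyodbc.connect('DSN=xyz;UID=xyz;PWD=xyz')
curs = ch.cursor()

start = time.time()
curs.execute("select * from large_table limit 100000")
cols = [d[0] for d in curs.description]  # column names from the cursor metadata

chunks = []
while True:
    rows = curs.fetchmany(1024)  # move 1024 rows per fetch loop
    if not rows:
        break
    chunks.append(pd.DataFrame.from_records([tuple(r) for r in rows], columns=cols))

data = pd.concat(chunks, ignore_index=True)
print(time.time() - start)
If the R and Python loops are still far apart with the same batch size, the difference lies in the driver rather than in pandas.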
Upvotes: 3