Reputation: 724
Is there a difference in execution time between these two commands:
import pandas as pd
df = pd.read_sql_query('SELECT * FROM TABLE', conn)
df = pd.read_sql_table('TABLE', conn)
Thank you for your help
Upvotes: 19
Views: 28473
Reputation: 821
I tried this countless times and, despite what I read above, I do not agree with most of either the process or the conclusions.
If you want to compare two methods, adding thick layers of SQLAlchemy or pandasSQL_builder (that is, pandas.io.sql.pandasSQL_builder, without so much as an import) and other such non-self-contained fragments is not helpful, to say the least. The only way to compare two methods without noise is to use them as cleanly as possible and, at the very least, under similar circumstances.
There is a saying about assumptions... Between assuming the difference is not noticeable and bringing up useless considerations about pd.read_sql_query, the point gets severely blurred. The only obvious consideration here is that if anyone is comparing pd.read_sql_query and pd.read_sql_table, it's the table, the whole table and nothing but the table. Invoking WHERE, JOIN and the like is just a waste of time.
Furthermore, the question explicitly asks for the difference between read_sql_table and read_sql_query with a SELECT * FROM table.
I ran this over and over again on SQLite, MariaDB and PostgreSQL. I use SQLAlchemy exclusively to create the engines, because pandas requires it. The data comes from the coffee-quality-database and I preloaded the file data/arabica_data_cleaned.csv into all three engines, to a table called arabica in a DB called coffee.
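For reference, a minimal sketch of how the table could have been preloaded from the CSV (this loading step is my reconstruction, not the author's exact code; the file path and table name come from the paragraph above):

import pandas as pd
from sqlalchemy import create_engine

# Load the cleaned Arabica coffee-quality data from CSV.
df = pd.read_csv('data/arabica_data_cleaned.csv')

# Write the same table into each backend so all engines hold identical data
# (shown here for SQLite only; repeat with the other engines).
engine = create_engine('sqlite:///coffee.db')
df.to_sql('arabica', engine, if_exists='replace', index=False)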
Here's a summarised version of my script:
import time
import pandas as pd
from sqlalchemy import create_engine

# One engine per backend; pandas needs SQLAlchemy connectables for read_sql_table.
sqlite_engine = create_engine('sqlite:///coffee.db', echo=False)
mariadb_engine = create_engine('mariadb+mariadbconnector://root:[email protected]:3306/coffee')
postgres_engine = create_engine('postgresql://postgres:[email protected]:5432/coffee')

for engine in [sqlite_engine, mariadb_engine, postgres_engine]:
    print(engine)

    # Time 100 full-table reads via a raw SELECT.
    print('\tpd.read_sql_query:')
    startTime = time.time()
    for i in range(100):
        pd.read_sql_query('SELECT * FROM arabica;', engine)
    print(f"\t[-- TIME --] {time.time()-startTime:.2f} sec\n")

    # Time 100 full-table reads via table reflection.
    print('\tpd.read_sql_table:')
    startTime = time.time()
    for i in range(100):
        pd.read_sql_table('arabica', engine)
    print(f"\t[-- TIME --] {time.time()-startTime:.2f} sec\n")
Here's a sample output:
Engine(sqlite:///coffee.db)
    pd.read_sql_query:
    [-- TIME --] 2.58 sec

    pd.read_sql_table:
    [-- TIME --] 3.60 sec

Engine(mariadb+mariadbconnector://root:***@127.0.0.1:3306/coffee)
    pd.read_sql_query:
    [-- TIME --] 2.84 sec

    pd.read_sql_table:
    [-- TIME --] 4.15 sec

Engine(postgresql://postgres:***@127.0.0.1:5432/coffee)
    pd.read_sql_query:
    [-- TIME --] 2.18 sec

    pd.read_sql_table:
    [-- TIME --] 4.01 sec
The above is one sample output, but I ran this over and over again, and the only observation is that in every single run, pd.read_sql_table ALWAYS takes longer than pd.read_sql_query. This sounds very counter-intuitive, but that is exactly why we isolate the issue and test before pouring knowledge here.
I haven't had the chance to run a proper statistical analysis on the results, but at first glance, I would risk stating that the differences are significant: both "columns" (the query and table timings) come back within close ranges from run to run, and are both quite distanced from each other. In some runs, table takes twice as long for some of the engines.
If/when I get the chance to run such an analysis, I will complement this answer with the results and matplotlib evidence.
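A minimal sketch of what such an analysis could look like (hypothetical, not the author's code): collect one timing per run instead of a single aggregate, then compare the two samples, e.g. with a Welch t-test from scipy.

import time
import pandas as pd
from scipy import stats
from sqlalchemy import create_engine

engine = create_engine('sqlite:///coffee.db')  # any of the three engines

def sample_timings(fn, runs=30):
    # One timing per run, so we get a distribution rather than a single total.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

query_t = sample_timings(lambda: pd.read_sql_query('SELECT * FROM arabica;', engine))
table_t = sample_timings(lambda: pd.read_sql_table('arabica', engine))

# Welch's t-test: are the mean timings significantly different?
print(stats.ttest_ind(query_t, table_t, equal_var=False))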
My initial idea was to investigate the suitability of SQL vs. MongoDB when tables reach thousands of columns. pdmongo.read_mongo (from the pdmongo package) devastates pd.read_sql_table — which performs very poorly against large tables — but falls short of pd.read_sql_query.
With around 900 columns, pd.read_sql_query outperforms pd.read_sql_table by 5 to 10 times!
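To reproduce the wide-table part of that comparison, a hypothetical setup along these lines could be used (the column count and names are illustrative, not the author's actual dataset):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///wide.db')

# Build an illustrative wide table: 1,000 rows x 900 numeric columns.
wide = pd.DataFrame(np.random.rand(1000, 900),
                    columns=[f'c{i}' for i in range(900)])
wide.to_sql('wide', engine, if_exists='replace', index=False)

# Same full-table read through both code paths, ready for timing.
pd.read_sql_query('SELECT * FROM wide;', engine)
pd.read_sql_table('wide', engine)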
Upvotes: 9
Reputation: 21
With read_sql_query you can add a WHERE clause, JOINs, etc., and thereby reduce the amount of data you move from the database into your data frame. That is better if you have a huge table and need only a small number of rows.
On the other hand, if your table is small, use read_sql_table and just manipulate the data frame in Python. It's more flexible than SQL.
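For example, a sketch of the two approaches (the table comes from the thread above; the points column is made up for illustration):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///coffee.db')

# Push the filtering into the database: only matching rows cross the wire.
filtered = pd.read_sql_query('SELECT * FROM arabica WHERE points > 85;', engine)

# Or pull the whole (small) table and filter in pandas instead.
whole = pd.read_sql_table('arabica', engine)
filtered = whole[whole['points'] > 85]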
Upvotes: 2
Reputation: 9
The main difference is obvious. With
df = pd.read_sql_query('SELECT * FROM TABLE', conn)
you use an SQL query that can be arbitrarily complex, and hence execution can become very time- and resource-consuming. With
df = pd.read_sql_table('TABLE', conn)
you download a whole table and can only specify columns, schema, etc.
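For instance, read_sql_table can narrow what is fetched only through its parameters, not through free-form SQL (the column names here are hypothetical):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///coffee.db')

# No WHERE or JOIN available: only parameters such as columns and index_col.
df = pd.read_sql_table('arabica', engine,
                       columns=['Species', 'Variety'])  # hypothetical column names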
Upvotes: 0
Reputation: 210972
I don't think you will notice this difference.
Here is the source code for both functions:
In [398]: pd.read_sql_query??
Signature: pd.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None)
Source:
def read_sql_query(sql, con, index_col=None, coerce_float=True, params=None,
                   parse_dates=None, chunksize=None):
    pandas_sql = pandasSQL_builder(con)
    return pandas_sql.read_query(
        sql, index_col=index_col, params=params, coerce_float=coerce_float,
        parse_dates=parse_dates, chunksize=chunksize)
and
In [399]: pd.read_sql_table??
Signature: pd.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)
Source:
def read_sql_table(table_name, con, schema=None, index_col=None,
                   coerce_float=True, parse_dates=None, columns=None,
                   chunksize=None):
    con = _engine_builder(con)
    if not _is_sqlalchemy_connectable(con):
        raise NotImplementedError("read_sql_table only supported for "
                                  "SQLAlchemy connectable.")
    import sqlalchemy
    from sqlalchemy.schema import MetaData
    meta = MetaData(con, schema=schema)
    try:
        meta.reflect(only=[table_name], views=True)
    except sqlalchemy.exc.InvalidRequestError:
        raise ValueError("Table %s not found" % table_name)

    pandas_sql = SQLDatabase(con, meta=meta)
    table = pandas_sql.read_table(
        table_name, index_col=index_col, coerce_float=coerce_float,
        parse_dates=parse_dates, columns=columns, chunksize=chunksize)

    if table is not None:
        return table
    else:
        raise ValueError("Table %s not found" % table_name, con)
NOTE: I have intentionally cut off the docstrings...
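As the source shows, read_sql_table first reflects the table's schema via SQLAlchemy; that extra round-trip is what read_sql_query skips. A minimal sketch of the reflection step in isolation (written in modern SQLAlchemy 1.4+/2.0 style, which differs slightly from the older source above):

from sqlalchemy import MetaData, create_engine

engine = create_engine('sqlite:///coffee.db')

# read_sql_table performs this schema reflection before reading any rows.
meta = MetaData()
meta.reflect(bind=engine, only=['arabica'])
print(meta.tables['arabica'].columns.keys())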
Upvotes: 2