trench
trench

Reputation: 5355

Adjusting a for loop into a list comprehension

I loop through a list of tables and dates from a database to gather data. Something like this:

df_list = []
for table in table_list:
    for date in required_date_range:
        query = 'SELECT * FROM {} WHERE row_date = {};'.format(table, date)
        df = pd.read_sql_query(sql=query, con=engine)
        df_list.append(df)

result = pd.concat(df_list)

Is there a way to put a loop like that into a list comprehension? Is it even worth it?

I found some example code from https://tomaugspurger.github.io/modern-4-performance.html

files = glob.glob('weather/*.csv')
weather_dfs = [pd.read_csv(fp, names=columns) for fp in files]
weather = pd.concat(weather_dfs)

It looks better and the charts show it performs better but I just can't seem to wrap my head around it when I try to adjust my own code.

Edit-

It seems to work if I make a list of the queries instead. Is there a way to get that initial for loop and .format into a list comprehension as well?

for table in table_list:
    for date in required_date_range:
        queries = ['SELECT * FROM {} WHERE row_date = {};'.format(table, date)]

dfs = [pd.read_sql_query(query, con=pg_engine) for query in queries]

Upvotes: 3

Views: 436

Answers (2)

root
root

Reputation: 33793

I don't think a list comprehension by itself would give you a significant performance boost. I mean, it might give you a slight performance increase compared to a loop, but I don't think it'd be significant relative to the other things that need to be done, e.g. querying the database, initializing the dataframe, concating.

What could potentially give you a performance boost is eliminating your inner loop by using the SQL IN operator:

SELECT * FROM table_name WHERE row_date IN (date1, date2, date3,...);

So, that would change your loop to something like:

df_list = []
for table in table_list:
    query = 'SELECT * FROM {} WHERE row_date IN ({});'.format(table, ','.join(date_range))
    df = pd.read_sql_query(sql=query, con=engine)
    df_list.append(df)

From there it's fairly straightforward to convert it to a comprehension:

query = 'SELECT * FROM {} WHERE row_date IN ({});'
dfs = (pd.read_sql_query(sql=query.format(table, ','.join(date_range)), con=engine) for table in table_list)
df = pd.concat(dfs)

If the columns from each table are identical and in the same order, you could even eliminate the table loop by using UNION ALL to build a single query along the lines of:

SELECT * FROM table1 WHERE row_date IN (date1, date2, date3,...)
UNION ALL
SELECT * FROM table2 WHERE row_date IN (date1, date2, date3,...)
UNION ALL
...

And then just do a single read_sql_query call:

df = pd.read_sql_query(sql=union_all_query, con=engine)

Upvotes: 4

piRSquared
piRSquared

Reputation: 294248

I think this should work

def q(table, date):
    query = 'SELECT * FROM {} WHERE row_date = {};'.format
    return pd.read_sql_query(sql=query(table, date), con=engine)

df_list = [q(table, date) for table in table_list for date in required_date_range]

Dmonstration
Note: I switched to returning just the query as this is a demonstration and I don't have your database connections.

table_list = ['table1', 'table2']
required_date_range = ['date1', 'date2']

def q(table, date):
    query = 'SELECT * FROM {} WHERE row_date = {};'.format
    return query(table, date)

df_list = [q(table, date) for table in table_list for date in required_date_range]

df_list

['SELECT * FROM table1 WHERE row_date = date1;',
 'SELECT * FROM table1 WHERE row_date = date2;',
 'SELECT * FROM table2 WHERE row_date = date1;',
 'SELECT * FROM table2 WHERE row_date = date2;']

Upvotes: 2

Related Questions