Sean McCarthy

Reputation: 5568

Can SQLAlchemy be used to manipulate Pandas dataframes in Python?

Rather than learn Pandas syntax, which would take some time, I'd like to use my existing SQL skills to manipulate Pandas dataframes in Python.

My data are already in Pandas dataframes for various reasons.

If someone knows the answer, can you show me a simple example to get me started? Something like the following, I'm guessing:

import sqlalchemy

query = """
    select 
        trunc(date, 'MM') as month
        , avg(case when hour_ending > 7 and hour_ending < 24 then volume else null end) as volume_peak
        , avg(case when hour_ending <= 7 or hour_ending = 24 then volume else null end) as volume_off_peak
    from pandas_df
    where date >= add_months(trunc(sysdate, 'DD'), -2)
    group by trunc(date, 'MM')
    order by trunc(date, 'MM');
"""

new_df = sqlalchemy.run_query(query)

Thanks! Sean

Upvotes: 0

Views: 421

Answers (2)

Parfait

Reputation: 107652

Meanwhile, consider learning pandas' conditional-logic, filter, and aggregation methods, which translate directly from your SQL skills:

  1. trunc(..., 'MM') ---> .dt.month
  2. CASE WHEN ---> df.loc or np.where()
  3. WHERE ---> []
  4. GROUP BY ---> .groupby()
  5. AVG ---> .agg('mean')

Here is an example:

Random Data (seeded for reproducibility)

import numpy as np
import pandas as pd
import datetime as dt
import time

epoch_time = int(time.time())

np.random.seed(101)
pandas_df = pd.DataFrame({'date': [dt.datetime.fromtimestamp(np.random.randint(1480000000, epoch_time)) for _ in range(50)],
                          'hour_ending': [np.random.randint(14) for _ in range(50)],
                          'volume': abs(np.random.randn(50)*100)})

# OUTPUT CSV AND IMPORT INTO DATABASE TO TEST RESULT
pandas_df.to_csv('Output.csv', index=False)

print(pandas_df.head(10))

#                  date  hour_ending      volume
# 0 2017-01-01 20:05:19           10   56.660415
# 1 2017-09-02 00:56:27            3   79.060800
# 2 2018-01-04 09:25:05            7   23.076240
# 3 2016-11-27 23:44:55            6  102.801241
# 4 2017-01-29 12:19:55            5   88.824230
# 5 2017-04-15 15:16:09            6  214.168659
# 6 2017-09-07 08:12:45            9   97.607635
# 7 2017-12-31 15:35:36           13  141.467249
# 8 2017-04-21 23:01:44           13  156.246854
# 9 2016-12-22 09:27:49            2   67.646662

Aggregation (calculates columns prior to summarizing)

# CALCULATE GROUP MONTH COLUMN
pandas_df['month'] = pandas_df['date'].dt.month

# CONDITIONAL LOGIC COLUMNS
# peak: 7 < hour_ending < 24; off-peak: the complement (hour_ending <= 7 or >= 24)
pandas_df.loc[pandas_df['hour_ending'].between(7, 24, inclusive='neither'), 'volume_peak'] = pandas_df['volume']
pandas_df.loc[~pandas_df['hour_ending'].between(7, 24, inclusive='neither'), 'volume_off_peak'] = pandas_df['volume']

print(pandas_df.head(10))
#                  date  hour_ending      volume  month  volume_peak  volume_off_peak
# 0 2017-01-01 20:05:19           10   56.660415      1    56.660415              NaN
# 1 2017-09-02 00:56:27            3   79.060800      9          NaN        79.060800
# 2 2018-01-04 09:25:05            7   23.076240      1          NaN        23.076240
# 3 2016-11-27 23:44:55            6  102.801241     11          NaN       102.801241
# 4 2017-01-29 12:19:55            5   88.824230      1          NaN        88.824230
# 5 2017-04-15 15:16:09            6  214.168659      4          NaN       214.168659
# 6 2017-09-07 08:12:45            9   97.607635      9    97.607635              NaN
# 7 2017-12-31 15:35:36           13  141.467249     12   141.467249              NaN
# 8 2017-04-21 23:01:44           13  156.246854      4   156.246854              NaN
# 9 2016-12-22 09:27:49            2   67.646662     12          NaN        67.646662

# WHERE AND GROUPBY
agg_df = pandas_df[pandas_df['date'] >= (dt.datetime.today() - dt.timedelta(days=60))]\
                   .groupby('month')[['volume_peak', 'volume_off_peak']].agg('mean')

print(agg_df)
#        volume_peak  volume_off_peak
# month                              
# 1        62.597999        23.076240
# 11       37.775000        17.075594
# 12      141.063694        29.986261

SQL (run in MS Access against the CSV exported by the code above)

SELECT Month(p.date) As month, 
       AVG(IIF(p.hour_ending >7 and p.hour_ending < 24, volume, NULL)) as volume_peak,
       AVG(IIF(p.hour_ending <=7 or p.hour_ending = 24, volume, NULL)) as volume_off_peak
FROM csv_data p
WHERE p.date >= DateAdd('m', -2, Date())
GROUP BY Month(p.date)

-- month         volume_peak     volume_off_peak
--     1    62.5979990645683    23.0762401295465
--    11    37.7750002748325    17.0755937444385
--    12    141.063693957234    29.9862605960166

Upvotes: 2

James

Reputation: 3709

The short answer is no, there is no direct way that I am aware of to manipulate a pandas DataFrame using SQLAlchemy.

If you really want to use SQL, I think the easiest approach would be to write a couple of helper functions that convert a pandas DataFrame to and from a SQL table using a lightweight database like SQLite.
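
For example, here is a minimal sketch of that round trip using pandas' built-in to_sql()/read_sql_query() with the standard-library sqlite3 driver. The run_query() helper and the pandas_df table name are placeholders for illustration, not an established API:

import sqlite3
import pandas as pd

def run_query(df, query, table='pandas_df'):
    # Push the DataFrame into a throwaway in-memory SQLite database,
    # run plain SQL against it, and return the result as a new DataFrame.
    conn = sqlite3.connect(':memory:')
    try:
        df.to_sql(table, conn, index=False)    # DataFrame -> SQL table
        return pd.read_sql_query(query, conn)  # SQL result -> DataFrame
    finally:
        conn.close()

new_df = run_query(pandas_df, "select avg(volume) as avg_volume from pandas_df")

Note that SQLite's dialect differs from Oracle's (no trunc() or add_months()), so a query like the one in the question would need to be rewritten with SQLite functions such as strftime().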

However, as a developer well-versed in SQL who recently had to deal with some data in a pandas DataFrame, I can strongly recommend just learning the pandas way of doing things. It's not as declarative as SQL: what would be a single query in SQL often has to be broken into a few steps in pandas. That said, I found that working with pandas gets easier quickly, and the learning process was beneficial and satisfying.

Upvotes: 1
