Sean McCarthy

Reputation: 5568

Can SQLAlchemy be used to manipulate Pandas dataframes in Python?

Rather than learn Pandas syntax, which would take some time, I'd like to use my existing SQL skills to manipulate Pandas dataframes in Python.

My data are already in Pandas dataframes for various reasons.

If someone knows the answer, can you show me a simple example to get me started? Something like the following, I'm guessing:

import sqlalchemy

query = """
    select 
        trunc(date, 'MM') as month
        , avg(case when hour_ending > 7 and hour_ending < 24 then volume else null end) as volume_peak
        , avg(case when hour_ending <= 7 or hour_ending = 24 then volume else null end) as volume_off_peak
    from pandas_df
    where date >= add_months(trunc(sysdate, 'DD'), -2)
    group by trunc(date, 'MM')
    order by trunc(date, 'MM');
"""

new_df = sqlalchemy.run_query(query)

Thanks! Sean

Upvotes: 0

Views: 421

Answers (2)

Parfait

Reputation: 107652

Meanwhile, consider learning pandas' conditional-logic, filter, and aggregation methods, which translate directly from your SQL skills:

  1. trunc(..., 'MM') ---> .dt.month
  2. CASE WHEN ---> df.loc or np.where()
  3. WHERE ---> []
  4. GROUP BY ---> .groupby()
  5. AVG ---> .agg('mean')

Here is an example:

Random Data (seeded for reproducibility)

import numpy as np
import pandas as pd
import datetime as dt
import time

epoch_time = int(time.time())

np.random.seed(101)
pandas_df = pd.DataFrame({'date': [dt.datetime.fromtimestamp(np.random.randint(1480000000, epoch_time)) for _ in range(50)],
                          'hour_ending': [np.random.randint(14) for _ in range(50)],
                          'volume': abs(np.random.randn(50)*100)})

# OUTPUT CSV AND IMPORT INTO DATABASE TO TEST RESULT
pandas_df.to_csv('Output.csv', index=False)

print(pandas_df.head(10))

#                  date  hour_ending      volume
# 0 2017-01-01 20:05:19           10   56.660415
# 1 2017-09-02 00:56:27            3   79.060800
# 2 2018-01-04 09:25:05            7   23.076240
# 3 2016-11-27 23:44:55            6  102.801241
# 4 2017-01-29 12:19:55            5   88.824230
# 5 2017-04-15 15:16:09            6  214.168659
# 6 2017-09-07 08:12:45            9   97.607635
# 7 2017-12-31 15:35:36           13  141.467249
# 8 2017-04-21 23:01:44           13  156.246854
# 9 2016-12-22 09:27:49            2   67.646662

Aggregation (calculates columns prior to summarizing)

# CALCULATE GROUP MONTH COLUMN
pandas_df['month'] = pandas_df['date'].dt.month

# CONDITIONAL LOGIC COLUMNS
# peak: 7 < hour_ending < 24; off-peak: the complement (hour_ending <= 7 or >= 24)
pandas_df.loc[pandas_df['hour_ending'].between(7, 24, inclusive='neither'), 'volume_peak'] = pandas_df['volume']
pandas_df.loc[~pandas_df['hour_ending'].between(7, 24, inclusive='neither'), 'volume_off_peak'] = pandas_df['volume']

print(pandas_df.head(10))
#                  date  hour_ending      volume  month  volume_peak  volume_off_peak
# 0 2017-01-01 20:05:19           10   56.660415      1    56.660415              NaN
# 1 2017-09-02 00:56:27            3   79.060800      9          NaN        79.060800
# 2 2018-01-04 09:25:05            7   23.076240      1          NaN        23.076240
# 3 2016-11-27 23:44:55            6  102.801241     11          NaN       102.801241
# 4 2017-01-29 12:19:55            5   88.824230      1          NaN        88.824230
# 5 2017-04-15 15:16:09            6  214.168659      4          NaN       214.168659
# 6 2017-09-07 08:12:45            9   97.607635      9    97.607635              NaN
# 7 2017-12-31 15:35:36           13  141.467249     12   141.467249              NaN
# 8 2017-04-21 23:01:44           13  156.246854      4   156.246854              NaN
# 9 2016-12-22 09:27:49            2   67.646662     12          NaN        67.646662

# WHERE AND GROUPBY
agg_df = pandas_df[pandas_df['date'] >= (dt.datetime.today() - dt.timedelta(days=60))]\
                   .groupby('month')[['volume_peak', 'volume_off_peak']].agg('mean')

print(agg_df)
#        volume_peak  volume_off_peak
# month                              
# 1        62.597999        23.076240
# 11       37.775000        17.075594
# 12      141.063694        29.986261

SQL (run in MS Access against the CSV exported by the code above)

SELECT Month(p.date) As month, 
       AVG(IIF(p.hour_ending >7 and p.hour_ending < 24, volume, NULL)) as volume_peak,
       AVG(IIF(p.hour_ending <=7 or p.hour_ending = 24, volume, NULL)) as volume_off_peak
FROM csv_data p
WHERE p.date >= DateAdd('m', -2, Date())
GROUP BY Month(p.date)

-- month         volume_peak     volume_off_peak
--     1    62.5979990645683    23.0762401295465
--    11    37.7750002748325    17.0755937444385
--    12    141.063693957234    29.9862605960166

Upvotes: 2

James

Reputation: 3709

The short answer is no, there is no direct way that I am aware of to manipulate a pandas DataFrame using SQLAlchemy.

If you really want to use SQL, I think the easiest approach would be to write a couple of helper functions that convert a pandas DataFrame to and from a SQL table using a lightweight database like SQLite.
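
For example, here is a minimal sketch of that round trip using pandas' built-in to_sql()/read_sql_query() with the standard-library sqlite3 driver. The run_query() helper and the pandas_df table name are placeholders for illustration, not an established API:

import sqlite3
import pandas as pd

def run_query(df, query, table='pandas_df'):
    # Push the DataFrame into a throwaway in-memory SQLite database,
    # run plain SQL against it, and return the result as a new DataFrame.
    conn = sqlite3.connect(':memory:')
    try:
        df.to_sql(table, conn, index=False)    # DataFrame -> SQL table
        return pd.read_sql_query(query, conn)  # SQL result -> DataFrame
    finally:
        conn.close()

new_df = run_query(pandas_df, "select avg(volume) as avg_volume from pandas_df")

Note that SQLite's dialect differs from Oracle's (no trunc() or add_months()), so a query like the one in the question would need to be rewritten with SQLite functions such as strftime().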

However, as a developer well-versed in SQL who recently had to deal with some data in a pandas DataFrame, I can strongly recommend just learning the pandas way of doing things. It's not as declarative as SQL: what would be a single query in SQL often has to be broken into a few steps in pandas. That said, I found that working with pandas gets easier quickly, and the learning process was beneficial and satisfying.

Upvotes: 1
