Reputation: 5568
Rather than learn Pandas syntax, which would take some time, I'd like to use my existing SQL skills to manipulate Pandas dataframes in Python.
My data are already in Pandas dataframes for various reasons.
If someone knows the answer, can you show me a simple example to get me started? Something like the following, I'm guessing:
import sqlalchemy
query = """
select
trunc(date, 'MM') as month
, avg(case when hour_ending > 7 and hour_ending < 24 then volume else null end) as volume_peak
, avg(case when hour_ending <= 7 or hour_ending = 24 then volume else null end) as volume_off_peak
from pandas_df
where date >= add_months(trunc(sysdate, 'DD'), -2)
group by trunc(date, 'MM')
order by trunc(date, 'MM');
"""
new_df = sqlalchemy.run_query(query)
Thanks! Sean
Upvotes: 0
Views: 421
Reputation: 107652
Meanwhile, consider learning pandas' conditional-logic, filter, and aggregation methods, which translate directly from your SQL skills:
trunc(..., 'MM')  --->  .dt.month
CASE WHEN         --->  df.loc or np.where()
WHERE             --->  boolean indexing with []
GROUP BY          --->  .groupby()
AVG               --->  .agg('mean')
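To make the CASE WHEN mapping concrete, here is a minimal sketch on a tiny made-up frame (the values and column names are illustrative only, chosen to echo the question's schema):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (values are made up for the demo)
df = pd.DataFrame({'hour_ending': [3, 10, 24], 'volume': [50.0, 80.0, 20.0]})

# SQL: CASE WHEN hour_ending > 7 AND hour_ending < 24 THEN volume ELSE NULL END
df['volume_peak'] = np.where((df['hour_ending'] > 7) & (df['hour_ending'] < 24),
                             df['volume'], np.nan)
print(df)
```

np.where plays the role of CASE WHEN: it vectorizes the condition over the whole column and fills the non-matching rows with NaN, which pandas aggregations then skip just as SQL's AVG skips NULL.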
Here is an example.
Random Data (seeded for reproducibility)
import numpy as np
import pandas as pd
import datetime as dt
import time
epoch_time = int(time.time())
np.random.seed(101)
pandas_df = pd.DataFrame({'date': [dt.datetime.fromtimestamp(np.random.randint(1480000000, epoch_time))
                                   for _ in range(50)],
                          'hour_ending': [np.random.randint(14) for _ in range(50)],
                          'volume': abs(np.random.randn(50)*100)})
# OUTPUT CSV AND IMPORT INTO DATABASE TO TEST RESULT
pandas_df.to_csv('Output.csv', index=False)
print(pandas_df.head(10))
# date hour_ending volume
# 0 2017-01-01 20:05:19 10 56.660415
# 1 2017-09-02 00:56:27 3 79.060800
# 2 2018-01-04 09:25:05 7 23.076240
# 3 2016-11-27 23:44:55 6 102.801241
# 4 2017-01-29 12:19:55 5 88.824230
# 5 2017-04-15 15:16:09 6 214.168659
# 6 2017-09-07 08:12:45 9 97.607635
# 7 2017-12-31 15:35:36 13 141.467249
# 8 2017-04-21 23:01:44 13 156.246854
# 9 2016-12-22 09:27:49 2 67.646662
Aggregation (calculates columns prior to summarizing)
# CALCULATE GROUP MONTH COLUMN
pandas_df['month'] = pandas_df['date'].dt.month
# CONDITIONAL LOGIC COLUMNS
# PEAK: hour_ending strictly between 7 and 24
# (pandas >= 1.3 spells the open interval inclusive='neither'; older versions used inclusive=False)
pandas_df.loc[pandas_df['hour_ending'].between(7, 24, inclusive='neither'), 'volume_peak'] = pandas_df['volume']
pandas_df.loc[~pandas_df['hour_ending'].between(7, 24, inclusive='neither'), 'volume_off_peak'] = pandas_df['volume']
print(pandas_df.head(10))
# date hour_ending volume month volume_peak volume_off_peak
# 0 2017-01-01 20:05:19 10 56.660415 1 56.660415 NaN
# 1 2017-09-02 00:56:27 3 79.060800 9 NaN 79.060800
# 2 2018-01-04 09:25:05 7 23.076240 1 NaN 23.076240
# 3 2016-11-27 23:44:55 6 102.801241 11 NaN 102.801241
# 4 2017-01-29 12:19:55 5 88.824230 1 NaN 88.824230
# 5 2017-04-15 15:16:09 6 214.168659 4 NaN 214.168659
# 6 2017-09-07 08:12:45 9 97.607635 9 97.607635 NaN
# 7 2017-12-31 15:35:36 13 141.467249 12 141.467249 NaN
# 8 2017-04-21 23:01:44 13 156.246854 4 156.246854 NaN
# 9 2016-12-22 09:27:49 2 67.646662 12 NaN 67.646662
# WHERE AND GROUPBY
agg_df = pandas_df[pandas_df['date'] >= (dt.datetime.today() - dt.timedelta(days=60))]\
.groupby('month')[['volume_peak', 'volume_off_peak']].agg('mean')
print(agg_df)
# volume_peak volume_off_peak
# month
# 1 62.597999 23.076240
# 11 37.775000 17.075594
# 12 141.063694 29.986261
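As a hedged aside, the same calculate-then-summarize pipeline can also be written as one chained expression with .assign() and np.where(), avoiding the intermediate mutation of pandas_df. The data here is rebuilt with illustrative dates so the sketch is self-contained; it will not reproduce the seeded output above:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for pandas_df (dates and draws differ from the seeded example)
np.random.seed(101)
pandas_df = pd.DataFrame({
    'date': pd.date_range('2017-01-01', periods=50, freq='12h'),
    'hour_ending': np.random.randint(1, 25, size=50),
    'volume': np.abs(np.random.randn(50) * 100),
})

# For integer hours, between(8, 23) (inclusive) equals "strictly between 7 and 24"
agg_df = (
    pandas_df
    .assign(month=pandas_df['date'].dt.month,
            volume_peak=np.where(pandas_df['hour_ending'].between(8, 23),
                                 pandas_df['volume'], np.nan),
            volume_off_peak=np.where(~pandas_df['hour_ending'].between(8, 23),
                                     pandas_df['volume'], np.nan))
    .groupby('month')[['volume_peak', 'volume_off_peak']]
    .mean()
)
print(agg_df)
```

The chained form mirrors the SQL statement's shape more closely: the .assign() call is the SELECT-list expressions, .groupby() is GROUP BY, and .mean() is the AVG that skips NaN the way SQL skips NULL.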
SQL (using MS Access, which imported the CSV produced by the code above)
SELECT Month(p.date) As month,
AVG(IIF(p.hour_ending >7 and p.hour_ending < 24, volume, NULL)) as volume_peak,
AVG(IIF(p.hour_ending <=7 or p.hour_ending = 24, volume, NULL)) as volume_off_peak
FROM csv_data p
WHERE p.Date >= DateAdd('m', -2, Date())
GROUP BY Month(p.date)
-- month peak off_peak
-- 1 62.5979990645683 23.0762401295465
-- 11 37.7750002748325 17.0755937444385
-- 12 141.063693957234 29.9862605960166
Upvotes: 2
Reputation: 3709
The short answer is no: there is no direct way that I am aware of to manipulate a pandas DataFrame using SQLAlchemy.
If you really want to use SQL, I think the easiest approach would be to write a couple of helper functions that copy a pandas DataFrame to and from a SQL table in an embedded database like SQLite.
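A minimal sketch of that helper-function approach, using the standard library's sqlite3 module together with pandas' own to_sql/read_sql (the function name query_df and the table name pandas_df are my own, not an established API):

```python
import sqlite3

import pandas as pd


def query_df(df: pd.DataFrame, sql: str, table: str = 'pandas_df') -> pd.DataFrame:
    """Copy df into an in-memory SQLite table, run sql against it, return the result."""
    conn = sqlite3.connect(':memory:')
    try:
        df.to_sql(table, conn, index=False)
        return pd.read_sql(sql, conn)
    finally:
        conn.close()


# Usage: an aggregate query against an in-memory copy of the frame
df = pd.DataFrame({'hour_ending': [3, 10, 24], 'volume': [50.0, 80.0, 20.0]})
result = query_df(df, "SELECT AVG(volume) AS avg_volume FROM pandas_df WHERE hour_ending > 7")
print(result)
```

Note that every call pays the cost of serializing the whole frame into SQLite and reading the result back out, which is another reason the native pandas operations are worth learning for anything beyond occasional one-off queries.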
However, as a developer well-versed in SQL who recently had to deal with some data in a pandas DataFrame, I can strongly recommend just learning the pandas way of doing things. It's not as declarative as SQL: often what would be a single query in SQL has to be broken up into a few steps in pandas. However, I found that working with pandas quickly gets easier, and the learning process was beneficial and satisfying.
Upvotes: 1