Python_Learner
Python_Learner

Reputation: 1637

Compare Same Days of Different Years in Pandas

I'm a newbie still trying to figure out Pandas - split/apply/combine is just starting to make sense but I'm not there yet.

I'm trying to get the historical high and low sales call numbers from the same day in previous years (2013-2015), and then have another column that tells me what days the 2016 calls were higher than previous highs or lower than previous lows and what the values were.

Here's my attempt so far:

df = pd.read_csv('filename.csv')
df['Date']  = pd.to_datetime(df['Date'])
df = df[(df['Date']>= '01/01/2013') & (df['Date'] <= '12/31/2015')]
df['Month']     = df.Date.dt.month
df['Day']   = df.Date.dt.day

dfMAX = df[df['Element'] == "MAX"]
dfMAX = (dfMAX.groupby([dfMAX.Date.dt.month, dfMAX.Date.dt.day, 'Element'])
            [['Data_Value']]
            .agg(['max']))

This gets me the max values per day and I know I could repeat this for the min values. I'm struggling to know how to put it back together with the multi-index and then how to get the 2016 values that were either higher or lower than the min/max already recorded.

I'm not worried about leap year data, Feb 29th can be dropped/ignored for this.

The output would be something like this:

Day of Year   Min   Max     2016
    1/1       50   1900     
    1/2       23   2100     2102
    1/3       90   1800      85
    1/4       89   1750
    1/5       50   2309      45
    1/6       44   5600     5649

I've posted a csv file here: https://drive.google.com/open?id=0B4xdnV0LFZI1dUE3ZFBxdWFQOGc

Thank you for your help, Me

Upvotes: 2

Views: 5235

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

I would tackle this using the follow steps:

  1. Make sure date is a datetime column dtype. Then, create a new column in your dataframe called Day_Of_Year using .dt.dayofyear.
  2. Split your dataframe into two dataframes, one 2013 thru 2015 and 2016.
  3. Take your 2013 thru 2015 dataframe and use groupby with .agg of min and max to get your min and max values for 2013 thru 2015 to create a dataframe daily records.
  4. Next merge this new daily records dataframe with 2016 dataframe using pd.merge, you can set the index on 2016 to date and the merge on indexes using left_index=True and right_index=True.
  5. Lastly, I would then using boolean indexing to select only those records where the 2016 value is outside of the min or the max columns in your merged dataframe.

You should get something like this with your data:

           min   max       Date Element  Value
DayofYear                                     
1          545  1812 2016-01-01     MAX   1887
3          108  1815 2016-01-03     MAX   1906
4          496  1618 2016-01-04     MAX   1701
6          455  1864 2016-01-06     MIN    169
8          511  1771 2016-01-08     MIN    232

See my spoiler code hover below.

df = pd.read_csv('Downloads/Day_Over_Day.csv', parse_dates=['Date'])
1. df['DayofYear']=df['Date'].dt.dayofyear
2a. df_13_15 = df.query('Date < "2016-01-01"')
2b. df_2016 = df.query('Date >= "2016-01-01"')
3. & 4. df_Overlay = pd.merge(df_13_15.groupby('DayofYear') ['Value'].agg(['min','max']),df_2016.set_index('DayofYear'), left_index=True, right_index=True)
5. print(df_Overlay[(df_Overlay['Value'] < df_Overlay['min']) | (df_Overlay['Value'] > df_Overlay['max'])].head())

Upvotes: 3

Related Questions