Plug4
Plug4

Reputation: 3938

Pandas: De-seasonalizing time-series data

I have the following dataframe df:

[Out]:

                     VOL
2011-04-01 09:30:00  11297
2011-04-01 09:30:10  6526
2011-04-01 09:30:20  14021
2011-04-01 09:30:30  19472
2011-04-01 09:30:40  7602
...
2011-04-29 15:59:30  79855
2011-04-29 15:59:40  83050
2011-04-29 15:59:50  602014

This df consist of volume observations at every 10 second for 22 non-consecutive days. I want to DE-seasonalized my time-series by dividing each observations by the average volume of their respective 5 minute time interval. To do so, I need to take the time-series average of volume at every 5 minutes across the 22 days. So I would end up with a time-series of averages at every 5 minutes 9:30:00 - 9:35:00; 9:35:00 - 9:40:00; 9:40:00 - 9:45:00 ... until 16:00:00. The average for the interval 9:30:00 - 9:35:00 is the average of volume for this time interval across all 22 days (i.e. So the average between 9:30:00 to 9:35:00 is the total volume between 9:30:00 to 9:35:00 on (day 1 + day 2 + day 3 ... day 22) / 22 . Does it makes sense?). I would then divide each observations in df that are between 9:30:00 - 9:35:00 by the average of this time interval.

Is there a package in Python / Pandas that can do this?

Upvotes: 1

Views: 1204

Answers (1)

Zero
Zero

Reputation: 76947

Edited answer:

date_times = pd.date_range(datetime.datetime(2011, 4, 1, 9, 30),
                           datetime.datetime(2011, 4, 16, 0, 0),
                           freq='10s')
VOL = np.random.sample(date_times.size) * 10000.0

df = pd.DataFrame(data={'VOL': VOL,'time':date_times}, index=date_times)
df['h'] = df.index.hour
df['m'] = df.index.minute
df1 = df.resample('5Min', how={'VOL': np.mean})
times = pd.to_datetime(df1.index)
df2 = df1.groupby([times.hour,times.minute]).VOL.mean().reset_index()
df2.columns = ['h','m','VOL']
df.merge(df2,on=['h','m'])
df_norm = df.merge(df2,on=['h','m'])
df_norm['norm'] = df_norm['VOL_x']/df_norm['VOL_y']

** Older answer (keeping it temporarily)

Use resample function

df.resample('5Min', how={'VOL': np.mean})

eg:

date_times = pd.date_range(datetime.datetime(2011, 4, 1, 9, 30),
                           datetime.datetime(2011, 4, 16, 0, 0),
                           freq='10s')
VOL = np.random.sample(date_times.size) * 10000.0

df = pd.DataFrame(data={'VOL': VOL}, index=date_times)
df.resample('5Min', how={'VOL': np.mean})

Upvotes: 4

Related Questions