Reputation: 1736
In Python 3.5, Pandas 20, say I have a one year periodic time serie :
import pandas as pd
import numpy as np
start_date = pd.to_datetime("2015-01-01T01:00:00.000Z", infer_datetime_format=True)
end_date = pd.to_datetime("2015-12-31T23:00:00.000Z", infer_datetime_format=True)
index = pd.DatetimeIndex(start=start_date,
freq="60min",
end=end_date)
time = np.array((index - start_date)/ np.timedelta64(1, 'h'), dtype=int)
df = pd.DataFrame(index=index)
df["foo"] = np.sin( 2 * np.pi * time / len(time))
df.plot()
I want to do some periodic extrapolation of the time serie for a new index. I.e with :
new_start_date = pd.to_datetime("2017-01-01T01:00:00.000Z", infer_datetime_format=True)
new_end_date = pd.to_datetime("2019-12-31T23:00:00.000Z", infer_datetime_format=True)
new_index = pd.DatetimeIndex(start=new_start_date,
freq="60min",
end=new_end_date)
I would like to use some kind of extrapolate_periodic
method to get:
# DO NOT RUN
new_df = df.extrapolate_periodic(index=new_index)
# END DO NOT RUN
new_df.plot()
What is the best way do such a thing in pandas?
How can I define a periodicity and get data from a new index easily?
Upvotes: 0
Views: 597
Reputation: 1736
Here is some code I've used to solve my problem. The asumption is that the initial serie corresponds to a period of data.
def extrapolate_periodic(df, new_index):
index = df.index
start_date = np.min(index)
end_date = np.max(index)
period = np.array((end_date - start_date) / np.timedelta64(1, 'h'), dtype=int)
time = np.array((new_index - start_date)/ np.timedelta64(1, 'h'), dtype=int)
new_df = pd.DataFrame(index=new_index)
for col in list(df.columns):
new_df[col] = np.array(df[col].iloc[time % period])
return new_df
Upvotes: 1
Reputation: 4744
I think I have what you are looking for, though it is not a simple pandas method.
Carrying on directly from where you left off,
def extrapolate_periodic(df, new_index):
df_right = df.groupby([df.index.dayofyear, df.index.hour]).mean()
df_left = pd.DataFrame({'new_index': new_index}).set_index('new_index')
df_left = df_left.assign(dayofyear=lambda x: x.index.dayofyear,
hour=lambda x: x.index.hour)
df = (pd.merge(df_left, df_right, left_on=['dayofyear', 'hour'],
right_index=True, suffixes=('', '_y'))
.drop(['dayofyear', 'hour'], axis=1))
return df.sort_index()
new_df = extrapolate_periodic(df, new_index)
# or as a method style
# new_df = df.pipe(extrapolate_periodic, new_index)
new_df.plot()
If you have more that a years worth of data it will take the mean of each duplicated day-hour. Here mean
could be changed for last
if you wanted just the most recent reading.
This will not work if you do not have a full years worth of data but you could fix this by adding in a reindex to complete the year and then using interpolate with a polynomial feature to fill in the missing foo
column.
Upvotes: 1