Reputation: 258
Trying to get an understanding how to "standardize" time series data.
Below is example df:
string = np.random.choice(np.array([x for x in 'abcde']*100))
visits = np.random.poisson(1000, size=500)
date = pd.Series(pd.date_range("2020", freq="D", periods=500))
df = pd.DataFrame({'string': string, 'visits': visits, 'date': date})
I would like to be able to add a column for any time range selected. Two examples would be:
df[Avg_visits_month] = sum of visits in each month/ total US business days in that month
df[Avg_visits_quarter]=sum of visits in each quarter/total US business days in that quarter
Attempts have been made using resample (cut off days in first month), groupby(had trouble getting count of bus_days), transform(could get count of num visits per time period, but not the bus day count), numpy.busday_count. I just couldn't get there.
Upvotes: 0
Views: 50
Reputation: 23
More features will get you through this.
df['day_of_week'] = df.date.weekday() # will return 1 - 7
This makes
df['avg_visits_month'] = len(df) / len(df[(df.day_of_week > 1) & (df.day_of_week < 6)])
Note: I'm assuming each record is a visit in this instance which I think is implied by your initial structure, but unsure.
As a semantic point, I do not think I would call this 'standardization' as much as feature development. Perhaps this is a over-focus on normalization concepts on my part, though.
There are a couple of other ways to do this, as you have described, but this is one way.
Upvotes: 0
Reputation: 153510
IIUC, let's use pd.Period
and pd.bdate_range
:
df['avg_visits_month'] = df['visits'] / df['date'].dt.to_period('M').apply(lambda x: len(pd.bdate_range(x.start_time, x.end_time)))
df['avg_visits_quarter'] = df['visits'] / df['date'].dt.to_period('Q').apply(lambda x: len(pd.bdate_range(x.start_time, x.end_time)))
print(df)
Output:
string visits date avg_visits_month avg_visits_quarter
0 b 997 2020-01-01 43.347826 15.338462
1 b 967 2020-01-02 42.043478 14.876923
2 b 982 2020-01-03 42.695652 15.107692
3 b 978 2020-01-04 42.521739 15.046154
4 b 986 2020-01-05 42.869565 15.169231
.. ... ... ... ... ...
495 b 1018 2021-05-10 48.476190 15.661538
496 b 970 2021-05-11 46.190476 14.923077
497 b 1042 2021-05-12 49.619048 16.030769
498 b 978 2021-05-13 46.571429 15.046154
499 b 1058 2021-05-14 50.380952 16.276923
Upvotes: 2