Reputation: 579
I have a series of dates in the future. I would like to use an assumption about the standard deviation and mean of a yet-to-be-scheduled event to 'forecast' the probability of that event falling on any given day. Say I have a Pandas DF with a min/max date of 1/8/16 and 2/3/16. I am currently just assigning an equal probability to all days (.037...).
I have it in a dataframe that looks like this (I have filled in the desired Standard_dev_assisted_probability manually):
Poss_Date Equal_probability Standard_dev_assisted_probability
1/8/2016 0.037037 min date in poss date range
1/9/2016 0.037037
1/10/2016 0.037037
1/11/2016 0.037037 -1 std dev / two thirds border
1/12/2016 0.037037
1/13/2016 0.037037
1/14/2016 0.037037
1/15/2016 0.037037
1/16/2016 0.037037
1/17/2016 0.037037
... ...
1/22/2016 0.037037 mean / peak of distribution
... ...
2/1/2016 0.037037 +~1 std dev
2/3/2016 0.037037 max date in poss range
If we assume that the 'mean' of the future distribution is 1/22/16, and the standard deviation is 11 days...
Is there a way to plug those into the Pandas DF and have it spit back out a column with probability? Obviously, about 68% (roughly two thirds) of the probability should then be allocated within +/- 11 days of 1/22, following a normal distribution, etc.
I'm imagining that in pseudocode it will be something like:
df['Probability']=df.applystandarddev(column=dates,mean=1/22,stddv=11)
If we don't need to 'account' for the abbreviated period of time past the mean, great. Obviously there is more time before the mean than after, but I figure that's part of the statistics game that the libraries handle, etc.
Upvotes: 4
Views: 1118
Reputation: 8270
By evaluating the CDF of a given distribution at the end of the day and at the beginning of the day and taking the difference, we can find the probability that the event will occur during that day.
Here is an example with a normal distribution.
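from datetime import datetime
from scipy.stats.distributions import norm

def prob_distribution(day, mean_day, std):
    # z-scores at the start and end of the day (std is measured in days)
    start_z = float((day - mean_day).days) / std
    end_z = float((day - mean_day).days + 1) / std
    # probability mass of the normal distribution that falls within that day
    return norm.cdf(end_z) - norm.cdf(start_z)

# Poss_Date must hold datetimes, e.g. df['Poss_Date'] = pd.to_datetime(df['Poss_Date'])
df['Prob'] = df['Poss_Date'].apply(lambda day: prob_distribution(day, datetime(2016,2,1), 10))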
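The same CDF-difference idea can also be applied to the whole column at once. The following is only a sketch under a few assumptions: it plugs in the mean (1/22/16) and standard deviation (11 days) from the question rather than the example values above, it builds the date range itself, and it treats each day as the interval from that date to the next. The `Prob_normalized` column is a hypothetical extra that rescales the values so they sum to 1 over the listed dates, which may or may not be what you want for the truncated range.

```python
import pandas as pd
from scipy.stats import norm

# Possible dates from the question: 1/8/2016 through 2/3/2016 (27 days)
df = pd.DataFrame({"Poss_Date": pd.date_range("2016-01-08", "2016-02-03", freq="D")})

mean_day = pd.Timestamp("2016-01-22")  # mean assumed in the question
std_days = 11.0                        # standard deviation assumed in the question

# Whole-day offset of each date from the mean
offset = (df["Poss_Date"] - mean_day).dt.days

# Probability mass of the normal distribution between the start and end of each day
df["Prob"] = norm.cdf((offset + 1) / std_days) - norm.cdf(offset / std_days)

# Assumption: rescale so the probabilities over the listed dates sum to 1
df["Prob_normalized"] = df["Prob"] / df["Prob"].sum()
```

Without the rescaling step, the `Prob` column sums to less than 1, because part of the normal distribution's mass lies outside the 1/8 to 2/3 window.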
Upvotes: 2