10mjg

Reputation: 579

Input a mean and standard deviation to apply probability distribution across DataFrame Pandas Python

I have a series of dates in the future. I would like to use an assumption about the standard deviation and mean of a yet-to-be-scheduled event to 'forecast' the probability of that event falling on any given day. Say I have a Pandas DataFrame with a min/max date of 1/8/16 and 2/3/16. I am currently just assigning an equal probability to every day (0.037...).

I have it in a dataframe that looks like this (I have filled in the desired Standard_dev_assisted_probability manually):

Poss_Date   Equal_probability  Standard_dev_assisted_probability

1/8/2016    0.037037            min date in poss date range
1/9/2016    0.037037
1/10/2016   0.037037
1/11/2016   0.037037            -1 std dev / two-thirds border
1/12/2016   0.037037
1/13/2016   0.037037
1/14/2016   0.037037
1/15/2016   0.037037
1/16/2016   0.037037
1/17/2016   0.037037
...         ...
1/22/2016   0.037037            mean / peak of distribution
...         ...
2/1/2016    0.037037            ~+1 std dev
2/3/2016    0.037037            max date in poss range

If we assume that the 'mean' of the future distribution is 1/22/16 and the standard deviation is 11 days...

Is there a way to plug those into the Pandas DataFrame and have it return a column of probabilities? Roughly 68% (about two thirds) of the probability should then be allocated within +/- 11 days of 1/22, following a normal distribution.

I'm imagining that in pseudocode it will be something like:

df['Probability']=df.applystandarddev(column=dates,mean=1/22,stddv=11)

If we don't need to 'account' for the abbreviated period of time past the mean, great. Obviously there is more time before the mean than after, but I figure that's part of the statistics game that the libraries handle, etc.

Upvotes: 4

Views: 1118

Answers (1)

David Maust

Reputation: 8270

By evaluating the CDF of the chosen distribution at the end of a day and at the beginning of that day, and taking the difference, we get the probability that the event will occur during that day.
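For a normal distribution this can be written out as follows. The symbols are illustrative and not part of the original answer: $d$ is the start of the day, $\mu$ the mean date and $\sigma$ the standard deviation, all measured in days, and $\Phi$ is the standard normal CDF.

```latex
P(\text{event on day } d)
  = \Phi\!\left(\frac{d + 1 - \mu}{\sigma}\right)
  - \Phi\!\left(\frac{d - \mu}{\sigma}\right)
```

The first term is the CDF at the end of the day and the second is the CDF at its beginning.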

Here is an example with a normal distribution.

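from datetime import datetime

from scipy.stats import norm

def prob_distribution(day, mean_day, std):
    # z-scores for the start and the end of the day, relative to the mean date
    start_z = float((day - mean_day).days) / std
    end_z = float((day - mean_day).days + 1) / std
    # probability of the day = CDF at end of day minus CDF at start of day
    return norm.cdf(end_z) - norm.cdf(start_z)

# Assumes df['Poss_Date'] already holds datetime values (e.g. via pd.to_datetime).
# Substitute your own mean date and standard deviation (the question uses 1/22/2016 and 11 days).
df['Prob'] = df['Poss_Date'].apply(lambda day: prob_distribution(day, datetime(2016,2,1), 10))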
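The same idea can also be sketched in vectorized form, using the question's own numbers (dates from 1/8/2016 to 2/3/2016, mean 1/22/2016, standard deviation 11 days). The column name `Prob_normalized` and the rescaling step are assumptions added for illustration: they only make sense if the event is certain to fall inside the listed range, which effectively turns the curve into a truncated normal.

```python
import pandas as pd
from scipy.stats import norm

# Date range, mean and standard deviation taken from the question.
dates = pd.date_range("2016-01-08", "2016-02-03", freq="D")
df = pd.DataFrame({"Poss_Date": dates})
mean_day = pd.Timestamp("2016-01-22")
std_days = 11.0

# Whole-day offset of each date from the mean (negative before the mean).
offset = (df["Poss_Date"] - mean_day).dt.days.to_numpy()

# Probability of each day = CDF at end of day minus CDF at start of day.
df["Prob"] = norm.cdf((offset + 1) / std_days) - norm.cdf(offset / std_days)

# Assumption: the event must fall inside the listed range,
# so rescale the daily probabilities to sum to 1.
df["Prob_normalized"] = df["Prob"] / df["Prob"].sum()
```

Without the rescaling, the `Prob` column sums to roughly 0.78 rather than 1, because both tails of the normal curve extend beyond the date range and the upper side is cut off sooner (12 days after the mean versus 14 days before it).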

Upvotes: 2
