Nahid

Reputation: 25

Probabilistic prediction based on occurrence frequency

I have an hourly rainfall time series for 2011-2013, recorded daily between 10 am and 3 pm, in which each observation is coded as 1 (rain) or 0 (no rain). I don't want to predict rainfall amounts for 2014; I want to estimate the chance of rain for each of the same time slots over the whole year, based on how often 1 or 0 occurs in the rain column. Currently, I estimate the chance of rain by counting the occurrences of 1 and 0 with the following code:

import pandas as pd
 
b = {'year':[2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,
             2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,
             2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013],
     'month': [1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12],
     'rain':[1,0,0,0,1,1,0,1,1,0,0,1,0,0,1,0,0,0,1,1,1,1,1,0,0,1,1,0,1,0,1,0,1,0,1,0]}

b = pd.DataFrame(b,columns = ['year','month','rain'])

# Map month numbers (1-12) to abbreviated month names with a lookup table
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

b['X'] = b['month'].map(lambda m: month_names[m - 1])

mask_x = (b['X']=='Jul')

mask_y = b['rain'].loc[mask_x]

mask_y.value_counts()

I don't think this method will scale to large datasets. Can someone suggest an efficient and robust way to estimate the chance of rain from this kind of dataset?
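For reference, the per-month counting above can be done for all months at once with a single `groupby`. This is only a minimal sketch on the toy data from the question. It assumes that 1 marks a rainy observation (the reading the follow-up post uses when interpreting its counts), and the column names `p_rain` and `n_obs` are made up for illustration.

```python
import pandas as pd

# Toy data copied from the question: one 0/1 observation per month, 2011-2013.
b = pd.DataFrame({
    'year': [y for y in (2011, 2012, 2013) for _ in range(12)],
    'month': list(range(1, 13)) * 3,
    'rain': [1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,
             0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,
             0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Assumption: 1 means rain, so the mean of each group is the observed
# fraction of rainy observations for that month.
monthly = b.groupby('month')['rain'].agg(p_rain='mean', n_obs='size')

# July (month 7): 2 rainy observations out of 3.
print(monthly.loc[7])
```

Because the grouping is vectorized, the same line works unchanged whether the frame has 36 rows or several million.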

Upvotes: 2

Views: 288

Answers (2)

r-beginners

Reputation: 35155

The sample data below is generated by randomly drawing 0 or 1 for every hour. Grouping by the month, day and hour of the date column gives, for each time slot, the total of the rain column (sum) and the number of observations (size). The rain frequency for a slot is then sum / size. Creating year, month and abbreviated month-name columns as in your code is not really necessary.

import pandas as pd
import numpy as np
import random

random.seed(20200817)

date_rng = pd.date_range('2013-01-01', '2016-01-01', freq='1H')
rain = random.choices([0,1], k=len(date_rng))
b = pd.DataFrame({'date':pd.to_datetime(date_rng), 'rain':rain})

# Aggregate by name; passing the builtin sum raises a FutureWarning in recent pandas
hour_rain = b.groupby([b.date.dt.month, b.date.dt.day, b.date.dt.hour])['rain'].agg(['sum', 'size'])
hour_rain.index.names = ['month','day','hour']

hour_rain.reset_index()

      month  day  hour  sum  size
0         1    1     0    0     4
1         1    1     1    2     3
2         1    1     2    3     3
3         1    1     3    1     3
4         1    1     4    1     3
...     ...  ...   ...  ...   ...
8755     12   31    19    2     3
8756     12   31    20    2     3
8757     12   31    21    2     3
8758     12   31    22    0     3
8759     12   31    23    0     3
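The last step, dividing the total by the number of observations, could look roughly like the sketch below. It regenerates random data with NumPy, so its numbers will not match the table above. The column name `p_rain` is an assumption for illustration, and 1 is assumed to mean rain.

```python
import numpy as np
import pandas as pd

# Three full years of hourly 0/1 observations (random, for illustration only).
rng = np.random.default_rng(0)
dates = pd.date_range('2013-01-01', '2015-12-31 23:00', freq=pd.Timedelta(hours=1))
b = pd.DataFrame({'date': dates, 'rain': rng.integers(0, 2, size=len(dates))})

# Group by calendar slot (month, day, hour) across all years.
keys = [b['date'].dt.month.rename('month'),
        b['date'].dt.day.rename('day'),
        b['date'].dt.hour.rename('hour')]
hour_rain = b.groupby(keys)['rain'].agg(['sum', 'size'])

# Assumption: 1 means rain, so sum / size is the observed rain frequency.
hour_rain['p_rain'] = hour_rain['sum'] / hour_rain['size']

# Look up one slot, e.g. 20 January at 15:00.
print(hour_rain.loc[(1, 20, 15)])
```

With only three years of data each slot has three observations, so `p_rain` can only take the values 0, 1/3, 2/3 or 1. Keeping `size` next to the rate makes that limitation visible.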

Upvotes: 1

Nahid

Reputation: 25

What I am trying to do looks something like this:

import pandas as pd
import numpy as np
import random

random.seed(20200817)
date_rng = pd.date_range('2013-01-01', '2015-12-31', freq='1H')
rain = random.choices([0,1], k=len(date_rng))
b = pd.DataFrame({'date':pd.to_datetime(date_rng), 'rain':rain})
b['year'] = b['date'].dt.year
b['month'] = b['date'].dt.month
b['day'] = b['date'].dt.day
b['hour'] = b['date'].dt.hour
b['X'] = b['date'].dt.strftime('%b')

b['hour']= b['hour'].astype(str).str.zfill(2)
b['day']= b['day'].astype(str).str.zfill(2)


# Join the month abbreviation, day and hour into one key
b['var'] = b['X']+b['day'].astype(str)+b['hour'].astype(str)


cols = b.columns.tolist()
cols = cols[-1:] + cols[:-1]
b = b[cols]


# drop the unwanted columns
b = b.drop(["date","month","X","hour","day","year"], axis=1)


# now let's say I want to predict the chance of rain for 20 January at 15:00

mask_x = (b['var']=='Jan2015')

mask_y = b['rain'].loc[mask_x]

mask_y.value_counts()

output:
0    2
1    1

# i.e. the chance of rain is 33.33% and the chance of no rain is 66.67%

When I do this with large datasets (more than 20 years), I feel it doesn't work very well.
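One possible way to get the same number without the intermediate string columns and the per-key mask is to build the key with a single `strftime` call and take the group mean. The sketch below uses hand-made data (all zeros except one rainy hour) so the result is easy to check. It assumes that 1 means rain and that the locale yields English month abbreviations for `%b`.

```python
import pandas as pd

# Three full years of hourly observations, all dry (0) to start with.
dates = pd.date_range('2013-01-01', '2015-12-31 23:00', freq=pd.Timedelta(hours=1))
b = pd.DataFrame({'date': dates, 'rain': 0})

# Illustrative data: it rained once on 20 January at 15:00, in 2015 only.
b.loc[b['date'] == pd.Timestamp('2015-01-20 15:00'), 'rain'] = 1

# Build the same kind of key as 'var' in one step: month abbreviation + day + hour.
key = b['date'].dt.strftime('%b%d%H')

# Mean of a 0/1 column per key = observed rain frequency for that slot.
p_rain = b['rain'].groupby(key).mean()

print(round(100 * p_rain['Jan2015'], 2))  # 33.33
```

The resulting `p_rain` Series holds every slot at once, so a lookup for another date and hour is a plain index access rather than a new mask over the full frame.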

Upvotes: 0
