Reputation: 25
I have a rainfall time series for 2011-2013 in which each value is either 1 (rain) or 0 (no rain). The data are hourly and cover 10 am to 3 pm each day. I don't want to predict the actual rainfall for 2014; I want to estimate the chance of rain for each of the same time slots over the whole year, based on how often 1 or 0 occurs in the rainfall column. Currently, I use the following code to estimate the chance of rain by counting the occurrences of 1 and 0:
import pandas as pd
b = {'year':[2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,
2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,2012,
2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013,2013],
'month': [1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,11,12],
'rain':[1,0,0,0,1,1,0,1,1,0,0,1,0,0,1,0,0,0,1,1,1,1,1,0,0,1,1,0,1,0,1,0,1,0,1,0]}
b = pd.DataFrame(b,columns = ['year','month','rain'])
def X(b):
    if (b['month'] == 1):
        return 'Jan'
    elif (b['month']==2):
        return 'Feb'
    elif (b['month']==3):
        return 'Mar'
    elif (b['month']==4):
        return 'Apr'
    elif (b['month']==5):
        return 'May'
    elif (b['month']==6):
        return 'Jun'
    elif (b['month']==7):
        return 'Jul'
    elif (b['month']==8):
        return 'Aug'
    elif (b['month']==9):
        return 'Sep'
    elif (b['month']==10):
        return 'Oct'
    elif (b['month']==11):
        return 'Nov'
    elif (b['month']==12):
        return 'Dec'
b['X'] = b.apply(X,axis =1)
mask_x = (b['X']=='Jul')
mask_y = b['rain'].loc[mask_x]
mask_y.value_counts()
I think this method would not work well for large datasets. Can someone suggest an efficient and robust way to predict the chance of rain from this kind of dataset?
Upvotes: 2
Views: 288
Reputation: 35155
The sample data below was created by randomly choosing 0 or 1 for every hour. The rows are grouped by the month, day and hour of the date column, and for each group we compute the total (sum) and the number of observations (size). You can then calculate the chance of rain for each slot as total / number of observations. You can create the year, month and abbreviated month-name columns as in your code, but that isn't really necessary.
import pandas as pd
import numpy as np
import random
random.seed(20200817)
date_rng = pd.date_range('2013-01-01', '2016-01-01', freq='1H')
rain = random.choices([0,1], k=len(date_rng))
b = pd.DataFrame({'date':pd.to_datetime(date_rng), 'rain':rain})
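# 'sum' counts the rainy hours in each slot, 'size' counts the observations;
# string names avoid the deprecation warning newer pandas raises for agg([sum, np.size])
hour_rain = b.groupby([b.date.dt.month, b.date.dt.day, b.date.dt.hour])['rain'].agg(['sum','size'])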
hour_rain.index.names = ['month','day','hour']
hour_rain.reset_index()
      month  day  hour  sum  size
0         1    1     0    0     4
1         1    1     1    2     3
2         1    1     2    3     3
3         1    1     3    1     3
4         1    1     4    1     3
...     ...  ...   ...  ...   ...
8755     12   31    19    2     3
8756     12   31    20    2     3
8757     12   31    21    2     3
8758     12   31    22    0     3
8759     12   31    23    0     3
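As a rough sketch of that last step (total divided by number of observations), the chance of rain per slot could be computed as below. This assumes 1 means rain. The `chance` column name, the NumPy generator used for the sample data, and the 2013-2015 range are my own choices for illustration, not something taken from your data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly 0/1 data standing in for real observations
# (assumption: 1 = rain, 0 = no rain).
rng = np.random.default_rng(20200817)
dates = pd.date_range("2013-01-01", "2015-12-31 23:00", freq=pd.Timedelta(hours=1))
b = pd.DataFrame({"date": dates, "rain": rng.integers(0, 2, size=len(dates))})

# Group by calendar slot (month, day, hour) across all years.
keys = [
    b["date"].dt.month.rename("month"),
    b["date"].dt.day.rename("day"),
    b["date"].dt.hour.rename("hour"),
]
hour_rain = b.groupby(keys)["rain"].agg(["sum", "size"])

# Empirical chance of rain = rainy observations / all observations in the slot.
hour_rain["chance"] = hour_rain["sum"] / hour_rain["size"]

print(hour_rain.head())
```

Because the values are 0/1, `b.groupby(keys)["rain"].mean()` should give the same numbers in one step. With only a few years of data each slot has very few observations, so the estimates are coarse.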
Upvotes: 1
Reputation: 25
What I am trying to do looks something like this:
import pandas as pd
import numpy as np
import random
random.seed(20200817)
date_rng = pd.date_range('2013-01-01', '2015-12-31', freq='1H')
rain = random.choices([0,1], k=len(date_rng))
b = pd.DataFrame({'date':pd.to_datetime(date_rng), 'rain':rain})
b['year'] = b['date'].dt.year
b['month'] = b['date'].dt.month
b['day'] = b['date'].dt.day
b['hour'] = b['date'].dt.hour
b['X'] = b['date'].dt.strftime('%b')
b['hour']= b['hour'].astype(str).str.zfill(2)
b['day']= b['day'].astype(str).str.zfill(2)
# Join the month abbreviation, day and hour into one key, e.g. 'Jan2015' = 20 January, 15:00
b['var'] = b['X']+b['day'].astype(str)+b['hour'].astype(str)
cols = b.columns.tolist()
cols = cols[-1:] + cols[:-1]
b = b[cols]
# drop the unwanted columns
b = b.drop(["date","month","X","hour","day","year"], axis=1)
# now let's say I want the chance of rain for 20 January at 15:00
mask_x = (b['var']=='Jan2015')
mask_y = b['rain'].loc[mask_x]
mask_y.value_counts()
output:
0 2
1 1
# i.e. the chance of rain is 33.33% and the chance of no rain is 66.67%
When I do this with large datasets (more than 20 years) I feel it doesn't work very well.
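For reference, the same lookup can be sketched without building a string key, by masking directly on the datetime components. This is only an illustration under assumptions: the data are synthetic, 1 is taken to mean rain, and `chance_of_rain` is a hypothetical helper name, not an existing function.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a long record (21 years of hourly 0/1 values;
# assumption: 1 = rain, 0 = no rain).
rng = np.random.default_rng(0)
dates = pd.date_range("1993-01-01", "2013-12-31 23:00", freq=pd.Timedelta(hours=1))
b = pd.DataFrame({"date": dates, "rain": rng.integers(0, 2, size=len(dates))})


def chance_of_rain(df, month, day, hour):
    """Hypothetical helper: fraction of observations with rain in one slot."""
    d = df["date"].dt
    mask = (d.month == month) & (d.day == day) & (d.hour == hour)
    # The mean of a 0/1 column is the share of 1s, i.e. the empirical chance of rain.
    return df.loc[mask, "rain"].mean()


p = chance_of_rain(b, month=1, day=20, hour=15)
print(f"chance of rain on 20 January at 15:00: {p:.2%}")
```

This avoids the string concatenation and column reshuffling, which is where most of the work goes in the version above.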
Upvotes: 0