Reputation: 43289
I have a pandas data frame as follows:
ID Start End
1 2013-03-03 2013-04-05
2 2013-03-03 2013-04-05
3 2012-03-02 2012-05-05
4 2013-03-01 2013-04-04
5 2013-02-25 2013-06-05
6 2012-02-26 2012-05-05
7 2013-02-25 2013-04-27
8 2012-02-26 2012-03-01
9 2013-03-22 2013-03-25
10 2013-06-27 2013-07-01
Is there a way to aggregate the number of days that occured per month:Year for each interval for each ID.
The output I am trying to get to is the sum of all the days per month:Year that each ID contributes.
So, ID1 will contrinute 29 days to March 2013 and 5 days to April 2013; ID3 will contribute 30days to March 2012, 30 days to April 2012, and 5 days to May 2012. Etc....
Thanks in advance if you can help.
Upvotes: 3
Views: 217
Reputation: 375745
Here's an awful way to do it:
In [11]: df1.apply(lambda x: pd.Series(1, pd.date_range(x.loc['Start'], x.loc['End'] - pd.offsets.Day(1), freq='D')).resample('M', how='sum'), axis=1).fillna(0)
Out[11]:
2012-02-29 2012-03-31 2012-04-30 2012-05-31 2013-02-28 2013-03-31 2013-04-30 2013-05-31 2013-06-30
ID
1 0 0 0 0 0 29 4 0 0
2 0 0 0 0 0 29 4 0 0
3 0 30 30 4 0 0 0 0 0
4 0 0 0 0 0 31 3 0 0
5 0 0 0 0 4 31 30 31 4
6 4 31 30 4 0 0 0 0 0
7 0 0 0 0 4 31 26 0 0
8 4 0 0 0 0 0 0 0 0
9 0 0 0 0 0 3 0 0 0
10 0 0 0 0 0 0 0 0 4
To break it down, for each row we are calculating the days in each month by creating a Series of 1s for every day between the start and end, then summing them up for each month using resample:
In [12]: x = df1.iloc[0]
In [13]: x
Out[13]:
Start 2013-03-03 00:00:00
End 2013-04-05 00:00:00
Name: 1, dtype: datetime64[ns]
In [14]: pd.Series(1, pd.date_range(x['Start'], x['End'] - pd.offsets.Day(1), freq='D')).resample('M', how='sum')
Out[14]:
2013-03-31 29
2013-04-30 4
Freq: M, dtype: int64
Upvotes: 4
Reputation: 226604
The datetime module has the tools you need to do month by month manipulation of dates.
Here's some quick stand-alone code that can easily be adapted to a Pandas dataframe:
import datetime
import collections
import pprint
data = '''\
1 2013-03-03 2013-04-05
2 2013-03-03 2013-04-05
3 2012-03-02 2012-05-05
4 2013-03-01 2013-04-04
5 2013-02-25 2013-06-05
6 2012-02-26 2012-05-05
7 2013-02-25 2013-04-27
8 2012-02-26 2012-03-01
9 2013-03-22 2013-03-25
10 2013-06-27 2013-07-01
'''
for line in data.splitlines():
idnum, startstr, endstr = line.split()
start = datetime.datetime.strptime(startstr, '%Y-%m-%d')
end = datetime.datetime.strptime(endstr, '%Y-%m-%d')
days_in_month = collections.Counter()
current = start
while current <= end:
days_in_month[current.year, current.month] += 1
current += datetime.timedelta(1)
print('Id: %s' % idnum)
pprint.pprint(dict(days_in_month))
print('-' * 10)
Upvotes: 1