Reputation: 667
I currently have a dataset structured the following way:
id_number start_date end_date data1 data2 data3 ...
Basically, I have a whole bunch of ids, each with a date range and multiple columns of summary data. My problem is that I need yearly totals of the summary data, which means I need to get to a place where I can group by year on a single occurrence of each document. However, it is not guaranteed that a document exists for a given year, and the date ranges can span multiple years. Any help would be greatly appreciated; I am quite stuck.
Sample dataframe:
import pandas as pd
df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
df.start = pd.to_datetime(df['start'], format= "%m/%d/%Y")
df.end = pd.to_datetime(df['end'], format= "%m/%d/%Y")
Upvotes: 0
Views: 266
Reputation: 9941
Assuming we have a DataFrame df:
id_num start end value
0 1 2002-03-10 2005-04-12 1
1 1 2005-04-13 2005-05-20 2
2 1 2007-05-21 2009-08-10 3
3 2 2012-02-20 2015-02-20 4
4 3 2003-10-19 2012-12-12 5
we can create a row for each year in each start-to-end range with:
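import numpy as np
import pandas as pd

# one array of calendar years per row, from start.year through end.year
ys = [np.arange(x[0], x[1]+1) for x in zip(df['start'].dt.year, df['end'].dt.year)]
df = (pd.DataFrame(ys, df.index)
    .stack()
    .astype(int)
    .reset_index(level=1, drop=True)  # keyword form; positional arguments fail on recent pandas
    .to_frame('year')
    .join(df, how='left')
    .reset_index())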
print(df)
Here we first create the ys variable, which holds the list of years for each start-end range in our DataFrame. The df = ... statement then splits these year lists into separate rows and joins them back to the original DataFrame (very similar to what's done in this post: How to convert column with list of values into rows in Pandas DataFrame).
Output:
index year id_num start end value
0 0 2002 1 2002-03-10 2005-04-12 1
1 0 2003 1 2002-03-10 2005-04-12 1
2 0 2004 1 2002-03-10 2005-04-12 1
3 0 2005 1 2002-03-10 2005-04-12 1
4 1 2005 1 2005-04-13 2005-05-20 2
5 2 2007 1 2007-05-21 2009-08-10 3
6 2 2008 1 2007-05-21 2009-08-10 3
7 2 2009 1 2007-05-21 2009-08-10 3
8 3 2012 2 2012-02-20 2015-02-20 4
9 3 2013 2 2012-02-20 2015-02-20 4
10 3 2014 2 2012-02-20 2015-02-20 4
11 3 2015 2 2012-02-20 2015-02-20 4
12 4 2003 3 2003-10-19 2012-12-12 5
13 4 2004 3 2003-10-19 2012-12-12 5
14 4 2005 3 2003-10-19 2012-12-12 5
15 4 2006 3 2003-10-19 2012-12-12 5
16 4 2007 3 2003-10-19 2012-12-12 5
17 4 2008 3 2003-10-19 2012-12-12 5
18 4 2009 3 2003-10-19 2012-12-12 5
19 4 2010 3 2003-10-19 2012-12-12 5
20 4 2011 3 2003-10-19 2012-12-12 5
21 4 2012 3 2003-10-19 2012-12-12 5
Note:
I changed the original ranges to test the case where some years are missing for an id_num. For example, for id_num=1 we have the ranges 2002-2005, 2005-2005 and 2007-2009, so we should not get 2006 for id_num=1 in the output (and we don't, so it passes the test).
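The same expansion can also be sketched with DataFrame.explode, which is available in newer pandas versions and turns a column of lists into one row per element. This is only a sketch under assumptions: the frame below re-creates the modified sample data used in this answer, and the final groupby assumes that a row's full value should count toward every year its range touches, which may or may not be the allocation the question needs.

```python
import pandas as pd

# Re-creation of the modified sample data from this answer.
df = pd.DataFrame({
    "id_num": [1, 1, 1, 2, 3],
    "start": pd.to_datetime(["2002-03-10", "2005-04-13", "2007-05-21",
                             "2012-02-20", "2003-10-19"]),
    "end": pd.to_datetime(["2005-04-12", "2005-05-20", "2009-08-10",
                           "2015-02-20", "2012-12-12"]),
    "value": [1, 2, 3, 4, 5],
})

# One list of years per row, then one output row per (original row, year).
df["year"] = [list(range(s, e + 1))
              for s, e in zip(df["start"].dt.year, df["end"].dt.year)]
expanded = df.explode("year").astype({"year": int}).reset_index(drop=True)

# Yearly totals (assumption: the whole value counts in every covered year).
yearly = expanded.groupby("year")["value"].sum()
print(yearly)
```

Grouping by ["id_num", "year"] instead of "year" alone would give per-id yearly totals.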
Upvotes: 2
Reputation: 11657
I've taken your example and added some random values so we have something to work with:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
df.start = pd.to_datetime(df['start'], format= "%m/%d/%Y")
df.end = pd.to_datetime(df['end'], format= "%m/%d/%Y")
np.random.seed(0) # seeding the random values for reproducibility
df['value'] = np.random.random(len(df))
So far we have:
id_num start end value
0 1 2002-03-10 2005-04-12 0.548814
1 1 2005-04-13 2005-05-20 0.715189
2 1 2005-05-21 2009-08-10 0.602763
3 2 2012-02-20 2015-02-20 0.544883
4 3 2003-10-19 2012-12-12 0.423655
We want a value for the year of each given date, whether it is a start date or an end date, so we will treat all dates the same. We just want date + id + value:
tmp = df[['end', 'value']].copy()
tmp = tmp.rename(columns={'end':'start'})
new = pd.concat([df[['start', 'value']], tmp], sort=True)
new['id_num'] = pd.concat([df.id_num, df.id_num]) # doubling the id numbers (Series.append was removed in pandas 2.0)
Giving us:
start value id_num
0 2002-03-10 0.548814 1
1 2005-04-13 0.715189 1
2 2005-05-21 0.602763 1
3 2012-02-20 0.544883 2
4 2003-10-19 0.423655 3
0 2005-04-12 0.548814 1
1 2005-05-20 0.715189 1
2 2009-08-10 0.602763 1
3 2015-02-20 0.544883 2
4 2012-12-12 0.423655 3
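As an aside, the same long layout (one row per date, with the start and end dates stacked in a single column) can be sketched with DataFrame.melt instead of the copy/rename/concat steps. The two-row frame and the column names kind and date below are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Small illustrative frame with the same column layout as the example.
df = pd.DataFrame({
    "id_num": [1, 2],
    "start": pd.to_datetime(["2002-03-10", "2012-02-20"]),
    "end": pd.to_datetime(["2005-04-12", "2015-02-20"]),
    "value": [0.5, 0.25],
})

# Stack the start and end columns into a single date column;
# id_num and value are repeated for each stacked row.
long_df = df.melt(id_vars=["id_num", "value"],
                  value_vars=["start", "end"],
                  var_name="kind", value_name="date")
print(long_df)
```

Because melt carries id_num along, the separate step of doubling the id numbers is not needed in this variant.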
Now we can group by ID number and year:
new = new.groupby(['id_num', new.start.dt.year])[['value']].sum().reset_index(0).sort_index() # select value explicitly: recent pandas cannot sum the datetime column
id_num value
start
2002 1 0.548814
2003 3 0.423655
2005 1 2.581956
2009 1 0.602763
2012 2 0.544883
2012 3 0.423655
2015 2 0.544883
And finally, for each user we expand the range to have every year in between, filling forward missing data:
new = new.groupby('id_num')[['value']].apply(lambda x: x.reindex(pd.RangeIndex(x.index.min(), x.index.max() + 1)).ffill()) # ffill() replaces the deprecated fillna(method='ffill')
value
id_num
1 2002 0.548814
2003 0.548814
2004 0.548814
2005 2.581956
2006 2.581956
2007 2.581956
2008 2.581956
2009 0.602763
2 2012 0.544883
2013 0.544883
2014 0.544883
2015 0.544883
3 2003 0.423655
2004 0.423655
2005 0.423655
2006 0.423655
2007 0.423655
2008 0.423655
2009 0.423655
2010 0.423655
2011 0.423655
2012 0.423655
Upvotes: 0