Filter Dates in Pandas

Question

I currently have a dataset structured the following way:

id_number    start_date    end_date   data1    data2    data3   ...

Basically, I have a whole bunch of ids, each with a date range and multiple columns of summary data. My problem is that I need yearly totals of the summary data, which means I need to get to a place where I can group by year on a single occurrence of each document. However, a document is not guaranteed to exist for a given year, and the date ranges can span multiple years. Any help would be greatly appreciated; I am quite stuck.

Sample dataframe:

import pandas as pd

df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']],
                  columns=['id_num', 'start', 'end'])
# Parse the month/day/year strings into datetime64 columns
df['start'] = pd.to_datetime(df['start'], format="%m/%d/%Y")
df['end'] = pd.to_datetime(df['end'], format="%m/%d/%Y")

perl · Accepted Answer

Assuming we have a DataFrame df:

   id_num      start        end  value
0       1 2002-03-10 2005-04-12      1
1       1 2005-04-13 2005-05-20      2
2       1 2007-05-21 2009-08-10      3
3       2 2012-02-20 2015-02-20      4
4       3 2003-10-19 2012-12-12      5

we can create a row for each year for our start to end ranges with:

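import numpy as np
import pandas as pd

# One array of calendar years per row, inclusive of both the start and end year
ys = [np.arange(x[0], x[1]+1) for x in zip(df['start'].dt.year, df['end'].dt.year)]

df = (pd.DataFrame(ys, index=df.index)   # ragged year lists, padded with NaN
     .stack()                            # one row per (original index, position)
     .dropna()                           # drop NaN padding if stack() kept it
     .astype(int)
     .reset_index(level=1, drop=True)    # keyword arguments; positional ones fail on pandas 2.x
     .to_frame('year')
     .join(df, how='left')               # bring back the original columns
     .reset_index())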

print(df)

Here we first build ys, a list of year arrays, one per start-end range in our DataFrame. The df = ... chain then splits these year lists into separate rows and joins them back to the original DataFrame (very similar to what's done in this post: How to convert column with list of values into rows in Pandas DataFrame).

Output:

    index  year  id_num      start        end  value
0       0  2002       1 2002-03-10 2005-04-12      1
1       0  2003       1 2002-03-10 2005-04-12      1
2       0  2004       1 2002-03-10 2005-04-12      1
3       0  2005       1 2002-03-10 2005-04-12      1
4       1  2005       1 2005-04-13 2005-05-20      2
5       2  2007       1 2007-05-21 2009-08-10      3
6       2  2008       1 2007-05-21 2009-08-10      3
7       2  2009       1 2007-05-21 2009-08-10      3
8       3  2012       2 2012-02-20 2015-02-20      4
9       3  2013       2 2012-02-20 2015-02-20      4
10      3  2014       2 2012-02-20 2015-02-20      4
11      3  2015       2 2012-02-20 2015-02-20      4
12      4  2003       3 2003-10-19 2012-12-12      5
13      4  2004       3 2003-10-19 2012-12-12      5
14      4  2005       3 2003-10-19 2012-12-12      5
15      4  2006       3 2003-10-19 2012-12-12      5
16      4  2007       3 2003-10-19 2012-12-12      5
17      4  2008       3 2003-10-19 2012-12-12      5
18      4  2009       3 2003-10-19 2012-12-12      5
19      4  2010       3 2003-10-19 2012-12-12      5
20      4  2011       3 2003-10-19 2012-12-12      5
21      4  2012       3 2003-10-19 2012-12-12      5
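
On pandas 0.25 or newer, the same one-row-per-year expansion can also be sketched with DataFrame.explode. This swaps in a different technique from the stack/join chain above, it is not the answer's original method. The variable names years and expanded are made up for this illustration, and the data is the test DataFrame shown earlier.

```python
import pandas as pd

# Same test data as the answer's DataFrame above.
df = pd.DataFrame({
    'id_num': [1, 1, 1, 2, 3],
    'start': pd.to_datetime(['2002-03-10', '2005-04-13', '2007-05-21',
                             '2012-02-20', '2003-10-19']),
    'end': pd.to_datetime(['2005-04-12', '2005-05-20', '2009-08-10',
                           '2015-02-20', '2012-12-12']),
    'value': [1, 2, 3, 4, 5],
})

# One list of calendar years per row, inclusive of both endpoints.
years = [list(range(s, e + 1))
         for s, e in zip(df['start'].dt.year, df['end'].dt.year)]

# explode() gives every list element its own row and repeats the other columns.
expanded = (df.assign(year=pd.Series(years, index=df.index))
              .explode('year')
              .astype({'year': int})      # explode leaves an object column
              .reset_index(drop=True))

print(expanded)
```

Unlike the chain above, this version does not keep the original row label as an index column; dropping reset_index(drop=True) preserves it in the index if it is needed.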

Note: I changed the original ranges to test cases where some years are missing for an id_num. For example, id_num=1 has the ranges 2002-2005, 2005-2005 and 2007-2009, so 2006 should not appear for id_num=1 in the output (and it doesn't, so it passes the test).
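
Once there is one row per year, the yearly totals the question asks for might be sketched as below. This is only an illustration under assumptions: "a single occurrence of each document" is read here as keeping the first row per (id_num, year) pair, value stands in for the data1, data2, ... columns, and the expansion uses explode for brevity. The names expanded, one_per_id_year and yearly_totals are hypothetical.

```python
import pandas as pd

# Same test data as the answer's DataFrame above.
df = pd.DataFrame({
    'id_num': [1, 1, 1, 2, 3],
    'start': pd.to_datetime(['2002-03-10', '2005-04-13', '2007-05-21',
                             '2012-02-20', '2003-10-19']),
    'end': pd.to_datetime(['2005-04-12', '2005-05-20', '2009-08-10',
                           '2015-02-20', '2012-12-12']),
    'value': [1, 2, 3, 4, 5],
})

# Expand every start-end range into one row per calendar year.
years = [list(range(s, e + 1))
         for s, e in zip(df['start'].dt.year, df['end'].dt.year)]
expanded = (df.assign(year=pd.Series(years, index=df.index))
              .explode('year')
              .astype({'year': int}))

# Assumption: when an id has several ranges touching the same year
# (id_num=1 in 2005), only the first row is counted for that year.
one_per_id_year = expanded.drop_duplicates(subset=['id_num', 'year'], keep='first')

# Sum the summary column per year across all ids.
yearly_totals = one_per_id_year.groupby('year')['value'].sum()

print(yearly_totals)
```

If every overlapping range should contribute to a year instead, skipping the drop_duplicates step and grouping expanded directly would do that; which rule is right depends on what the summary columns mean.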
