Stephen Strosko

Reputation: 667

Filter Dates in Pandas

I currently have a dataset structured as follows:

id_number    start_date    end_date   data1    data2    data3   ...

Basically, I have many ids, each with a date range and several columns of summary data. I need yearly totals of the summary data, which means I need to get to a point where I can group by year on a single occurrence of each document. However, a document is not guaranteed to exist for a given year, and a date range can span multiple years. Any help would be greatly appreciated; I am quite stuck.

Sample dataframe:

import pandas as pd

df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
# parse the month/day/year strings into datetime columns
df['start'] = pd.to_datetime(df['start'], format="%m/%d/%Y")
df['end'] = pd.to_datetime(df['end'], format="%m/%d/%Y")

Upvotes: 0

Views: 266

Answers (2)

perl

Reputation: 9941

Assuming we have a DataFrame df:

   id_num      start        end  value
0       1 2002-03-10 2005-04-12      1
1       1 2005-04-13 2005-05-20      2
2       1 2007-05-21 2009-08-10      3
3       2 2012-02-20 2015-02-20      4
4       3 2003-10-19 2012-12-12      5

we can create a row for each year for our start to end ranges with:

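import numpy as np
import pandas as pd

# one array of calendar years per row, from the start year through the end year
ys = [np.arange(x[0], x[1]+1) for x in zip(df['start'].dt.year, df['end'].dt.year)]

df = (pd.DataFrame(ys, df.index)  # one row per range, shorter ranges padded with NaN
     .stack()                     # one entry per (original index, year)
     .dropna()                    # drop the NaN padding in case stack() kept it
     .astype(int)
     .reset_index(level=1, drop=True)  # keyword arguments; positional ones were removed in pandas 2.0
     .to_frame('year')
     .join(df, how='left')
     .reset_index())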

print(df)

Here we first build ys, a list holding the years covered by each start-end range in our DataFrame. The df = ... expression then splits these year lists into separate rows and joins them back to the original DataFrame (very similar to what's done in this post: How to convert column with list of values into rows in Pandas DataFrame).

Output:

    index  year  id_num      start        end  value
0       0  2002       1 2002-03-10 2005-04-12      1
1       0  2003       1 2002-03-10 2005-04-12      1
2       0  2004       1 2002-03-10 2005-04-12      1
3       0  2005       1 2002-03-10 2005-04-12      1
4       1  2005       1 2005-04-13 2005-05-20      2
5       2  2007       1 2007-05-21 2009-08-10      3
6       2  2008       1 2007-05-21 2009-08-10      3
7       2  2009       1 2007-05-21 2009-08-10      3
8       3  2012       2 2012-02-20 2015-02-20      4
9       3  2013       2 2012-02-20 2015-02-20      4
10      3  2014       2 2012-02-20 2015-02-20      4
11      3  2015       2 2012-02-20 2015-02-20      4
12      4  2003       3 2003-10-19 2012-12-12      5
13      4  2004       3 2003-10-19 2012-12-12      5
14      4  2005       3 2003-10-19 2012-12-12      5
15      4  2006       3 2003-10-19 2012-12-12      5
16      4  2007       3 2003-10-19 2012-12-12      5
17      4  2008       3 2003-10-19 2012-12-12      5
18      4  2009       3 2003-10-19 2012-12-12      5
19      4  2010       3 2003-10-19 2012-12-12      5
20      4  2011       3 2003-10-19 2012-12-12      5
21      4  2012       3 2003-10-19 2012-12-12      5

Note: I changed the original ranges to test the case where some years are missing for an id_num. For id_num=1 we have the ranges 2002-2005, 2005-2005 and 2007-2009, so 2006 should not appear for id_num=1 in the output (and it doesn't, so it passes the test).
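For what it's worth, newer pandas versions have DataFrame.explode, which might make the same expansion a bit shorter. This is only a sketch of that alternative, not the code above: the names years, expanded and years_id1 are made up for illustration, and the data is retyped from the table at the top of this answer. It also re-checks that 2006 is missing for id_num=1.

```python
import pandas as pd

# Same rows as the DataFrame shown at the top of this answer.
df = pd.DataFrame({
    "id_num": [1, 1, 1, 2, 3],
    "start": pd.to_datetime(["2002-03-10", "2005-04-13", "2007-05-21",
                             "2012-02-20", "2003-10-19"]),
    "end": pd.to_datetime(["2005-04-12", "2005-05-20", "2009-08-10",
                           "2015-02-20", "2012-12-12"]),
    "value": [1, 2, 3, 4, 5],
})

# One list of calendar years per row, start year through end year inclusive.
years = [list(range(s, e + 1))
         for s, e in zip(df["start"].dt.year, df["end"].dt.year)]

# explode() gives each list element its own row and repeats the other columns.
expanded = (df.assign(year=pd.Series(years, index=df.index))
              .explode("year")
              .astype({"year": int})
              .reset_index(drop=True))

# Years covered by id_num 1; 2006 should not be among them.
years_id1 = sorted(expanded.loc[expanded["id_num"] == 1, "year"].unique().tolist())
print(years_id1)
```

The row count should match the 22 rows in the output above, since both approaches emit one row per (range, year) pair.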

Upvotes: 2

Josh Friedlander

Reputation: 11657

I've taken your example and added some random values so we have something to work with:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
df['start'] = pd.to_datetime(df['start'], format="%m/%d/%Y")
df['end'] = pd.to_datetime(df['end'], format="%m/%d/%Y")

np.random.seed(0)  # seeding the random values for reproducibility
df['value'] = np.random.random(len(df))

So far we have:

   id_num      start        end     value
0       1 2002-03-10 2005-04-12  0.548814
1       1 2005-04-13 2005-05-20  0.715189
2       1 2005-05-21 2009-08-10  0.602763
3       2 2012-02-20 2015-02-20  0.544883
4       3 2003-10-19 2012-12-12  0.423655

We want a yearly value for each given date, whether it marks the beginning or the end of a range, so we will treat all dates the same. We just want date + user + value:

tmp = df[['end', 'value']].copy()
tmp = tmp.rename(columns={'end':'start'})
new = pd.concat([df[['start', 'value']], tmp], sort=True)
new['id_num'] = pd.concat([df.id_num, df.id_num]).to_numpy()  # doubling the id numbers (Series.append was removed in pandas 2.0)

Giving us:

       start     value  id_num
0 2002-03-10  0.548814       1
1 2005-04-13  0.715189       1
2 2005-05-21  0.602763       1
3 2012-02-20  0.544883       2
4 2003-10-19  0.423655       3
0 2005-04-12  0.548814       1
1 2005-05-20  0.715189       1
2 2009-08-10  0.602763       1
3 2015-02-20  0.544883       2
4 2012-12-12  0.423655       3
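As an aside, the same "stack the start and end dates into one column" step could probably also be done with DataFrame.melt. The following is just a sketch of that alternative under the same sample data; the names stacked, boundary and date are mine, not part of the code above.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the same seeded random values.
df = pd.DataFrame({
    "id_num": [1, 1, 1, 2, 3],
    "start": pd.to_datetime(["2002-03-10", "2005-04-13", "2005-05-21",
                             "2012-02-20", "2003-10-19"]),
    "end": pd.to_datetime(["2005-04-12", "2005-05-20", "2009-08-10",
                           "2015-02-20", "2012-12-12"]),
})
np.random.seed(0)
df["value"] = np.random.random(len(df))

# melt() moves the start and end columns into a single date column and
# keeps id_num and value next to every date.
stacked = df.melt(id_vars=["id_num", "value"],
                  value_vars=["start", "end"],
                  var_name="boundary",   # hypothetical name: "start" or "end"
                  value_name="date")     # hypothetical name for the date column
print(stacked)
```

Unlike the concat version, this keeps a column recording whether each date was a start or an end, and it gets a fresh 0..9 index instead of the repeated 0..4 one.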

Now we can group by ID number and year:

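# select only 'value' so the datetime 'start' column is not summed (recent pandas versions raise on that)
new = new.groupby(['id_num', new.start.dt.year])[['value']].sum().reset_index(0).sort_index()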

    id_num  value
start       
2002    1   0.548814
2003    3   0.423655
2005    1   2.581956
2009    1   0.602763
2012    2   0.544883
2012    3   0.423655
2015    2   0.544883
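To see where the 2005 figure for id_num 1 comes from, here is the arithmetic spelled out using the rounded values printed earlier. It's only an illustration (the variable names are made up), but it shows that the second row is counted twice because both its start and its end fall in 2005.

```python
# Values for id_num 1 as printed above, rounded to six decimals.
row0, row1, row2 = 0.548814, 0.715189, 0.602763

# Dates of id_num 1 that fall in 2005:
#   the end of row 0 (2005-04-12),
#   the start and the end of row 1 (2005-04-13 and 2005-05-20),
#   the start of row 2 (2005-05-21).
total_2005 = row0 + row1 + row1 + row2
print(total_2005)
```

The result agrees with the 2.581956 in the table up to rounding of the inputs.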

And finally, for each user we expand the range to have every year in between, filling forward missing data:

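# select 'value' before apply so the result does not depend on how pandas treats the grouping column;
# .ffill() replaces the deprecated fillna(method='ffill')
new = new.groupby('id_num')[['value']].apply(lambda x: x.reindex(pd.RangeIndex(x.index.min(), x.index.max() + 1)).ffill())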

             value
id_num      
1   2002    0.548814
    2003    0.548814
    2004    0.548814
    2005    2.581956
    2006    2.581956
    2007    2.581956
    2008    2.581956
    2009    0.602763
2   2012    0.544883
    2013    0.544883
    2014    0.544883
    2015    0.544883
3   2003    0.423655
    2004    0.423655
    2005    0.423655
    2006    0.423655
    2007    0.423655
    2008    0.423655
    2009    0.423655
    2010    0.423655
    2011    0.423655
    2012    0.423655
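In case the reindex plus forward-fill step is unclear, here is a minimal sketch of it on a single made-up Series (the values and the names s, full and filled are hypothetical, not taken from the data above).

```python
import pandas as pd

# Hypothetical yearly values for one id: only 2002, 2005 and 2009 have data.
s = pd.Series([10.0, 20.0, 30.0], index=[2002, 2005, 2009])

# Reindex onto every year in the span; years without data become NaN.
full = s.reindex(range(s.index.min(), s.index.max() + 1))

# ffill() copies the last known value forward into each gap.
filled = full.ffill()
print(filled)
```

This is why, in the output above, 2006-2008 for id_num 1 carry the 2005 value of 2.581956 forward: the gap years take whatever the most recent year with data had.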

Upvotes: 0
