Reputation: 149
I have a pandas dataframe such as:
ids year month visit_counts
112 2012 1 5
112 2012 5 25
414 2012 2 10
112 2013 5 10
414 2012 4 35
112 2013 8 16
112 2013 12 23
112 2014 2 20
112 2014 4 28
414 2013 6 15
414 2013 11 30
414 2014 2 24
414 2014 7 17
Each row represent a record with a date of occurrence. lets say, the first date of occurrence associated with id 112 is (year 2012, and month january (1)), and this first date of occurrence for different ids are different. I need to create new dataframe where for every id I want to have all the records within two years starting since the first date of occurrence.
So, the output dataframe should look like this:
ids year month visit_counts
112 2012 1 5
112 2012 5 25
414 2012 2 10
112 2013 5 10
414 2012 4 35
112 2013 8 16
112 2013 12 23
414 2013 6 15
414 2013 11 30
414 2014 2 24
It would be helpful, if anyone can provide the codes. Thank you!
Upvotes: 2
Views: 255
Reputation: 323226
Here is one way
s=df.eval('year*12+month')
df[s<s.groupby(df['ids']).transform('min')+24]
ids year month visit_counts
0 112 2012 1 5
1 112 2012 5 25
2 414 2012 2 10
3 112 2013 5 10
4 414 2012 4 35
5 112 2013 8 16
6 112 2013 12 23
9 414 2013 6 15
10 414 2013 11 30
Upvotes: 4
Reputation: 3224
First we create a helper column time_in_months
to be able to compare the dates. Then use .groupby
and .transform
to find the dates which are at maximum 24 months apart from the minimum date and filter for these.
df['time_in_months'] = df['year'] * 12 + df['month']
df[df.groupby('ids').time_in_months.transform(lambda x: x -x.min() <= 24 )].drop('time_in_months', 1)
result
ids year month visit_counts
0 112 2012 1 5
1 112 2012 5 25
2 414 2012 2 10
3 112 2013 5 10
4 414 2012 4 35
5 112 2013 8 16
6 112 2013 12 23
9 414 2013 6 15
10 414 2013 11 30
11 414 2014 2 24
Upvotes: 5
Reputation: 59519
Work with datetime64
. We create a date, use groupby
+ transform
to obtain the first date for each ID, then we can keep only rows within 2 years with a simple Boolean mask.
I've left the extra columns in for illustration, but drop them if you'd like.
import pandas as pd
df['date'] = pd.to_datetime(df.assign(day=1)[['year', 'month', 'day']])
df['first_d'] = df.groupby('ids').date.transform('min')
m = df.date <= df.first_d + pd.offsets.DateOffset(years=2)
df.loc[m]
ids year month visit_counts date first_d
0 112 2012 1 5 2012-01-01 2012-01-01
1 112 2012 5 25 2012-05-01 2012-01-01
2 414 2012 2 10 2012-02-01 2012-02-01
3 112 2013 5 10 2013-05-01 2012-01-01
4 414 2012 4 35 2012-04-01 2012-02-01
5 112 2013 8 16 2013-08-01 2012-01-01
6 112 2013 12 23 2013-12-01 2012-01-01
9 414 2013 6 15 2013-06-01 2012-02-01
10 414 2013 11 30 2013-11-01 2012-02-01
11 414 2014 2 24 2014-02-01 2012-02-01
Upvotes: 4