Md Mahmudul Hasan
Md Mahmudul Hasan

Reputation: 149

How to select rows within two dates in a pandas dataframe?

I have a pandas dataframe such as:

ids    year    month    visit_counts
112    2012      1           5
112    2012      5           25
414    2012      2           10
112    2013      5           10
414    2012      4           35
112    2013      8           16
112    2013     12           23
112    2014      2           20
112    2014      4           28
414    2013      6           15
414    2013     11           30
414    2014      2           24
414    2014      7           17

Each row represent a record with a date of occurrence. lets say, the first date of occurrence associated with id 112 is (year 2012, and month january (1)), and this first date of occurrence for different ids are different. I need to create new dataframe where for every id I want to have all the records within two years starting since the first date of occurrence.

So, the output dataframe should look like this:

ids    year    month    visit_counts
112    2012      1           5
112    2012      5           25
414    2012      2           10
112    2013      5           10
414    2012      4           35
112    2013      8           16
112    2013     12           23
414    2013      6           15
414    2013     11           30
414    2014      2           24

It would be helpful, if anyone can provide the codes. Thank you!

Upvotes: 2

Views: 255

Answers (3)

BENY
BENY

Reputation: 323226

Here is one way

s=df.eval('year*12+month')
df[s<s.groupby(df['ids']).transform('min')+24]
    ids  year  month  visit_counts
0   112  2012      1             5
1   112  2012      5            25
2   414  2012      2            10
3   112  2013      5            10
4   414  2012      4            35
5   112  2013      8            16
6   112  2013     12            23
9   414  2013      6            15
10  414  2013     11            30

Upvotes: 4

pythonic833
pythonic833

Reputation: 3224

First we create a helper column time_in_months to be able to compare the dates. Then use .groupby and .transform to find the dates which are at maximum 24 months apart from the minimum date and filter for these.

df['time_in_months'] = df['year'] * 12 + df['month']
df[df.groupby('ids').time_in_months.transform(lambda x: x -x.min() <= 24  )].drop('time_in_months', 1)

result

        ids     year    month visit_counts
0       112     2012    1     5
1       112     2012    5     25
2       414     2012    2     10
3       112     2013    5     10
4       414     2012    4     35
5       112     2013    8     16
6       112     2013    12    23
9       414     2013    6     15
10      414     2013    11    30
11      414     2014    2     24

Upvotes: 5

ALollz
ALollz

Reputation: 59519

Work with datetime64. We create a date, use groupby + transform to obtain the first date for each ID, then we can keep only rows within 2 years with a simple Boolean mask.

I've left the extra columns in for illustration, but drop them if you'd like.

import pandas as pd

df['date'] = pd.to_datetime(df.assign(day=1)[['year', 'month', 'day']])
df['first_d'] = df.groupby('ids').date.transform('min')
m = df.date <= df.first_d + pd.offsets.DateOffset(years=2)

df.loc[m]

    ids  year  month  visit_counts       date    first_d
0   112  2012      1             5 2012-01-01 2012-01-01
1   112  2012      5            25 2012-05-01 2012-01-01
2   414  2012      2            10 2012-02-01 2012-02-01
3   112  2013      5            10 2013-05-01 2012-01-01
4   414  2012      4            35 2012-04-01 2012-02-01
5   112  2013      8            16 2013-08-01 2012-01-01
6   112  2013     12            23 2013-12-01 2012-01-01
9   414  2013      6            15 2013-06-01 2012-02-01
10  414  2013     11            30 2013-11-01 2012-02-01
11  414  2014      2            24 2014-02-01 2012-02-01

Upvotes: 4

Related Questions