Reputation: 49
I have a DataFrame of dates in pandas and I want to filter it so that 'date_id' falls between 'start_date' and 'end_date':
date_id start_date end_date
0 2010-06-04 2008-08-01 2008-09-26
1 2010-06-04 2008-08-01 2008-09-26
2 2010-06-04 2008-08-01 2008-09-26
3 2010-06-04 2008-08-26 2008-10-26
4 2010-06-04 2010-05-01 2010-09-26
5 2010-06-04 2008-08-01 2008-09-26
6 2010-06-04 2008-08-01 2008-09-26
7 2010-09-04 2010-08-01 2010-09-26
I've tried using the code below:
df[(df['date_id'] >= df['start_date'] & df['date_id']<= df['end_date')]
The code above results in a key error. I am a new pandas user so any assistance/documentation would be incredibly helpful.
Upvotes: 2
Views: 82
Reputation: 862396
I think you need to change the column name to end_date_y and wrap each comparison in parentheses, because & has higher operator precedence than >= and <=:
df1 = df[(df['date_id'] >= df['start_date']) & (df['date_id']<= df['end_date_y'])]
Or use between:
df1 = df[df['date_id'].between(df['start_date'], df['end_date_y'])]
print (df1)
date_id start_date end_date_y
4 2010-06-04 2010-05-01 2010-09-26
7 2010-09-04 2010-08-01 2010-09-26
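For reference, the two filters above can be reproduced end to end with the sample data from the question. This is a minimal sketch: it assumes the third column is named end_date_y (as in this answer) and that all three columns are parsed with pd.to_datetime. The names mask, by_mask and by_between are only illustrative.

```python
import pandas as pd

# Sample data from the question; the end column is assumed to be 'end_date_y'.
df = pd.DataFrame({
    "date_id": ["2010-06-04"] * 7 + ["2010-09-04"],
    "start_date": ["2008-08-01", "2008-08-01", "2008-08-01", "2008-08-26",
                   "2010-05-01", "2008-08-01", "2008-08-01", "2010-08-01"],
    "end_date_y": ["2008-09-26", "2008-09-26", "2008-09-26", "2008-10-26",
                   "2010-09-26", "2008-09-26", "2008-09-26", "2010-09-26"],
})
df = df.apply(pd.to_datetime)  # convert every column to datetime64

# & binds tighter than >= and <=, so each comparison gets its own parentheses.
mask = (df["date_id"] >= df["start_date"]) & (df["date_id"] <= df["end_date_y"])
by_mask = df[mask]

# between() includes both endpoints by default, so it selects the same rows.
by_between = df[df["date_id"].between(df["start_date"], df["end_date_y"])]

print(by_mask.index.tolist())  # [4, 7]
```

Both variants keep the original index labels, which is why the output shows rows 4 and 7.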
Performance:
It depends on the number of rows and the number of matched rows, so it is best to test on your real data.
#[80000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
#print (df)
In [236]: %timeit df[df['date_id'].between(df['start_date'], df['end_date_y'])]
2.44 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [237]: %timeit df[(df['date_id'] >= df['start_date']) & (df['date_id']<= df['end_date_y'])]
2.42 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [238]: %timeit df.query("start_date <= date_id <= end_date_y")
4.45 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
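Outside IPython, a comparison like the one above could be sketched with the standard timeit module. This is only a rough harness: the frame is rebuilt from the question's sample data (end column assumed to be end_date_y, columns assumed to be datetime64), the labels in candidates are hypothetical, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Rebuild the 8-row sample frame; the end column is assumed to be 'end_date_y'.
small = pd.DataFrame({
    "date_id": ["2010-06-04"] * 7 + ["2010-09-04"],
    "start_date": ["2008-08-01", "2008-08-01", "2008-08-01", "2008-08-26",
                   "2010-05-01", "2008-08-01", "2008-08-01", "2010-08-01"],
    "end_date_y": ["2008-09-26", "2008-09-26", "2008-09-26", "2008-10-26",
                   "2010-09-26", "2008-09-26", "2008-09-26", "2010-09-26"],
}).apply(pd.to_datetime)

# Repeat it 10,000 times to get 80,000 rows, as in the timings above.
big = pd.concat([small] * 10000, ignore_index=True)

# Hypothetical labels mapped to the three filters being compared.
candidates = {
    "between": lambda: big[big["date_id"].between(big["start_date"], big["end_date_y"])],
    "mask": lambda: big[(big["date_id"] >= big["start_date"])
                        & (big["date_id"] <= big["end_date_y"])],
    "query": lambda: big.query("start_date <= date_id <= end_date_y"),
}

runs = 20
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=runs) / runs  # average seconds per call
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

Because two of every eight sample rows match, each filter here returns 20,000 rows. A real dataset with a different match ratio may rank the three approaches differently.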
Upvotes: 1
Reputation: 5126
You can use between!
df['date_id'].between(df['start_date'],df['end_date_y'])
To filter, pass that boolean mask to .loc:
df.loc[df['date_id'].between(df['start_date'],df['end_date_y'])]
date_id start_date end_date_y
4 2010-06-04 2010-05-01 2010-09-26
7 2010-09-04 2010-08-01 2010-09-26
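To make the intermediate step visible, here is a small sketch of what between returns before .loc is applied. It assumes the question's sample data with the end column named end_date_y and datetime64 columns. The column subset passed to .loc is only an illustration of the optional second argument.

```python
import pandas as pd

# Sample data from the question; the end column is assumed to be 'end_date_y'.
df = pd.DataFrame({
    "date_id": ["2010-06-04"] * 7 + ["2010-09-04"],
    "start_date": ["2008-08-01", "2008-08-01", "2008-08-01", "2008-08-26",
                   "2010-05-01", "2008-08-01", "2008-08-01", "2010-08-01"],
    "end_date_y": ["2008-09-26", "2008-09-26", "2008-09-26", "2008-10-26",
                   "2010-09-26", "2008-09-26", "2008-09-26", "2010-09-26"],
}).apply(pd.to_datetime)

# between() compares row by row and returns a boolean Series on df's index.
mask = df["date_id"].between(df["start_date"], df["end_date_y"])
print(mask.tolist())  # only positions 4 and 7 are True

# .loc accepts the mask for rows and, optionally, a list of columns to keep.
subset = df.loc[mask, ["date_id", "end_date_y"]]
print(subset)
```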
Upvotes: 2
Reputation: 13255
You can also use query:
df.query("start_date <= date_id <= end_date_y")
date_id start_date end_date_y
4 2010-06-04 2010-05-01 2010-09-26
7 2010-09-04 2010-08-01 2010-09-26
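A self-contained version of the query call might look like the sketch below. It assumes the question's sample data with the end column named end_date_y and datetime64 columns. The comparison against an explicit mask is included only as a sanity check, and expected is an illustrative name.

```python
import pandas as pd

# Sample data from the question; the end column is assumed to be 'end_date_y'.
df = pd.DataFrame({
    "date_id": ["2010-06-04"] * 7 + ["2010-09-04"],
    "start_date": ["2008-08-01", "2008-08-01", "2008-08-01", "2008-08-26",
                   "2010-05-01", "2008-08-01", "2008-08-01", "2010-08-01"],
    "end_date_y": ["2008-09-26", "2008-09-26", "2008-09-26", "2008-10-26",
                   "2010-09-26", "2008-09-26", "2008-09-26", "2010-09-26"],
}).apply(pd.to_datetime)

# Names inside the query string are resolved as column names, and the
# chained comparison reads like the usual mathematical notation.
result = df.query("start_date <= date_id <= end_date_y")

# Sanity check against the explicit boolean mask.
expected = df[(df["date_id"] >= df["start_date"]) & (df["date_id"] <= df["end_date_y"])]
print(result.equals(expected))  # True
```

Since the expression is a string, it works as written only when the column names are valid Python identifiers, which is the case here.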
Upvotes: 1