Reputation: 522
I want to filter my data frame based on time
column using query function and get rows with time
in last N
month(or any time periods).
Basically I need something like:
df = df.query("time.isin(timeperiod('5M'))")
or
df = df.query("now() >= time >= now() - timeperiod('5M')")
I'm trying to create query patterns for my filter functions and It's important to use query() function.
Note that timeperiod()
and now()
functions are not available and I used them for better understanding of the solution I want.I mean please define these functions in your solution.
Sample:
df = pd.DataFrame()
df['time1'] = pd.date_range('1/1/2019', periods=1000, freq='D')
df['time2'] = pd.date_range('1/1/2018', periods=1000, freq='D')
Sample data can have some other columns.
I need to filter data to get records within last N
month for each of given columns using query()
function. For example I want records with time1
in last 2 months and time2
in last 3 months. Like getting records of people who were hired during last 3 month and were let go during last 2 months.
UPDATE 1:
thanks to @jezrael I could get now()
function in query()
like this:
df = df.query("@pd.Timestamp('now') >= time")
of course Timestamp
could be replaced by 'to_datetime` or something else.
But the problem is when I try to get a timestamp for the start of period like this:
df = df.query("time >= (@pd.Timestamp('now') + @pd.to_timedelta('30day'))")
I get:
Cannot convert input ... of type <class 'pandas._libs.tslibs.timedeltas.Timedelta'> to Timestamp
Again using DateOffset or other time-delta functions result with the similar error.
But using variables to set target date (as @jezrael mentioned) works.
I will use variables as I don't see any other way right now but I will keep this Question open to see if there is any better way.
Upvotes: 0
Views: 6560
Reputation: 862511
Use DateOffset
for last N
months, both conditions are chained by |
for bitwise OR
:
now = pd.to_datetime('now')
last2 = now - pd.DateOffset(months=2)
last3 = now - pd.DateOffset(months=3)
df = df.query("(@now >= time1 >= @last2) | (@now >= time2 >= @last3)")
print(df)
time1 time2
315 2019-11-12 2018-11-12
316 2019-11-13 2018-11-13
317 2019-11-14 2018-11-14
318 2019-11-15 2018-11-15
319 2019-11-16 2018-11-16
.. ... ...
736 2021-01-06 2020-01-07
737 2021-01-07 2020-01-08
738 2021-01-08 2020-01-09
739 2021-01-09 2020-01-10
740 2021-01-10 2020-01-11
[153 rows x 2 columns]
Another solution without query
with boolean indexing
:
df = df[(df.time1 <= pd.to_datetime('now')) &
(df.time1 >= pd.to_datetime('now')- pd.DateOffset(months=2))]
Upvotes: 7