Syed Mohammad Hosseini
Syed Mohammad Hosseini

Reputation: 522

Pandas - filter data that are in last n month using query

I want to filter my data frame based on time column using query function and get rows with time in last N month(or any time periods).

Basically I need something like:

df = df.query("time.isin(timeperiod('5M'))")

or

df = df.query("now() >= time >= now() - timeperiod('5M')")

I'm trying to create query patterns for my filter functions and It's important to use query() function.

Note that timeperiod() and now() functions are not available and I used them for better understanding of the solution I want.I mean please define these functions in your solution.

Sample:

df = pd.DataFrame()
df['time1'] = pd.date_range('1/1/2019', periods=1000, freq='D')
df['time2'] = pd.date_range('1/1/2018', periods=1000, freq='D')

Sample data can have some other columns.

I need to filter data to get records within last N month for each of given columns using query() function. For example I want records with time1 in last 2 months and time2 in last 3 months. Like getting records of people who were hired during last 3 month and were let go during last 2 months.

UPDATE 1:

thanks to @jezrael I could get now() function in query() like this:

df = df.query("@pd.Timestamp('now') >= time")

of course Timestamp could be replaced by 'to_datetime` or something else.

But the problem is when I try to get a timestamp for the start of period like this:

df = df.query("time >= (@pd.Timestamp('now') + @pd.to_timedelta('30day'))")

I get: Cannot convert input ... of type <class 'pandas._libs.tslibs.timedeltas.Timedelta'> to Timestamp

Again using DateOffset or other time-delta functions result with the similar error.

But using variables to set target date (as @jezrael mentioned) works.

I will use variables as I don't see any other way right now but I will keep this Question open to see if there is any better way.

Upvotes: 0

Views: 6560

Answers (1)

jezrael
jezrael

Reputation: 862511

Use DateOffset for last N months, both conditions are chained by | for bitwise OR:

now = pd.to_datetime('now')
last2 = now - pd.DateOffset(months=2)
last3 = now - pd.DateOffset(months=3)

df = df.query("(@now >= time1 >= @last2) | (@now >= time2 >= @last3)")
print(df)
         time1      time2
315 2019-11-12 2018-11-12
316 2019-11-13 2018-11-13
317 2019-11-14 2018-11-14
318 2019-11-15 2018-11-15
319 2019-11-16 2018-11-16
..         ...        ...
736 2021-01-06 2020-01-07
737 2021-01-07 2020-01-08
738 2021-01-08 2020-01-09
739 2021-01-09 2020-01-10
740 2021-01-10 2020-01-11

[153 rows x 2 columns]

Another solution without query with boolean indexing:

df = df[(df.time1 <= pd.to_datetime('now')) & 
        (df.time1 >= pd.to_datetime('now')- pd.DateOffset(months=2))]

Upvotes: 7

Related Questions