AMM

Reputation: 17930

Filtering Pandas DataFrames on dates

I have a Pandas DataFrame with a 'date' column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.

What is the best way to achieve this?

Upvotes: 356

Views: 975449

Answers (18)

Sanjay Manohar

Reputation: 7026

Unfortunately, navigating datetimes in Python is poorly explained and often boils down to trial and error. I really wanted to use string comparison like the previous answers, but my input data was a CSV where some rows included a time zone. So I had used

df.date = pd.to_datetime(df.date, format='mixed') 

However, I could not use string indexing or comparison:

df.query('date > "20241023"')

TypeError: '>' not supported between instances of 'datetime.datetime' and 'str'

It turns out that this confusingly worded error was actually because the column had

df.date.dtype:     dtype('O')

despite all the elements being datetimes. So I tried converting the column by force:

df.date = df.date.astype('datetime64[ns]')

Now I got the error

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True, at position 3

So next I tried:

df.date = pd.to_datetime(df.date, format='mixed', utc=True)

Now when I tried

df.date.astype('datetime64[ns]')

I got

Cannot use .astype to convert from timezone-aware dtype to timezone-naive dtype. 

So I tried to localize them:

df.date.apply(lambda x: pd.to_datetime(x).tz_localize('GMT'))

But this led to the error:

TypeError: Cannot localize tz-aware Timestamp, use tz_convert for conversions

So finally I tried removing the time zones by doing

df.date=df.date.dt.tz_convert(None)

And now, at last, this worked!

df.date = pd.to_datetime(df.date, format='mixed', utc=True)
df.date = df.date.dt.tz_convert(None)   # drop the time zone so a plain string comparison works
df.query('date > "20241023"')

Upvotes: 0

Metin AKTAŞ

Reputation: 66

Or use a helper function driven by a specified current time:

import pandas as pd

# set current_time
current_time = pd.to_datetime("2024/09/08")

# helper that works for either a datetime index or a datetime column
def get_last_two_months(df: pd.DataFrame, current_time, time_col: str = None, has_time_index: bool = False) -> pd.DataFrame:

    # build the set of dates covering the next two months (~60 days) from current_time
    last_time = current_time + pd.Timedelta(60, "d")
    next_two_months = pd.date_range(current_time, last_time, freq="d").date

    # if the dataset has a datetime index
    if has_time_index:
        df.index = df.index.astype("datetime64[ns]").date
        return df.loc[df.index.isin(next_two_months)]

    # if the dataset has a datetime column
    elif time_col:
        df[time_col] = df[time_col].astype("datetime64[ns]").dt.date
        return df[df[time_col].isin(next_two_months)]
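
A quick usage sketch with made-up data (the DataFrame, dates, and column name below are just for illustration):

import pandas as pd

# hypothetical data spanning ~90 days
df = pd.DataFrame({
    "date": pd.date_range("2024-09-01", periods=90, freq="D"),
    "value": range(90),
})

current_time = pd.to_datetime("2024/09/08")
result = get_last_two_months(df, current_time, time_col="date")
print(result["date"].min(), result["date"].max())
# expected range: 2024-09-08 through 2024-11-07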

Upvotes: 0

Martin Maati

Reputation: 171

import pandas as pd

STEP 1: convert the date column into a pandas datetime using pd.to_datetime()

df['date'] = pd.to_datetime(df['date'], unit='s')  # unit='s' assumes the column holds Unix timestamps in seconds

STEP 2: perform the filtering over any predetermined range (e.g. 2 months)

df = df[(df['date'] > "2022-03-01") & (df['date'] < "2022-05-03")]

Upvotes: 8

fantabolous

Reputation: 22706

If the dates are in the index then simply:

df['20160101':'20160301']

Edit: while short, this style is now deprecated (at least as of pandas 1.5.3) and the recommended style is df.loc['20160101':'20160301'] (as in other answers)

Upvotes: 26

DSBLR

Reputation: 623

import datetime

import pandas as pd

# 60 days from today
after_60d = pd.to_datetime('today').date() + datetime.timedelta(days=60)
# keep rows whose date_col falls before that cutoff
df[df['date_col'] < after_60d]
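
To also drop past dates (the question asks for the next two months only), a small extension of the same idea, assuming 'date_col' is a datetime column:

today = pd.to_datetime('today').date()
after_60d = today + datetime.timedelta(days=60)
# keep only rows between today and 60 days out
df[(df['date_col'] >= today) & (df['date_col'] < after_60d)]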

Upvotes: 2

Jerin Mathew

Reputation: 760

If you have already converted the strings to datetimes using pd.to_datetime, you can just use:

df = df[(df['Date'] > "2018-01-01") & (df['Date'] < "2019-07-01")]
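
For the "next two months" window from the question, a hedged variant that computes the bounds relative to today instead of hard-coding them:

import pandas as pd

today = pd.Timestamp.today().normalize()          # midnight today
in_two_months = today + pd.DateOffset(months=2)   # upper bound of the window

df = df[(df['Date'] >= today) & (df['Date'] < in_two_months)]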

Upvotes: 60

Paul

Reputation: 7335

In pandas version 1.1.3 I encountered a situation where the python datetime based index was in descending order. In this case

df.loc['2021-08-01':'2021-08-31']

returned empty. Whereas

df.loc['2021-08-31':'2021-08-01']

returned the expected data.

Upvotes: 4

Marcel Flygare

Reputation: 887

Another solution if you would like to use the .query() method.

It lets you write readable code like .query(f"{start} < MyDate < {end}"), with the trade-off that .query() parses strings, so the column values must already be in a pandas datetime format (so that .query() can understand them).

import datetime

import pandas as pd

df = pd.DataFrame({
     'MyValue': [1, 2, 3],
     'MyDate': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'])
})
start = datetime.date(2021, 1, 1).strftime('%Y%m%d')
end = datetime.date(2021, 1, 3).strftime('%Y%m%d')
df.query(f"{start} < MyDate < {end}")

(following the comment from @Phillip Cloud, answer from @Retozi)

Upvotes: 4

Ekrem Gurdal

Reputation: 1258

The shortest way to filter your DataFrame by date. Let's suppose your date column is of type datetime64[ns]:

# filter by single day
df_filtered = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']

# filter by single month
df_filtered = df[df['date'].dt.strftime('%Y-%m') == '2014-01']

# filter by single year
df_filtered = df[df['date'].dt.strftime('%Y') == '2014']
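
An equivalent sketch using the .dt accessors directly, which avoids formatting every value to a string first:

# filter by a single month (January 2014) without strftime
df_filtered = df[(df['date'].dt.year == 2014) & (df['date'].dt.month == 1)]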

Upvotes: 39

Ernesto Lopez Fune

Reputation: 583

You could just select the time range by doing df.loc['start_date':'end_date'] (this assumes the dates are the index).
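
For example, assuming a DatetimeIndex (the dates below are illustrative):

df.loc['2016-01-01':'2016-02-29']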

Upvotes: 5

uhetz

Reputation: 412

I'm not allowed to write comments yet, so I'll write an answer, in case somebody reads all of them and reaches this one.

If the index of the dataset is a datetime and you want to filter just by (for example) month, you can do the following:

df.loc[df.index.month == 3]

That will keep only the rows from March (of any year).
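
To pin it to a specific year as well, a small extension sketch:

# only March 2021
df.loc[(df.index.month == 3) & (df.index.year == 2021)]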

Upvotes: 13

danielhrisca

Reputation: 730

You can use pd.Timestamp inside a query via a local reference:

from datetime import datetime

import pandas as pd
import numpy as np

df = pd.DataFrame()
ts = pd.Timestamp

df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')

print(df)
print(df.query('date > @ts("20190515T071320")'))

with the output

                 date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25


                 date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25

Have a look at the pandas documentation for DataFrame.query, specifically the section about local variables referenced using the @ prefix. In this case we reference pd.Timestamp via the local alias ts so that we can supply the timestamp as a string.
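
An equivalent sketch that builds the Timestamp first and passes it through @ (the cutoff value here is illustrative):

cutoff = pd.Timestamp("2019-05-15 07:13:20")
print(df.query('date > @cutoff'))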

Upvotes: 20

Harry

Reputation: 1277

When loading the CSV data file, we need to set the date column as the index, as below, in order to filter the data on a range of dates. This was not needed with the now-deprecated pd.DataFrame.from_csv(), which did this by default.

If you just want to show the data for two months from Jan to Feb, e.g. 2020-01-01 to 2020-02-29, you can do so:

import pandas as pd
mydata = pd.read_csv('mydata.csv', index_col='date', parse_dates=['date'])  # or use the column's position, e.g. index_col=[0]
mydata.loc['2020-01-01':'2020-02-29']  # pulls all the columns
# if you just need one column, e.g. Cost:
mydata.loc['2020-01-01':'2020-02-29', 'Cost']

This has been tested working for Python 3.7. Hope you will find this useful.

Upvotes: 11

pakira79

Reputation: 131

How about using pyjanitor? It has some handy helpers.

After pip install pyjanitor:

import janitor

df_filtered = df.filter_date("your_date_column_name", start_date, end_date)

Upvotes: 4

Retozi

Reputation: 7891

If the date column is the index, then use .loc for label-based indexing or .iloc for positional indexing.

For example:

df.loc['2014-01-01':'2014-02-01']

See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection

If the column is not the index you have two choices:

  1. Make it the index (either temporarily, or permanently if it's time-series data); see the sketch at the end of this answer
  2. df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]

See here for the general explanation

Note: .ix is deprecated.
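
A minimal sketch of option 1, making the date column the index temporarily (the column name 'date' is assumed):

# set 'date' as the index just for the slice, then restore it as a column
filtered = (
    df.set_index('date')
      .loc['2013-01-01':'2013-02-01']
      .reset_index()
)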

Upvotes: 485

orange1

Reputation: 2939

The previous answer was not correct in my experience: you can't pass it a plain string; it needs to be a datetime object. So:

import datetime 
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]

Upvotes: 91

VMAtm

Reputation: 28356

If your datetime column has the pandas datetime type (e.g. datetime64[ns]), then for proper filtering you can compare it against a pd.Timestamp object, for example:

from datetime import date

import pandas as pd

value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]

Upvotes: 33

shm2008

Reputation: 1473

If your dates are already datetime objects (from the datetime package), you can simply use:

df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]  

To parse a date string into a datetime object, you can use datetime.datetime.strptime:

import datetime
# e.g. parse '2016-01-01' into a date object for the comparison above
datetime.datetime.strptime('2016-01-01', '%Y-%m-%d').date()

Upvotes: 74
