jake wong

Reputation: 5228

pandas filtering and comparing dates

I have SQL data (shown below) that I read into pandas:

df = pandas.read_sql('Database count details', con=engine,
                     index_col='id', parse_dates=['newest_date_available'])

Output

id       code   newest_date_available
9793708  3514   2015-12-24
9792282  2399   2015-12-25
9797602  7452   2015-12-25
9804367  9736   2016-01-20
9804438  9870   2016-01-20

The next line gets the date one week ago:

date_before = datetime.date.today() - datetime.timedelta(days=7) # Which is 2016-01-20

What I am trying to do is compare date_before with df and print all rows whose date is earlier than date_before:

if (df['newest_date_available'] < date_before):
    print(#all rows)

Obviously this raises an error:

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How should I do this?

Upvotes: 42

Views: 202731

Answers (5)

Sairam Krish

Reputation: 11691

If the dataframe column has a timezone-aware dtype such as datetime64[ns, UTC], it must be compared with timezone-aware data. Otherwise the comparison fails with an error like:

TypeError: Invalid comparison between dtype=datetime64[ns, UTC] and Timestamp

We could resolve this by making both sides timezone-naive, but that is not a good idea: we could lose the correctness of the result.

The best fix is to make both sides of the comparison timezone-aware.

import pandas as pd

# Sample data
data = {
    'updated_at': pd.date_range(start='2019-12-31', periods=4, freq='D', tz='UTC'),
    'value': [10, 20, 30, 40]
}
df = pd.DataFrame(data)

# Make sure updated_at is datetime64[ns, UTC]; utc=True handles both naive and tz-aware input
df['updated_at'] = pd.to_datetime(df['updated_at'], utc=True)

# Define the incremental cursor and convert to timezone-aware datetime
incremental_cursor = '2020-01-01'
incremental_cursor_as_datetime = pd.to_datetime(incremental_cursor).tz_localize('UTC')

# Filter the DataFrame
filtered_df = df[df['updated_at'] > incremental_cursor_as_datetime]

print(filtered_df)
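
When it is not known up front whether a column is naive or aware, the same idea can be wrapped in a small helper. This is only a sketch: `to_utc` is a hypothetical name (not part of pandas), and it assumes that naive values already represent UTC times.

```python
from datetime import timedelta, timezone

import pandas as pd


def to_utc(series: pd.Series) -> pd.Series:
    """Return a datetime Series as tz-aware UTC (hypothetical helper)."""
    if series.dt.tz is None:
        # Naive values: assumed to already be UTC wall-clock times.
        return series.dt.tz_localize("UTC")
    # Aware values: convert from their current zone to UTC.
    return series.dt.tz_convert("UTC")


# One naive column and one aware column with a fixed +01:00 offset.
naive = pd.Series(pd.date_range("2019-12-31", periods=4, freq="D"))
aware = pd.Series(
    pd.date_range("2019-12-31", periods=4, freq="D",
                  tz=timezone(timedelta(hours=1)))
)

# Timezone-aware cursor, so both sides of the comparison are aware.
cursor = pd.Timestamp("2020-01-01", tz="UTC")

naive_utc = to_utc(naive)
aware_utc = to_utc(aware)
naive_hits = naive_utc[naive_utc > cursor]
aware_hits = aware_utc[aware_utc > cursor]
print(naive_hits)
print(aware_hits)
```

Note that midnight at +01:00 on 2020-01-01 is 23:00 UTC on 2019-12-31, so that row is not after the cursor once both sides are expressed in UTC.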

Upvotes: 0

cardamom

Reputation: 7421

If you get this error while filtering for date:

TypeError: Invalid comparison between dtype=datetime64[ns] and date

When trying this answer:

date_before = datetime.date(2016, 1, 19)
df[df['newest_date_available'] < date_before]

...it may be that your date column has dtype datetime64[ns], which pandas refuses to compare with a plain datetime.date. Convert the date to np.datetime64 first:

import datetime
import numpy as np

date_before = np.datetime64(datetime.date(2016, 1, 19))
df[df['newest_date_available'] < date_before]
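
As a quick check, the sketch below builds a small datetime64[ns] Series (the values are taken from the question's sample output) and shows that converting the date with np.datetime64 or with pd.Timestamp gives the same mask. The variable names are illustrative only.

```python
import datetime

import numpy as np
import pandas as pd

# A datetime64[ns] column, like the one in the question.
dates = pd.Series(pd.to_datetime(["2015-12-24", "2015-12-25", "2016-01-20"]))
cutoff_date = datetime.date(2016, 1, 19)

# Convert the plain date before comparing; both conversions are accepted.
mask_np = dates < np.datetime64(cutoff_date)
mask_ts = dates < pd.Timestamp(cutoff_date)

print(dates[mask_np])
```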

Upvotes: 0

onlyphantom

Reputation: 9583

Using datetime.date(2019, 1, 10) works because pandas coerces the date to a datetime under the hood.

This, however, will no longer be the case in future versions of pandas.

From version 0.24 onward, it issues a warning:

FutureWarning: Comparing Series of datetimes with 'datetime.date'. Currently, the 'datetime.date' is coerced to a datetime. In the future pandas will not coerce, and a TypeError will be raised.

The better solution is to use pd.Timestamp, which the official documentation describes as pandas' replacement for Python's datetime.datetime object.

To provide an example referencing OP's initial dataset, this is how you would use it:

import pandas as pd

# Boolean mask: True where the date is earlier than the cutoff
cond1 = df.newest_date_available < pd.Timestamp(2016, 1, 10)
df.loc[cond1]
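
The same approach can produce the "one week ago" cutoff the question asks for. This is a minimal sketch, assuming the sample rows from the question are rebuilt by hand; `older_rows` is just an illustrative name.

```python
import pandas as pd

# Sample rows copied from the question's output.
df = pd.DataFrame({
    "id": [9793708, 9792282, 9797602, 9804367, 9804438],
    "code": [3514, 2399, 7452, 9736, 9870],
    "newest_date_available": pd.to_datetime(
        ["2015-12-24", "2015-12-25", "2015-12-25", "2016-01-20", "2016-01-20"]
    ),
})

# Midnight today minus seven days, as a pandas Timestamp (not datetime.date).
date_before = pd.Timestamp.today().normalize() - pd.Timedelta(days=7)

cond = df["newest_date_available"] < date_before
older_rows = df.loc[cond]
print(older_rows)
```

Because date_before is already a Timestamp, no coercion from datetime.date is involved and the FutureWarning does not apply.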

Upvotes: 30

rachwa

Reputation: 2300

A bit late to the party but I think it is worth mentioning. If you are looking for a solution which dynamically considers the date a week ago, this might be helpful:

In [3]: df = pd.DataFrame({'alpha': list('ABCDE'), 'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')})

In [4]: df
Out[4]: 
  alpha  num       date
0     A    0 2022-06-30
1     B    1 2022-07-01
2     C    2 2022-07-02
3     D    3 2022-07-03
4     E    4 2022-07-04

In [5]: df.query('date < "%s"' % (pd.Timestamp.now().normalize() - pd.Timedelta(7, 'd')))
Out[5]: 
  alpha  num       date
0     A    0 2022-06-30
1     B    1 2022-07-01

Explanation:
I created a new df with newer dates. Today is 2022-07-09 (pd.Timestamp.now().normalize()) and seven days ago it was 2022-07-02 (pd.Timestamp.now().normalize() - pd.Timedelta(7, 'd')). The string formatting operator % inserts that cutoff into the query string, so query() returns only the rows whose date is earlier than 2022-07-02.
normalize() is important here because it resets the time to midnight. Without it, the cutoff keeps the current time of day (for example 2022-07-02 17:53:03), and query() would also return the row dated 2022-07-02, because:

# Timestamp('2022-07-09 17:53:03.078172') > Timestamp('2022-07-09 00:00:00')
In [6]: pd.Timestamp.now() > pd.Timestamp.now().normalize()
Out[6]: True
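
As an alternative to building the query string with %, query() can read a local variable through the @ prefix. The sketch below pins "today" to a fixed timestamp so the result is reproducible; in real use you would substitute pd.Timestamp.now(). The names `today`, `cutoff` and `older` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "alpha": list("ABCDE"),
    "num": range(5),
    "date": pd.date_range("2022-06-30", "2022-07-04"),
})

# Fixed stand-in for pd.Timestamp.now(), matching the date used above.
today = pd.Timestamp("2022-07-09 17:53:03")

# normalize() resets the time to midnight before subtracting a week.
cutoff = today.normalize() - pd.Timedelta(days=7)

# '@cutoff' refers to the local Python variable, so no string formatting is needed.
older = df.query("date < @cutoff")
print(older)
```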

Upvotes: 4

Fabio Lamanna

Reputation: 21552

I would use a boolean mask:

a = df[df['newest_date_available'] < date_before]

If date_before = datetime.date(2016, 1, 19), this returns:

        id  code newest_date_available
0  9793708  3514            2015-12-24
1  9792282  2399            2015-12-25
2  9797602  7452            2015-12-25
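
For a self-contained version, the sketch below rebuilds the question's rows by hand instead of reading them from SQL, and wraps the cutoff in pd.Timestamp, which is an assumption made here to avoid comparing a datetime column with a plain datetime.date on newer pandas versions.

```python
import datetime

import pandas as pd

# Sample rows copied from the question's output.
df = pd.DataFrame({
    "id": [9793708, 9792282, 9797602, 9804367, 9804438],
    "code": [3514, 2399, 7452, 9736, 9870],
    "newest_date_available": pd.to_datetime(
        ["2015-12-24", "2015-12-25", "2015-12-25", "2016-01-20", "2016-01-20"]
    ),
})

# Fixed cutoff for reproducibility, converted to a pandas Timestamp.
date_before = pd.Timestamp(datetime.date(2016, 1, 19))

# The comparison yields a boolean Series; indexing with it keeps the True rows.
mask = df["newest_date_available"] < date_before
a = df[mask]
print(a)
```

The original `if` statement fails because the comparison produces one boolean per row rather than a single True or False; indexing with the mask uses all of those booleans at once.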

Upvotes: 53
