blehman
blehman

Reputation: 1970

Pandas max date by row?

The solution to the question asked here unfortunately does not solve this problem. I'm using Python 3.6.2

The Dataframe, df:

                            date1                        date2
rec0    2017-05-25 14:02:23+00:00    2017-05-25 14:34:43+00:00
rec1                          NaT    2017-05-16 19:37:43+00:00

To reproduce the problem:

import psycopg2
import pandas as pd
Timestamp = pd.Timestamp
NaT = pd.NaT

df = pd.DataFrame({'date1': [Timestamp('2017-05-25 14:02:23'), NaT],
                   'date2': [Timestamp('2017-05-25 14:34:43'), Timestamp('2017-05-16 19:37:43')]})

tz = psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)
for col in ['date1', 'date2']:
    df[col] = pd.DatetimeIndex(df[col]).tz_localize(tz)
print(df.max(axis=1))

Both of the above columns have been converted using pd.to_datetime() to get the following column type: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)]

Running df.max(axis=1) doesn't give an error but certainly provides the incorrect solution.

Output (incorrect):

rec0   NaN
rec1   NaN
dtype: float64

The fix that I have in place is to apply a custom function to the df as written below:

def get_max(x):
    test = x.dropna()
    return max(test)
df.apply(get_max,axis=1)

Output (correct):

rec0   2017-05-25 14:34:43+00:00
rec1   2017-05-16 19:37:43+00:00
dtype: datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)]

Maybe df.max() doesn't deal with date objects but only looks for floats (docs). Any idea why df.max(axis=1) only returns NaN?

Upvotes: 2

Views: 2752

Answers (2)

AndyP1970
AndyP1970

Reputation: 205

Using Pandas 1.0.5 with Python 3.8 I was still getting a series of Nans. Solved the issue by converting both columns to datetime and then adding skipna=True and numeric_only=False to the max() function:

df['1'] = pd.to_datetime(df['1'], utc=True)
df['2'] = pd.to_datetime(df['2'], utc=True) 
df['3'] = df[['1', '2']].max(axis=1, skipna=True, numeric_only=False)

Upvotes: 0

FabienP
FabienP

Reputation: 3138

After some testing, it looks like there is something wrong with pandas and psycopg2.tz.FixedOffsetTimezone.

If you try df.max(axis=0) it will work as expected, but as you indicate df.max(axis=1) will return a series of NaN. If you do not use psycopg2.tz.FixedOffsetTimezone as tz, df.max(axis=1) will return the expected result.

Other manipulations will fail in this case, such as df.transpose.

Note that if you try df.values.max(axis=1), you will get the expected result. So numpy.array seems to be able to deal with this. You should search in pandas Github issues (like this one) and maybe consider opening a new one if you can't find a fix.

Another solution would be to drop psycopg2.tz.FixedOffsetTimezone, but you may have some reason to use this specifically.

Upvotes: 1

Related Questions