Reputation: 11765

Transform pandas Timestamp to end of current month

This is a follow-up to this question, which was asked a few years ago. The output I'm getting makes me think date offsets in pandas have changed during that time.

I have dates, and I want to move them to the end of the month that they're in. For example:

df = pd.DataFrame([pd.Timestamp('2014-01-15'), pd.Timestamp('2014-01-31')],
                    columns=['orig'])

I want to transform both of these to 2014-01-31. I'm wondering what the idiomatic pandas way to do this is.

The accepted answer in the other question offers two solutions:

1) pd.Index(df.orig).to_period('M').to_timestamp('M')

2) pd.Index(df.orig) + pd.offsets.MonthEnd(0)

However, in pandas 0.17.1 these do not both give the same answer. The first approach works, but the second does something different:

df[0] = df['orig'] + pd.offsets.MonthEnd(0)
df[1] = df['orig'] + pd.offsets.MonthEnd(1)

    orig        0           1
0   2014-01-15  2013-12-31  2014-01-31
1   2014-01-31  2013-12-31  2014-02-28

So MonthEnd(0) moves all dates to the end of the previous month, while MonthEnd(1) moves dates to the end of the current month unless the date is already the last day of the month, in which case it moves it to the end of the next month. This seems quite strange.

So, is (1) really the best way to do this? I'm surprised that there's no simple way to accomplish the same thing using offsets.

Upvotes: 4

Answers (2)

Stefan

Reputation: 42905

It does work using .rollforward() - but there are better alternatives as laid out and timed by @unutbu:

import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame([pd.Timestamp('2014-01-15'), pd.Timestamp('2014-01-31')], columns=['orig'])
df['month_end'] = df.orig.apply(lambda x: MonthEnd().rollforward(x))

        orig  month_end
0 2014-01-15 2014-01-31
1 2014-01-31 2014-01-31

This works because rollforward checks whether the date is already sitting on an offset:

def rollforward(self, dt):
    """Roll provided date forward to next offset only if not on offset"""
    dt = as_timestamp(dt)
    if not self.onOffset(dt):
        dt = dt + self.__class__(1, normalize=self.normalize, **self.kwds)
    return dt
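As a quick check of that behaviour on scalar timestamps, here is a minimal sketch (the variable names are illustrative and not from the original answer):

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

mid_month = pd.Timestamp('2014-01-15')
month_end = pd.Timestamp('2014-01-31')

# rollforward only moves a date that is not already on a month end
rolled_mid = MonthEnd().rollforward(mid_month)
rolled_end = MonthEnd().rollforward(month_end)

# plain addition of MonthEnd(1) always advances, even from a month end
added_end = month_end + MonthEnd(1)

print(rolled_mid, rolled_end, added_end)
```

The date already on a month end is left alone by rollforward, whereas adding MonthEnd(1) to it jumps to the end of the following month.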

Upvotes: 2

unutbu

Reputation: 880877

Here are a few alternatives:

import numpy as np
import pandas as pd
import pandas.tseries.offsets as offsets

ONE_MONTH = np.array([1], dtype='timedelta64[M]')
ONE_DAY = np.array([1], dtype='timedelta64[D]')

df = pd.DataFrame(pd.to_datetime(['2014-01-15', '2014-01-31', '2014-02-01']),
                    columns=['orig'])

df['using_datetime64'] = df['orig'].values.astype('datetime64[M]') + ONE_MONTH - ONE_DAY
df['using_to_period'] = pd.Index(df['orig']).to_period('M').to_timestamp('M')
df['using_dateoffset'] = df['orig'] + offsets.DateOffset(day=31)
df['using_rollforward'] = df['orig'].apply(lambda x: offsets.MonthEnd().rollforward(x))

Of these,

df['orig'].values.astype('datetime64[M]') + ONE_MONTH - ONE_DAY

is the fastest:

In [108]: df = pd.DataFrame({'orig': np.arange(10000).astype('<i8').view('<datetime64[D]')})

In [109]: %timeit df['using_datetime64'] = df['orig'].values.astype('datetime64[M]') + ONE_MONTH - ONE_DAY
1000 loops, best of 3: 913 µs per loop

In [110]: %timeit df['using_to_period'] = pd.Index(df['orig']).to_period('M').to_timestamp('M')
1000 loops, best of 3: 1.95 ms per loop

In [111]: %timeit df['using_dateoffset'] = df['orig'] + offsets.DateOffset(day=31)
1 loops, best of 3: 240 ms per loop

In [112]: %timeit df['using_rollforward'] = df['orig'].apply(lambda x: offsets.MonthEnd().rollforward(x))
1 loops, best of 3: 813 ms per loop
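To see why the datetime64 approach lands on the month end, it may help to split it into steps. This is only a sketch in plain NumPy; the intermediate names are illustrative, and the explicit cast back to days is added for readability rather than taken from the one-liner above:

```python
import numpy as np

dates = np.array(['2014-01-15', '2014-01-31', '2014-02-01'],
                 dtype='datetime64[D]')

# Step 1: casting to month resolution truncates each date to its month
month_start = dates.astype('datetime64[M]')

# Step 2: advance one calendar month (still at month resolution)
next_month = month_start + np.timedelta64(1, 'M')

# Step 3: cast back to days (first day of next month) and step back one day
month_end = next_month.astype('datetime64[D]') - np.timedelta64(1, 'D')

print(month_end)
```

Because the arithmetic runs on whole arrays without per-element Python calls, it is plausible that this is why it times faster than the apply-based version.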

Note that using_datetime64, using_to_period, and using_dateoffset work even if df['orig'] contains NaT values. using_rollforward raises ValueError: cannot convert float NaN to integer.
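For what it's worth, here is a hedged sketch of checking NaT handling with a vectorized offset. It assumes a more recent pandas release than the 0.17.1 used in the question, where adding MonthEnd(0) to a Series rolls forward to the month end (which is not what the question observed); the Series `s` is made up for illustration:

```python
import pandas as pd

# Hypothetical input with a missing value in the middle
s = pd.Series(pd.to_datetime(['2014-01-15', None, '2014-02-01']))

# Assumption: recent pandas, where MonthEnd(0) rolls forward to month end
month_end = s + pd.offsets.MonthEnd(0)

print(month_end)
```

If your installed version behaves like the output shown in the question instead, the to_period or datetime64 alternatives above remain the safer choice.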

Upvotes: 4
