Reputation: 11765
This is a follow-up to this question, which was asked a few years ago. The output I'm getting makes me think that date offsets in pandas have changed since then.
I have dates, and I want to move them to the end of the month that they're in. For example:
df = pd.DataFrame([pd.Timestamp('2014-01-15'), pd.Timestamp('2014-01-31')],
columns=['orig'])
I want to transform both of these to 2014-01-31. What is the idiomatic pandas way to do this?
The accepted answer in the other question offers two solutions:
1) pd.Index(df.orig).to_period('M').to_timestamp('M')
2) pd.Index(df.orig) + pd.offsets.MonthEnd(0)
However, in pandas 0.17.1 these do not both give the same answer. The first approach works, but the second does something different:
df[0] = df['orig'] + pd.offsets.MonthEnd(0)
df[1] = df['orig'] + pd.offsets.MonthEnd(1)
        orig          0          1
0 2014-01-15 2013-12-31 2014-01-31
1 2014-01-31 2013-12-31 2014-02-28
So MonthEnd(0) moves all dates to the end of the previous month, while MonthEnd(1) moves dates to the end of the current month, except when the date is already the last day of the month, in which case it moves it to the end of the next month. This seems quite strange.
So, is (1) really the best way to do this? I'm surprised that there's no simple way to accomplish the same thing using offsets.
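For readers on a newer pandas than 0.17.1, a quick way to see what the installed version does is to apply both offsets to the same two dates. This is only a sketch, assuming a recent pandas release where n=0 is documented to roll forward to the anchor and leave dates that are already on it unchanged; the column names n0 and n1 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'orig': pd.to_datetime(['2014-01-15', '2014-01-31'])})

# MonthEnd(0): roll forward to the month end; month-end dates stay put
# (assumes the documented n=0 behaviour of recent pandas releases).
df['n0'] = df['orig'] + pd.offsets.MonthEnd(0)

# MonthEnd(1): advance to the next month end, so a date that is already
# a month end jumps to the end of the following month.
df['n1'] = df['orig'] + pd.offsets.MonthEnd(1)

print(df)
```

If the n0 column shows 2014-01-31 in both rows, the offset-based approach (2) behaves the same as the to_period approach (1) on that installation.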
Upvotes: 4
Views: 2786
Reputation: 42905
It does work using .rollforward(), but there are better alternatives, as laid out and timed by @unutbu:
import pandas as pd
from pandas.tseries.offsets import MonthEnd
df = pd.DataFrame([pd.Timestamp('2014-01-15'), pd.Timestamp('2014-01-31')], columns=['orig'])
# rollforward only moves dates that are not already on a month end
df['month_end'] = df.orig.apply(lambda x: MonthEnd().rollforward(x))
        orig  month_end
0 2014-01-15 2014-01-31
1 2014-01-31 2014-01-31
This works because the function is smart enough to check whether the date is already sitting on an offset:
def rollforward(self, dt):
    """Roll provided date forward to next offset only if not on offset"""
    dt = as_timestamp(dt)
    if not self.onOffset(dt):
        dt = dt + self.__class__(1, normalize=self.normalize, **self.kwds)
    return dt
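The difference between rolling forward and plainly adding the offset is easiest to see on scalar timestamps. A minimal sketch (the variable names are illustrative only):

```python
import pandas as pd

month_end = pd.offsets.MonthEnd()

mid_month = pd.Timestamp('2014-01-15')
on_offset = pd.Timestamp('2014-01-31')

# Not on a month end: rolled forward to the end of the month.
rolled_mid = month_end.rollforward(mid_month)

# Already on a month end: returned unchanged.
rolled_end = month_end.rollforward(on_offset)

# Plain addition always moves, even from a month end.
added_end = on_offset + month_end

print(rolled_mid, rolled_end, added_end)
```

Only the plain addition pushes 2014-01-31 into February, which is the behaviour the question found surprising for MonthEnd(1).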
Upvotes: 2
Reputation: 880877
Here are a few alternatives:
import numpy as np
import pandas as pd
import pandas.tseries.offsets as offsets
ONE_MONTH = np.array([1], dtype='timedelta64[M]')
ONE_DAY = np.array([1], dtype='timedelta64[D]')
df = pd.DataFrame(pd.to_datetime(['2014-01-15', '2014-01-31', '2014-02-01']),
columns=['orig'])
df['using_datetime64'] = df['orig'].values.astype('datetime64[M]') + ONE_MONTH - ONE_DAY
df['using_to_period'] = pd.Index(df['orig']).to_period('M').to_timestamp('M')
df['using_dateoffset'] = df['orig'] + offsets.DateOffset(day=31)
df['using_rollforward'] = df['orig'].apply(lambda x: offsets.MonthEnd().rollforward(x))
Of these,
df['orig'].values.astype('datetime64[M]') + ONE_MONTH - ONE_DAY
is the fastest:
In [108]: df = pd.DataFrame({'orig': np.arange(10000).astype('<i8').view('<datetime64[D]')})
In [109]: %timeit df['using_datetime64'] = df['orig'].values.astype('datetime64[M]') + ONE_MONTH - ONE_DAY
1000 loops, best of 3: 913 µs per loop
In [110]: %timeit df['using_to_period'] = pd.Index(df['orig']).to_period('M').to_timestamp('M')
1000 loops, best of 3: 1.95 ms per loop
In [111]: %timeit df['using_dateoffset'] = df['orig'] + offsets.DateOffset(day=31)
1 loops, best of 3: 240 ms per loop
In [112]: %timeit df['using_rollforward'] = df['orig'].apply(lambda x: offsets.MonthEnd().rollforward(x))
1 loops, best of 3: 813 ms per loop
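The datetime64 arithmetic in the fastest variant may be easier to follow when it is unpacked into steps. The following is a sketch on a plain NumPy array rather than a DataFrame column; the intermediate names are only for illustration.

```python
import numpy as np

dates = np.array(['2014-01-15', '2014-01-31', '2014-02-01'],
                 dtype='datetime64[D]')

# Step 1: truncate each date to month resolution (day information is dropped).
month_start = dates.astype('datetime64[M]')

# Step 2: move to the following month, still at month resolution.
next_month = month_start + np.timedelta64(1, 'M')

# Step 3: subtracting a day forces day resolution; the month is treated as
# its first day, so stepping back one day lands on the previous month's end.
month_end = next_month - np.timedelta64(1, 'D')

print(month_end)
```

Because every step is a vectorized NumPy operation with no per-element Python calls, it is plausible that this is why the variant times so well above.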
Note that using_datetime64, using_to_period, and using_dateoffset work even if df['orig'] contains NaT values. using_rollforward raises ValueError: cannot convert float NaN to integer.
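As a rough check of the NaT claim for the datetime64 variant, the arithmetic can be wrapped in a small helper and run on a Series with a missing value. The helper name month_end is hypothetical and not part of pandas; this is a sketch, not a definitive implementation.

```python
import numpy as np
import pandas as pd

def month_end(series):
    """Hypothetical helper: month end via datetime64 arithmetic."""
    # Truncate to month resolution; NaT stays NaT through the cast.
    months = series.values.astype('datetime64[M]')
    # Next month minus one day gives the last day of the original month.
    ends = months + np.timedelta64(1, 'M') - np.timedelta64(1, 'D')
    return pd.Series(ends, index=series.index)

s = pd.Series(pd.to_datetime(['2014-01-15', None, '2014-02-01']))
result = month_end(s)
print(result)
```

The missing value comes back as NaT instead of raising, since NaT propagates through datetime64 arithmetic.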
Upvotes: 4