Reputation: 320
What is a quickest way to manipulate date field in pandas data frame, such as replacing date's day value to the last day of the month. Currently I can do following but it takes long to run.
import calendar
consumption_data_monthly.DATE = consumption_data_monthly.DATE.apply(lambda x: x.replace(day=calendar.monthrange(x.year,x.month)[1]))
Upvotes: 3
Views: 509
Reputation: 394041
Use DateOffset
to add the month end to your dates:
In [25]:
df['DATE'] + MonthEnd()
from pandas.tseries.offsets import *
df['DATE'] + MonthEnd()
Out[25]:
0 2012-01-31
1 2012-02-29
2 2012-03-31
3 2012-04-30
4 2012-05-31
5 2012-06-30
6 2012-07-31
Name: DATE, dtype: datetime64[ns]
Timings
In [26]:
def four(df):
df['DATE'] = df['DATE'].values.astype('datetime64[M]') + np.array([1], dtype='timedelta64[M]') - np.array([1], dtype='timedelta64[D]')
return df
%timeit four(df)
%timeit df['DATE'] = MonthEnd()
1000 loops, best of 3: 206 µs per loop
The slowest run took 272.78 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 139 µs per loop
You can see that using offsets is faster than the suggested solution
On a 70K row df the timings are:
100 loops, best of 3: 5.69 ms per loop
100 loops, best of 3: 8 ms per loop
So the other solution is faster for larger dfs, here the syntax is cleaner
Upvotes: 1
Reputation: 862641
I think calendar.monthrange
is very effective and very quick, but vectorization is faster.
You can try convert column DATE
to month's numpy
array by values
and astype
, then add next month
and substract one day
:
df['DATE'] = df['DATE'].values.astype('datetime64[M]') +
np.array([1], dtype='timedelta64[M]') -
np.array([1], dtype='timedelta64[D]')
Timing len(df)=70000
:
In [468]: %timeit one(df)
1 loops, best of 3: 881 ms per loop
In [469]: %timeit two(df1)
1 loops, best of 3: 733 ms per loop
In [470]: %timeit three(df2)
1 loops, best of 3: 1.24 s per loop
In [471]: %timeit four(df3)
100 loops, best of 3: 6.61 ms per loop
In [472]: %timeit five(df4)
100 loops, best of 3: 8.76 ms per loop
Code:
import pandas as pd
import numpy as np
import calendar
import datetime
from pandas.tseries.offsets import *
d = {'DATE': {0: pd.Timestamp('2012-01-05 00:00:00'), 1: pd.Timestamp('2012-02-08 00:00:00'), 2: pd.Timestamp('2012-03-11 00:00:00'), 3: pd.Timestamp('2012-04-06 00:00:00'), 4: pd.Timestamp('2012-05-04 00:00:00'), 5: pd.Timestamp('2012-06-20 00:00:00'), 6: pd.Timestamp('2012-07-09 00:00:00')}}
df = pd.DataFrame(d)
print df
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
df4 = df.copy()
def one(df):
df.DATE = df.DATE.apply(lambda x: x.replace(day=calendar.monthrange(x.year,x.month)[1]))
return df
def two(df):
df['DATE'] = df['DATE'].map(lambda x: datetime.datetime(x.year, x.month, calendar.monthrange(x.year,x.month)[1]))
return df
def three(df):
df['DATE'] = df['DATE'].map(lambda x: datetime.datetime(x.year, x.month, x.days_in_month))
return df
def four(df):
df['DATE'] = df['DATE'].values.astype('datetime64[M]') + np.array([1], dtype='timedelta64[M]') - np.array([1], dtype='timedelta64[D]')
return df
def five(df):
df['DATE'] = df['DATE'] + MonthEnd()
return df
print one(df).head()
print two(df1).head()
print three(df2).head()
print four(df4).head()
Timing len(df)=7
:
In [475]: %timeit one(df)
The slowest run took 11.16 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 379 µs per loop
In [476]: %timeit two(df1)
The slowest run took 11.93 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 336 µs per loop
In [477]: %timeit three(df2)
1000 loops, best of 3: 398 µs per loop
In [478]: %timeit four(df3)
The slowest run took 19.07 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 159 µs per loop
In [479]: %timeit five(df4)
The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.26 ms per loop
Upvotes: 2