tural
tural

Reputation: 320

Manipulate date field pandas

What is a quickest way to manipulate date field in pandas data frame, such as replacing date's day value to the last day of the month. Currently I can do following but it takes long to run.

import calendar
consumption_data_monthly.DATE = consumption_data_monthly.DATE.apply(lambda x: x.replace(day=calendar.monthrange(x.year,x.month)[1]))

Upvotes: 3

Views: 509

Answers (2)

EdChum
EdChum

Reputation: 394041

Use DateOffset to add the month end to your dates:

In [25]:
df['DATE'] + MonthEnd()
from pandas.tseries.offsets import *
df['DATE'] + MonthEnd()

Out[25]:
0   2012-01-31
1   2012-02-29
2   2012-03-31
3   2012-04-30
4   2012-05-31
5   2012-06-30
6   2012-07-31
Name: DATE, dtype: datetime64[ns]

Timings

In [26]:
def four(df): 
    df['DATE'] = df['DATE'].values.astype('datetime64[M]') + np.array([1], dtype='timedelta64[M]') - np.array([1], dtype='timedelta64[D]')
    return df
​
%timeit four(df)
%timeit df['DATE'] = MonthEnd()
1000 loops, best of 3: 206 µs per loop
The slowest run took 272.78 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 139 µs per loop

You can see that using offsets is faster than the suggested solution

On a 70K row df the timings are:

100 loops, best of 3: 5.69 ms per loop
100 loops, best of 3: 8 ms per loop

So the other solution is faster for larger dfs, here the syntax is cleaner

Upvotes: 1

jezrael
jezrael

Reputation: 862641

I think calendar.monthrange is very effective and very quick, but vectorization is faster.

You can try convert column DATE to month's numpy array by values and astype, then add next month and substract one day:

df['DATE'] = df['DATE'].values.astype('datetime64[M]') + 
             np.array([1], dtype='timedelta64[M]') - 
             np.array([1], dtype='timedelta64[D]')

Timing len(df)=70000:

In [468]: %timeit one(df)
1 loops, best of 3: 881 ms per loop

In [469]: %timeit two(df1)
1 loops, best of 3: 733 ms per loop

In [470]: %timeit three(df2)
1 loops, best of 3: 1.24 s per loop

In [471]: %timeit four(df3)
100 loops, best of 3: 6.61 ms per loop

In [472]: %timeit five(df4)
100 loops, best of 3: 8.76 ms per loop

Code:

import pandas as pd
import numpy as np
import calendar
import datetime
from pandas.tseries.offsets import *

d = {'DATE': {0: pd.Timestamp('2012-01-05 00:00:00'), 1: pd.Timestamp('2012-02-08 00:00:00'), 2: pd.Timestamp('2012-03-11 00:00:00'), 3: pd.Timestamp('2012-04-06 00:00:00'), 4: pd.Timestamp('2012-05-04 00:00:00'), 5: pd.Timestamp('2012-06-20 00:00:00'), 6: pd.Timestamp('2012-07-09 00:00:00')}}
df = pd.DataFrame(d)
print df

df =  pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
df4 = df.copy()

def one(df):
    df.DATE = df.DATE.apply(lambda x: x.replace(day=calendar.monthrange(x.year,x.month)[1]))
    return df

def two(df):    
    df['DATE'] = df['DATE'].map(lambda x: datetime.datetime(x.year, x.month, calendar.monthrange(x.year,x.month)[1]))
    return df

def three(df):    
    df['DATE'] = df['DATE'].map(lambda x: datetime.datetime(x.year, x.month, x.days_in_month))
    return df

def four(df): 
    df['DATE'] = df['DATE'].values.astype('datetime64[M]') + np.array([1], dtype='timedelta64[M]') - np.array([1], dtype='timedelta64[D]')
    return df

def five(df):    
    df['DATE'] = df['DATE'] + MonthEnd()
    return df

print one(df).head()
print two(df1).head()
print three(df2).head()
print four(df4).head()

Timing len(df)=7:

In [475]: %timeit one(df)
The slowest run took 11.16 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 379 µs per loop

In [476]: %timeit two(df1)
The slowest run took 11.93 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 336 µs per loop

In [477]: %timeit three(df2)
1000 loops, best of 3: 398 µs per loop

In [478]: %timeit four(df3)
The slowest run took 19.07 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 159 µs per loop

In [479]: %timeit five(df4)
The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 1.26 ms per loop

Upvotes: 2

Related Questions