Reputation: 571
I am trying to write a highly efficient function that would take an average size dataframe (~5000 rows) and return a dataframe with column of the latest year (and same index) such that for each date index of the original dataframe the month containing that date is between some pre-specified start date (st_d) and end date (end_d). I wrote a code where the year is decremented till the month for a particular dateindex is within the desired range. However, it is really slow. For the dataframe with only 366 entries it takes ~0.2s. I need to make it at least an order of magnitude faster so that I can repeatedly apply it to tens of thousands of dataframes. I would very much appreciate any suggestions for this.
import pandas as pd
import numpy as np
import time
from pandas.tseries.offsets import MonthEnd
def year_replace(st_d, end_d, x):
tmp = time.perf_counter()
def prior_year(d):
# 100 is number of the years back, more than enough.
for i_t in range(100):
#The month should have been fully seen in one of the data years.
t_start = pd.to_datetime(str(d.month) + '/' + str(end_d.year - i_t), format="%m/%Y")
t_end = t_start + MonthEnd(1)
if t_start <= end_d and t_start >= st_d and t_end <= end_d and t_end >= st_d:
break
if i_t < 99:
return t_start.year
else:
raise BadDataException("Not enough data for Gradient Boosted tree.")
output = pd.Series(index = x.index, data = x.index.map(lambda tt: prior_year(tt)), name = 'year')
print("time for single dataframe replacement = ", time.perf_counter() - tmp)
return output
i = pd.date_range('01-01-2019', '01-01-2020')
x = pd.DataFrame(index = i, data=np.full(len(i), 0))
st_d = pd.to_datetime('01/2016', format="%m/%Y")
end_d = pd.to_datetime('01/2018', format="%m/%Y")
year_replace(st_d, end_d, x)
Upvotes: 1
Views: 256
Reputation: 10992
My advice is: avoid loop whenever you can and check out if an easier way is available.
If I do understand what you aim to do is:
For given
start
andstop
timestamps, find the latest (higher) timestampt
where month is given from index andstart <= t <= stop
I believe this can be formalized as follow (I kept your function signature for conveniance):
def f(start, stop, x):
assert start < stop
tmp = time.perf_counter()
def y(d):
# Check current year:
if start <= d.replace(day=1, year=stop.year) <= stop:
return stop.year
# Check previous year:
if start <= d.replace(day=1, year=stop.year-1) <= stop:
return stop.year-1
# Otherwise fail:
raise TypeError("Ooops")
# Apply to index:
df = pd.Series(index=x.index, data=x.index.map(lambda t: y(t)), name='year')
print("Tick: ", time.perf_counter() - tmp)
return df
It seems to execute faster as requested (almost two decades, we should benchmark to be sure, eg.: with timeit
):
Tick: 0.004744200000004639
There is no need to iterate, you can just check current and previous year. If it fails, it cannot exist a timestamp fulfilling your requirements.
If the day must be kept, then just remove the day=1
in replace
method. If you require cut criteria not being equal then modify inequalities accordingly. The following function:
def y(d):
if start < d.replace(year=stop.year) < stop:
return stop.year
if start < d.replace(year=stop.year-1) < stop:
return stop.year-1
raise TypeError("Ooops")
Returns the same dataframe as yours.
Upvotes: 1