Reputation: 11
I have a dataframe (df) with a single column of dates and a second dataframe (df_value) with three columns: a start date ('From'), an end date ('To') and an associated value. I want to create a second column in df with the correct value looked up from df_value:
import pandas as pd
df = pd.DataFrame(['30/03/2018', '01/10/2019','03/07/2020', '05/08/2020', '06/08/2020', '10/10/2020'], columns=['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y', dayfirst=True).dt.date
df_value = pd.DataFrame([['01/01/2018','31/12/2018',1.286], ['01/01/2019','30/06/2019',1.555], ['01/07/2019','31/12/2019',1.632], ['01/01/2020','31/12/2020',1.864]], columns =['From', 'To', 'Value'])
df_value['From'] = pd.to_datetime(df_value['From'], format='%d/%m/%Y', dayfirst=True).dt.date
df_value['To'] = pd.to_datetime(df_value['To'], format='%d/%m/%Y', dayfirst=True).dt.date
At the moment I do this by applying the function below to df row by row. Although this works, I feel there must be a far more efficient way of doing it:
def fixed_func(df):
    value = 0
    row_counter = 0
    while value == 0:
        if (df['Date']>= df_value.iloc[row_counter, 0]) & (df['Date']<= df_value.iloc[row_counter, 1]):
            value = df_value.iloc[row_counter, 2]
        else:
            row_counter += 1
    return value
df['Value'] = df.apply(fixed_func, axis=1)
Upvotes: 1
Views: 740
Reputation: 1932
Here is a detailed answer comparing your current approach with an alternative I propose:
# Setup data
import pandas as pd
import numpy as np
df = pd.DataFrame(['30/03/2018', '01/10/2019','03/07/2020', '05/08/2020', '06/08/2020', '10/10/2020'], columns=['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y', dayfirst=True).dt.date
print(df)
df_value = pd.DataFrame([['01/01/2018','31/12/2018',1.286], ['01/01/2019','30/06/2019',1.555], ['01/07/2019','31/12/2019',1.632], ['01/01/2020','31/12/2020',1.864]], columns =['From', 'To', 'Value'])
df_value['From'] = pd.to_datetime(df_value['From'], format='%d/%m/%Y', dayfirst=True).dt.date
df_value['To'] = pd.to_datetime(df_value['To'], format='%d/%m/%Y', dayfirst=True).dt.date
print(df_value)
df2 = df.copy()
df3 = df.copy()
Now defining your approach as a function:
def existing():
    #df = pd.concat([df, df_value], axis=1)
    def fixed_func(df):
        value = 0
        row_counter = 0
        while value == 0:
            if (df['Date']>= df_value.iloc[row_counter, 0]) & (df['Date']<= df_value.iloc[row_counter, 1]):
                value = df_value.iloc[row_counter, 2]
            else:
                row_counter += 1
        return value
    df2['Value'] = df2.apply(fixed_func, axis=1)
Here is a version proposed by me:
def proposed():
    def rangecheck(dt):
        t = df_value['Value'][(df_value['From']<=dt) & (dt<=df_value['To'])].tolist()[0]
        return t
    df3['Value'] = df3['Date'].map(rangecheck)
The timings for the two methods are as follows:
[IN]> %timeit existing()
[OUT]> 1.68 ms ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[IN]> %timeit proposed()
[OUT]> 1.97 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So on this small sample my approach is slightly slower than your existing one (1.97 ms vs 1.68 ms), but it looks more Pythonic :)
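For what it's worth, both versions above still call a Python function once per row. A vectorised alternative that neither of them uses is an interval lookup with pandas' IntervalIndex. The following is only a sketch under some assumptions: the date columns are kept as datetime64 (no .dt.date conversion), the From/To ranges do not overlap, and the names intervals, pos and lookup are mine, not from the question.

```python
import numpy as np
import pandas as pd

# Same sample data as the question, but kept as datetime64 (assumption:
# the .dt.date conversion is dropped so pandas can build intervals).
df = pd.DataFrame({'Date': ['30/03/2018', '01/10/2019', '03/07/2020',
                            '05/08/2020', '06/08/2020', '10/10/2020']})
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

df_value = pd.DataFrame([['01/01/2018', '31/12/2018', 1.286],
                         ['01/01/2019', '30/06/2019', 1.555],
                         ['01/07/2019', '31/12/2019', 1.632],
                         ['01/01/2020', '31/12/2020', 1.864]],
                        columns=['From', 'To', 'Value'])
for col in ('From', 'To'):
    df_value[col] = pd.to_datetime(df_value[col], format='%d/%m/%Y')

# Closed on both sides so that From and To themselves count as inside.
intervals = pd.IntervalIndex.from_arrays(df_value['From'], df_value['To'],
                                         closed='both')

# Position of the containing interval for each date; -1 means "no match".
pos = intervals.get_indexer(df['Date'])

# Map positions to values, leaving NaN where no interval contains the date.
lookup = df_value['Value'].to_numpy()
df['Value'] = np.where(pos >= 0, lookup[pos], np.nan)
print(df)
```

Whether this is actually faster would need timing on a realistically sized frame; with six rows the fixed overhead is likely to dominate.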
Upvotes: 0
Reputation: 4761
You can use numpy.where:
import numpy as np
date_col = df.Date.values[:, np.newaxis]
x, y = np.where((date_col >= df_value.From.values) &
                (date_col <= df_value.To.values))
df["Value"] = df_value.loc[y,"Value"].reset_index(drop = True)
#         Date  Value
#0  2018-03-30  1.286
#1  2019-10-01  1.632
#2  2020-07-03  1.864
#3  2020-08-05  1.864
#4  2020-08-06  1.864
#5  2020-10-10  1.864
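One caveat: the assignment above lines up only if every date falls in exactly one range, because the matched values are simply renumbered 0, 1, 2, ... and a date with no match would shift everything after it. A more defensive variant might write the values back through the row positions returned by np.where. This is a minimal sketch, assuming datetime64 columns instead of .dt.date objects; the out-of-range date 15/06/2021 and the name values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question plus one hypothetical date (15/06/2021)
# that falls outside every range, to show the unmatched case.
df = pd.DataFrame({'Date': pd.to_datetime(
    ['30/03/2018', '01/10/2019', '15/06/2021', '10/10/2020'],
    format='%d/%m/%Y')})

df_value = pd.DataFrame({
    'From': pd.to_datetime(['01/01/2018', '01/01/2019', '01/07/2019', '01/01/2020'],
                           format='%d/%m/%Y'),
    'To': pd.to_datetime(['31/12/2018', '30/06/2019', '31/12/2019', '31/12/2020'],
                         format='%d/%m/%Y'),
    'Value': [1.286, 1.555, 1.632, 1.864],
})

# Broadcast a (n_dates, 1) column against the (n_ranges,) bounds to get
# an (n_dates, n_ranges) boolean matrix of "date lies within range".
date_col = df['Date'].to_numpy()[:, np.newaxis]
mask = ((date_col >= df_value['From'].to_numpy()) &
        (date_col <= df_value['To'].to_numpy()))
x, y = np.where(mask)  # x: row positions in df, y: row positions in df_value

# Start from NaN and fill only the rows that matched, using x to place them.
values = np.full(len(df), np.nan)
values[x] = df_value['Value'].to_numpy()[y]
df['Value'] = values
print(df)
```

Rows without a matching range stay NaN rather than picking up a neighbour's value. If ranges could overlap, a date would match more than once and you would need to decide explicitly which match to keep.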
Upvotes: 2