Reputation: 747
I have an hourly weather dataset that I imported into a pandas dataframe. Among other columns, this dataframe has the following two:
wd = pd.read_csv('hourlyweather.csv') # wd is short for Weather Data
wd['Date and Time'] = wd['Date and Time'].astype('datetime64[ns]')
wd['Date and Time (int)'] = wd['Date and Time'].astype('int64')
wd['Temperature Celsius'] = wd['Temperature Celsius'].astype('double')
I also have another dataset (hourly car accidents) that has different data but similar columns:
cd = pd.read_csv('accidents.csv') # cd is short for Crash Data
cd['Occurred Date Time'] = cd['Occurred Date Time'].astype('datetime64[ns]')
cd['Occurred Date Time (int)'] = cd['Occurred Date Time'].astype('int64')
cd.insert(6, "Temp in Celsius", " ")
My objective is to find the weather temperature for each car accident. Since I do not have the exact hourly temperature, I want to find the closest temperature for each accident from the weather dataset. So, for each accident, I want to find the closest date & time in the weather dataset and copy that row's temperature into the corresponding column of the accidents dataframe.
I tried to do this with a for loop, and it works, but it takes a very long time because I have more than 1 million car accidents. This is my loop:
for i in range(len(cd['Occurred Date Time (int)'])):
    sourceint = cd['Occurred Date Time (int)'][i]
    idx = wd['Date and Time (int)'].sub(sourceint).abs().idxmin()
    cd.loc[i, "Temp in Celsius"] = wd['Temperature Celsius'][idx]
Is there a more efficient way to do this without a for loop, one that executes faster on this number of records?
Here are some demos of the CSV files above.
Upvotes: 1
Views: 1280
Reputation: 91
You could merge the two dataframes using pd.merge_asof with direction='nearest', so each accident is matched to the closest weather timestamp (the default direction, 'backward', would only consider earlier readings). Both dataframes must be sorted on their time columns, and duplicate timestamps should be dropped from the weather data.
cd['Occurred Date Time'] = pd.to_datetime(cd['Occurred Date Time'])
wd['Date and Time'] = pd.to_datetime(wd['Date and Time'])
wd.drop_duplicates(subset=['Date and Time'], inplace=True)
wd.sort_values(by=['Date and Time'], inplace=True)
cd.sort_values(by=['Occurred Date Time'], inplace=True)
df = pd.merge_asof(cd, wd, left_on='Occurred Date Time', right_on='Date and Time', direction='nearest')
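A minimal, self-contained sketch of the nearest-match behaviour (the timestamps and temperatures here are toy values invented for the demo):

```python
import pandas as pd

# Toy hourly weather readings (real data would come from hourlyweather.csv)
wd = pd.DataFrame({
    'Date and Time': pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00', '2020-01-01 02:00']),
    'Temperature Celsius': [5.0, 6.0, 7.5],
})

# Toy accidents at arbitrary times between readings
cd = pd.DataFrame({
    'Occurred Date Time': pd.to_datetime(['2020-01-01 00:40', '2020-01-01 01:10']),
})

# Both sides sorted, then nearest-timestamp match
df = pd.merge_asof(cd.sort_values('Occurred Date Time'),
                   wd.sort_values('Date and Time'),
                   left_on='Occurred Date Time',
                   right_on='Date and Time',
                   direction='nearest')
print(df['Temperature Celsius'].tolist())  # [6.0, 6.0] - both accidents are closest to 01:00
```

Both 00:40 and 01:10 are nearer to the 01:00 reading than to any other, so both pick up 6.0.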
Upvotes: 2
Reputation: 674
If the code is not totally understandable, please let me know
df_accident['datetime'] = pd.to_datetime(df_accident['datetime'])
df_accident['year'] = df_accident['datetime'].dt.year
df_accident['month'] = df_accident['datetime'].dt.month
df_accident['day'] = df_accident['datetime'].dt.day
df_accident['hour'] = df_accident['datetime'].dt.hour
df_accident['minute'] = df_accident['datetime'].dt.minute
df_weather['datetime'] = pd.to_datetime(df_weather['datetime'])
df_weather['year'] = df_weather['datetime'].dt.year
df_weather['month'] = df_weather['datetime'].dt.month
df_weather['day'] = df_weather['datetime'].dt.day
df_weather['hour'] = df_weather['datetime'].dt.hour
df_weather['minute'] = df_weather['datetime'].dt.minute
columns = ['year','month','day','hour','minute']
matched_parts = []
remaining = df_accident
for i in range(5):
    cols = columns[:5-i]
    # Merge on progressively coarser time keys, keeping one weather row per key.
    # (Add other weather columns such as 'location' to this selection if needed.)
    weather_sel = df_weather[cols + ['temp']].drop_duplicates(subset=cols)
    joint_df = remaining.merge(weather_sel, on=cols, how='left')
    matched_parts.append(joint_df[joint_df['temp'].notna()])
    remaining = joint_df.loc[joint_df['temp'].isna(), remaining.columns]
final_df = pd.concat(matched_parts, axis=0)
final_df is the answer: each accident keeps the temperature from the finest time granularity at which a weather row matched.
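On toy data (column values invented for the demo; 'temp' stands in for the weather reading), the progressive-fallback merge can be sketched as:

```python
import pandas as pd

# Toy frames with the time parts already split out, as in the answer above.
# The first accident matches a weather row at minute granularity,
# the second only at hour granularity.
df_accident = pd.DataFrame({
    'year': [2020, 2020], 'month': [1, 1], 'day': [1, 1],
    'hour': [0, 1], 'minute': [15, 30],
})
df_weather = pd.DataFrame({
    'year': [2020, 2020], 'month': [1, 1], 'day': [1, 1],
    'hour': [0, 1], 'minute': [15, 0],
    'temp': [5.0, 6.0],
})

columns = ['year', 'month', 'day', 'hour', 'minute']
matched_parts = []
remaining = df_accident
for i in range(5):
    if remaining.empty:
        break
    cols = columns[:5 - i]
    # One weather row per (coarsened) time key, then a left merge.
    weather_sel = df_weather[cols + ['temp']].drop_duplicates(subset=cols)
    joint_df = remaining.merge(weather_sel, on=cols, how='left')
    matched_parts.append(joint_df[joint_df['temp'].notna()])
    remaining = joint_df.loc[joint_df['temp'].isna(), remaining.columns]
final_df = pd.concat(matched_parts, axis=0)
print(final_df['temp'].tolist())  # [5.0, 6.0]
```

The first accident is matched in the first pass (all five key columns); the second falls through to the coarser year/month/day/hour pass and picks up the 6.0 reading.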
Upvotes: 1