Reputation: 747
I have an hourly weather dataset that I imported into a pandas dataframe. Among other columns, this dataframe has the following two:
wd = pd.read_csv('hourlyweather.csv') # wd is short for Weather Data
wd['Date and Time'] = wd['Date and Time'].astype('datetime64[ns]')
wd['Date and Time (int)'] = wd['Date and Time'].astype('int64')
wd['Temperature Celsius'] = wd['Temperature Celsius'].astype('double')
I also have another dataset (hourly car accidents) that has different data but similar columns:
cd = pd.read_csv('accidents.csv') # cd is short for Crash Data
cd['Occurred Date Time'] = cd['Occurred Date Time'].astype('datetime64[ns]')
cd['Occurred Date Time (int)'] = cd['Occurred Date Time'].astype('int64')
cd.insert(6, "Temp in Celsius", " ")
My objective is to find the weather temperature for each car accident. Since I do not have the exact hourly temperature, I want to find the closest temperature for each accident from the weather dataset. So, for each accident, I want to find the closest date & time in the weather dataset and copy that row's temperature into the corresponding column of the accidents dataframe.
I tried to do this with a for loop, and it works, but it takes a very long time because I have more than 1 million car accidents. This is my loop:
for i in range(len(cd['Occurred Date Time (int)'])):
    sourceint = cd['Occurred Date Time (int)'][i]
    idx = wd['Date and Time (int)'].sub(sourceint).abs().idxmin()
    cd.loc[i, "Temp in Celsius"] = wd['Temperature Celsius'][idx]
Is there a more efficient way to do this without a for loop, one that executes faster on this number of records?
Here are some demos of the CSV files above.
Upvotes: 1
Views: 1280
Reputation: 91
You could merge the two dataframes using pd.merge_asof with direction='nearest', so each accident is matched to the closest weather timestamp (the default direction, 'backward', would only consider earlier readings). Both dataframes must be sorted on their time columns, and duplicate timestamps should be dropped from the weather data.
cd['Occurred Date Time'] = pd.to_datetime(cd['Occurred Date Time'])
wd['Date and Time'] = pd.to_datetime(wd['Date and Time'])
wd.drop_duplicates(subset=['Date and Time'], inplace=True)
wd.sort_values(by=['Date and Time'], inplace=True)
cd.sort_values(by=['Occurred Date Time'], inplace=True)
df = pd.merge_asof(cd, wd, left_on='Occurred Date Time', right_on='Date and Time', direction='nearest')
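A minimal, self-contained sketch of the nearest-match behaviour (the timestamps and temperatures here are toy values invented for the demo):

```python
import pandas as pd

# Toy hourly weather readings (real data would come from hourlyweather.csv)
wd = pd.DataFrame({
    'Date and Time': pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00', '2020-01-01 02:00']),
    'Temperature Celsius': [5.0, 6.0, 7.5],
})

# Toy accidents at arbitrary times between readings
cd = pd.DataFrame({
    'Occurred Date Time': pd.to_datetime(['2020-01-01 00:40', '2020-01-01 01:10']),
})

# Both sides sorted, then nearest-timestamp match
df = pd.merge_asof(cd.sort_values('Occurred Date Time'),
                   wd.sort_values('Date and Time'),
                   left_on='Occurred Date Time',
                   right_on='Date and Time',
                   direction='nearest')
print(df['Temperature Celsius'].tolist())  # [6.0, 6.0] - both accidents are closest to 01:00
```

Both 00:40 and 01:10 are nearer to the 01:00 reading than to any other, so both pick up 6.0.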
Upvotes: 2
Reputation: 674
If the code is not totally understandable, please let me know
df_accident['datetime'] = pd.to_datetime(df_accident['datetime'])
df_accident['year'] = df_accident['datetime'].dt.year
df_accident['month'] = df_accident['datetime'].dt.month
df_accident['day'] = df_accident['datetime'].dt.day
df_accident['hour'] = df_accident['datetime'].dt.hour
df_accident['minute'] = df_accident['datetime'].dt.minute
df_weather['datetime'] = pd.to_datetime(df_weather['datetime'])
df_weather['year'] = df_weather['datetime'].dt.year
df_weather['month'] = df_weather['datetime'].dt.month
df_weather['day'] = df_weather['datetime'].dt.day
df_weather['hour'] = df_weather['datetime'].dt.hour
df_weather['minute'] = df_weather['datetime'].dt.minute
columns = ['year','month','day','hour','minute']
matched_parts = []
remaining = df_accident
for i in range(5):
    cols = columns[:5-i]
    # Merge on progressively coarser time keys, keeping one weather row per key.
    # (Add other weather columns such as 'location' to this selection if needed.)
    weather_sel = df_weather[cols + ['temp']].drop_duplicates(subset=cols)
    joint_df = remaining.merge(weather_sel, on=cols, how='left')
    matched_parts.append(joint_df[joint_df['temp'].notna()])
    remaining = joint_df.loc[joint_df['temp'].isna(), remaining.columns]
final_df = pd.concat(matched_parts, axis=0)
final_df is the answer: each accident keeps the temperature from the finest time granularity at which a weather row matched.
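On toy data (column values invented for the demo; 'temp' stands in for the weather reading), the progressive-fallback merge can be sketched as:

```python
import pandas as pd

# Toy frames with the time parts already split out, as in the answer above.
# The first accident matches a weather row at minute granularity,
# the second only at hour granularity.
df_accident = pd.DataFrame({
    'year': [2020, 2020], 'month': [1, 1], 'day': [1, 1],
    'hour': [0, 1], 'minute': [15, 30],
})
df_weather = pd.DataFrame({
    'year': [2020, 2020], 'month': [1, 1], 'day': [1, 1],
    'hour': [0, 1], 'minute': [15, 0],
    'temp': [5.0, 6.0],
})

columns = ['year', 'month', 'day', 'hour', 'minute']
matched_parts = []
remaining = df_accident
for i in range(5):
    if remaining.empty:
        break
    cols = columns[:5 - i]
    # One weather row per (coarsened) time key, then a left merge.
    weather_sel = df_weather[cols + ['temp']].drop_duplicates(subset=cols)
    joint_df = remaining.merge(weather_sel, on=cols, how='left')
    matched_parts.append(joint_df[joint_df['temp'].notna()])
    remaining = joint_df.loc[joint_df['temp'].isna(), remaining.columns]
final_df = pd.concat(matched_parts, axis=0)
print(final_df['temp'].tolist())  # [5.0, 6.0]
```

The first accident is matched in the first pass (all five key columns); the second falls through to the coarser year/month/day/hour pass and picks up the 6.0 reading.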
Upvotes: 1