Alexis Eggermont
Alexis Eggermont

Reputation: 8155

Find closest event in one dataframe before an event in another dataframe

I've got the following dataframe df1 and df2. I would like get df3 out of joining them as described below.

df1 and df2 both contain events with timestamps, for specific machines.

In df3 I would like to have all of df1, but also add for each line the event timestamp in df2 for the same machine that is the closest to the timestamp of the line in df1, but BEFORE it. If there has not been a df2 event before the df1 event, then that new value can be blank.

So it's some sort of merging operation, except that the link between the two tables is an equality on "machine" but an inequality which should be minimized in one direction on the timestamp.

Here is the code to generate the sample dataframes:

import pandas as pd
df1=pd.DataFrame({"Machine":[0,2,3,0,2,3],"Status":["blah","foo","bar","blah","foo","bar"],"Date-time":["2014-02-20 11:00:19.0","2014-02-21 12:29:55.0","2014-02-20 11:00:21.0","2014-02-19 09:10:19.0","2014-02-18 12:19:47.0","2014-02-20 1:33:00.0"]})
df1["Date-time"]=pd.to_datetime(df1["Date-time"])

df2=pd.DataFrame({"Machine":[0,2,3,0,2,3],"Date of maintenance":["2014-02-20","2014-02-21","2014-02-20","2014-02-10","2014-02-07","2014-02-03"]})
df2["Date of maintenance"]=pd.to_datetime(df2["Date of maintenance"])

df3=pd.DataFrame({"Machine":[0,2,3,0,2,3],"Status":["blah","foo","bar","blah","foo","bar"],"Date-time":["2014-02-20 11:00:19.0","2014-02-21 12:29:55.0","2014-02-20 11:00:21.0","2014-02-19 09:10:19.0","2014-02-18 12:19:47.0","2014-02-20 1:33:00.0"],"Date of last maintenance":["2014-02-20","2014-02-21","2014-02-20","2014-02-10","2014-02-07","2014-02-20"]})

EDIT:

So I've got the following down. I have some duplicates in there but I should be able to easily take care of them. The big part missing is how to make this match by machine, rather than for the entire table.

import pandas as pd
import numpy as np
df1=pd.DataFrame({"Machine":[0,2,3,0,2,3,0,1,0],"Status":["blah","foo","bar","blah","foo","bar","blah","foo","bar"],"Date-time":["2014-02-20 11:00:19.0","2014-02-21 12:29:55.0","2014-02-20 11:00:21.0","2014-02-19 09:10:19.0","2014-02-18 12:19:47.0","2014-02-20 1:33:00.0","2014-02-07 04:10:19.0","2014-02-19 11:11:47.0","2014-03-20 1:23:00.0"]})
df1["Date-time"]=pd.to_datetime(df1["Date-time"])
df1=df1.sort(["Date-time"])
df1=df1.reset_index(drop=True)

df2=pd.DataFrame({"Machine":[0,2,3,0,2,3],"Date of maintenance":["2014-02-20","2014-02-21","2014-02-20","2014-02-10","2014-02-07","2014-02-03"]})
df2["Date of maintenance"]=pd.to_datetime(df2["Date of maintenance"])
df2=df2.sort(["Date of maintenance"])
df2=df2.reset_index(drop=True)


df2["searchsortindex"]=np.searchsorted(np.array(df1["Date-time"]), np.array(df2["Date of maintenance"]), side='left', sorter=None)
df3=pd.merge(df1,df2,how='left',left_index=True,right_on='searchsortindex')

Upvotes: 0

Views: 286

Answers (1)

John Zwinck
John Zwinck

Reputation: 249303

You can use numpy.searchsorted() for this. It assumes you have a sorted array (e.g. timestamps) and a second array which you want to "locate" in the first array.

Upvotes: 1

Related Questions