Reputation: 55
I have two dataframes. The dataframe A
contains the information about a trip:
Id Name StartTime EndTime
0 201 Car1 2016-01-01 00:00:00 2016-01-01 00:43:05
1 205 Car2 2016-01-01 00:10:00 2016-01-01 00:45:05
2 345 Car3 2016-01-01 00:01:00 2016-01-01 00:47:05
3 456 Car2 2016-01-02 00:00:00 2016-01-02 02:45:05
4 432 Car1 2016-01-02 00:00:00 2016-01-02 02:47:05
The dataframe B
contains timestamps during the trip (like a gps).
Name Timestamp
0 Car1 2016-01-01 00:05:00
1 Car1 2016-01-01 00:05:24
2 Car2 2016-01-01 00:10:04
3 Car3 2016-01-01 00:01:04
4 Car2 2016-01-01 00:10:34
5 Car1 2016-01-01 00:05:54
I need to add a column to Dataframe B
called Id
which picks up the Id from dataframe A
based on the name and the start and end-times in frame A
. Both these dataframes are really large, so I need an efficient way of doing this.
Upvotes: 4
Views: 183
Reputation: 61957
This appears to be a situation for the recently added merge_asof
. That takes the left dataframe (dataframe B here) and does the following:
For each row in the left DataFrame, we select the last row in the right DataFrame whose 'on' key is less than or equal to the left's key. Both DataFrames must be sorted by the key.
make sure we have timestamp data
dfa['StartTime'] = pd.to_datetime(dfa.StartTime)
dfa['EndTime'] = pd.to_datetime(dfa.EndTime)
dfb['Timestamp'] = pd.to_datetime(dfb.Timestamp)
Then sort joining columns
dfb = dfb.sort_values('Timestamp')
dfa = dfa.sort_values('StartTime')
Perfom asof merge by 'Name'
pd.merge_asof(dfb, dfa, left_on='Timestamp', right_on='StartTime', by='Name')
Output
Name Timestamp Id StartTime EndTime
0 Car3 2016-01-01 00:01:04 345 2016-01-01 00:01:00 2016-01-01 00:47:05
1 Car1 2016-01-01 00:05:00 201 2016-01-01 00:00:00 2016-01-01 00:43:05
2 Car1 2016-01-01 00:05:24 201 2016-01-01 00:00:00 2016-01-01 00:43:05
3 Car1 2016-01-01 00:05:54 201 2016-01-01 00:00:00 2016-01-01 00:43:05
4 Car2 2016-01-01 00:10:04 205 2016-01-01 00:10:00 2016-01-01 00:45:05
5 Car2 2016-01-01 00:10:34 205 2016-01-01 00:10:00 2016-01-01 00:45:05
Upvotes: 1
Reputation: 862611
I think you need merge
with outer join on column Name
, then filter by boolean indexing
and last remove columns by drop
:
df = pd.merge(df1, df2, on='Name', how='outer')
df = df[(df.StartTime <= df.Timestamp) & (df.EndTime >= df.Timestamp)]
df = df.drop(['StartTime','EndTime'], axis=1)
print (df)
Id Name Timestamp
0 201 Car1 2016-01-01 00:05:00
1 201 Car1 2016-01-01 00:05:24
2 201 Car1 2016-01-01 00:05:54
6 205 Car2 2016-01-01 00:10:04
7 205 Car2 2016-01-01 00:10:34
10 345 Car3 2016-01-01 00:01:04
Upvotes: 1