Reputation: 59
I have 2 dataframes df and df1 -
df-
|system_time|status|id|date|
|2022-03-04T07:52:26Z|Pending|772|2022-03-04 07:52:26+00:00|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42+00:00|
|2022-08-13T01:34:44Z|Pending|1052|2022-08-13 01:34:44+00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000+00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000+00:00|
|2022-03-04T17:51:15.025Z|Pending|772|2022-03-04 17:51:15.025000+00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000+00:00|
df1-
|id|task_status|
|1052|Complete|
|889|Pending|
|772|Complete|
|963|Pending|
Type of columns in df -
system_time - object
status - object
id - int64
date - object
I want to apply a lookup here from df into df1. If id matches in df and df1,status in df should be of task_status from df1. As there are duplicate records in df, need to get the latest record and update the status as of df1 else keep status same as df for unmatched id's. In df, I have converted the system_time into date column using -
df['date']=pd.to_datetime(df['system_time'])
Expected output -
|system_time|status|id|date|
|2022-06-22T17:52:42Z|Pending|963|2022-06-22 17:52:42+00:00|
|2022-08-24T01:46:31.115Z|Complete|1052|2022-08-24 01:46:31.115000+00:00|
|2022-08-14T06:04:54.736Z|Pending|1053|2022-08-14 06:04:54.736000+00:00|
|2022-03-04T17:51:15.025Z|Complete|772|2022-03-04 17:51:15.025000+00:00|
|2022-08-24T06:24:54.736Z|Inprogress|999|2022-08-24 06:24:54.736000+00:00|
Upvotes: 2
Views: 1486
Reputation: 11650
here is one way to do it using map
# map the df1 status to df when ID is found
df['status']=df['i'].map(df1.set_index(['id'])['task_status'])
df
system_time status id date
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26+00:00
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42+00:00
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44+00:00
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000+00:00
4 2022-08-14T06:04:54.736Z NaN 1053 2022-08-14 06:04:54.736000+00:00
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000+00:00
6 2022-08-24T06:24:54.736Z NaN 999 2022-08-24 06:24:54.736000+00:00
Alternately, if you like to update only when status is found in DF1
df['status']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).notna()),
(df['id'].map(df1.set_index(['id'])['task_status'])) )
df
system_time status id date
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26+00:00
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42+00:00
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44+00:00
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000+00:00
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000+00:00
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000+00:00
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000+00:00
for unmatched,
#update 'unmatched' column as unmatched when id is NOT found.
#When its found, keep the status as-is.
#you may want to keep the previous one and this one together
df['unmatched']=df['status'].mask((df['id'].map(df1.set_index(['id'])['task_status']).isna()),
'unmatched' )
df
system_time status id date unmatched
0 2022-03-04T07:52:26Z Complete 772 2022-03-04 07:52:26+00:00 Complete
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42+00:00 Pending
2 2022-08-13T01:34:44Z Complete 1052 2022-08-13 01:34:44+00:00 Complete
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000+00:00 Complete
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000+00:00 unmatched
5 2022-03-04T17:51:15.025Z Complete 772 2022-03-04 17:51:15.025000+00:00 Complete
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000+00:00 unmatched
to keep the last row based on the system-time
df.sort_values('system_time').drop_duplicates(subset=['id'], keep='last')
system_time status id date unmatched
5 2022-03-04T17:51:15.025Z Pending 772 2022-03-04 17:51:15.025000+00:00 Pending
1 2022-06-22T17:52:42Z Pending 963 2022-06-22 17:52:42+00:00 Pending
4 2022-08-14T06:04:54.736Z Pending 1053 2022-08-14 06:04:54.736000+00:00 unmatched
3 2022-08-24T01:46:31.115Z Complete 1052 2022-08-24 01:46:31.115000+00:00 Complete
6 2022-08-24T06:24:54.736Z Inprogress 999 2022-08-24 06:24:54.736000+00:00 unmatched
Upvotes: 2
Reputation: 6564
You need to use merge
:
df = df.merge(right=df1, on='id',how='left')
Upvotes: 0