Reputation: 5221
I'm trying to overlay two dataframes, overwriting values in one with values from the other when the values in several columns match. I don't care about the indexes of the dataframes; what matters is that the content in those columns matches for the overwrite to occur. It would also be acceptable to remove the matching rows from one dataframe and just concat the remaining rows.
Below are two mocked-up dataframes with simplified column names for brevity. The first is the "background" dataframe, the one on which I wish to overlay content from the other:
Name1 Name2 Id1 Id2 Attr1 Attr2
3 A B 0 0 None None
1 A B 0 1 None None
7 A B 0 2 None None
15 A B 0 3 None None
13 A B 1 0 None None
14 A B 1 1 None None
0 A B 1 2 None None
4 A B 1 3 None None
10 A B 2 0 None None
9 A B 2 1 None None
12 A B 2 2 None None
11 A B 2 3 None None
6 A B 3 0 None None
8 A B 3 1 None None
2 A B 3 2 None None
5 A B 3 3 None None
The second contains some data
Name1 Name2 Id1 Id2 Attr1 Attr2
0 A B 0 0 LEVEL_A 1.00
1 A B 0 1 LEVEL_A 0.97
2 A B 1 1 LEVEL_A 1.00
3 A B 2 2 LEVEL_A 1.00
4 A B 3 3 LEVEL_A 1.00
I included the indices to show that they do not match. The expected result would contain this data:
Name1 Name2 Id1 Id2 Attr1 Attr2
3 A B 0 0 LEVEL_A 1.00 # Name1,Name2,Id1,Id2 match
1 A B 0 1 LEVEL_A 0.97
7 A B 0 2 None None
15 A B 0 3 None None
13 A B 1 0 None None
14 A B 1 1 LEVEL_A 1.00
0 A B 1 2 None None
4 A B 1 3 None None
10 A B 2 0 None None
9 A B 2 1 None None
12 A B 2 2 LEVEL_A 1.00
11 A B 2 3 None None
6 A B 3 0 None None
8 A B 3 1 None None
2 A B 3 2 None None
5 A B 3 3 LEVEL_A 1.00
Note: it's fine if the rows are in a different order in the end result. That would be expected if one found all rows in the first dataframe whose values in the columns Name1, Name2, Id1, Id2 are also present in the second dataframe.
Bonus: I'd be very interested to know whether this is possible without a sentinel value such as None, and if so, how?
I keep getting it wrong because combine_first, isin et al. all look at indices. I'm dangerously close to doing this with some iterrows-based approach, but that feels hacky and I don't want to unit-test my unit tests. Pandemic brain is making me not think as well as usual.
Upvotes: 0
Views: 322
Reputation: 345
First split the columns into the keys to match on and the values to copy over:
check_cols = 'Name1 Name2 Id1 Id2'.split()
val_cols = 'Attr1 Attr2'.split()
Then create a unique id from the key columns in each frame:
background_df['full_id'] = background_df[check_cols].apply(lambda row: '_'.join([str(_).strip() for _ in row]), axis=1)
data_df['full_id'] = data_df[check_cols].apply(lambda row: '_'.join([str(_).strip() for _ in row]), axis=1)
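As a side note, the same ids can be built without the row-wise lambda, using astype(str) plus a column-wise join, which tends to be faster on large frames. A small self-contained sketch (the two-row frame here is just illustrative, not the question's data):

```python
import pandas as pd

check_cols = ['Name1', 'Name2', 'Id1', 'Id2']

# Illustrative two-row frame with the same key columns
df = pd.DataFrame({'Name1': ['A', 'A'], 'Name2': ['B', 'B'],
                   'Id1': [0, 1], 'Id2': [3, 2]})

# Cast every key column to string, then join each row with '_'
df['full_id'] = df[check_cols].astype(str).agg('_'.join, axis=1)
```

This yields ids like 'A_B_0_3', the same shape as the lambda version produces.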
Merge on the unique id. You can change the suffixes to help you drop the extra columns later:
background_df.merge(data_df, how='left', on='full_id', suffixes=('_x', '_y'))
Then drop whatever columns you don't want, or just start the merge with fewer columns:
background_df[check_cols+['full_id',]].merge(data_df[val_cols+['full_id',]], how='left', on='full_id', suffixes=('_x', '_y'))
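In fact, merge accepts a list of columns, so the helper full_id isn't strictly necessary, and this also answers the bonus: unmatched rows simply come back as NaN, no sentinel required. A minimal runnable sketch using cut-down stand-ins for the question's frames (the odd row labels on background_df are there to show that merge ignores them):

```python
import pandas as pd

check_cols = ['Name1', 'Name2', 'Id1', 'Id2']

# Cut-down stand-ins for the question's two frames
background_df = pd.DataFrame({
    'Name1': ['A'] * 4, 'Name2': ['B'] * 4,
    'Id1': [0, 0, 1, 1], 'Id2': [0, 1, 0, 1],
    'Attr1': [None] * 4, 'Attr2': [None] * 4,
}, index=[3, 1, 13, 14])
data_df = pd.DataFrame({
    'Name1': ['A', 'A'], 'Name2': ['B', 'B'],
    'Id1': [0, 1], 'Id2': [0, 1],
    'Attr1': ['LEVEL_A', 'LEVEL_A'], 'Attr2': [1.00, 1.00],
})

# A left merge on the key columns keeps every background row;
# rows with no match in data_df get NaN in Attr1/Attr2
result = background_df[check_cols].merge(data_df, how='left', on=check_cols)
```

Here (0, 0) and (1, 1) pick up LEVEL_A from data_df while the other two rows stay NaN.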
There is one thing you don't mention: do you have data in Attr1 and Attr2 on background_df that you would wish to keep instead of what might be in data_df? If so, this would need further checking on both the _x and _y attribute columns.
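If background_df does carry Attr values worth keeping, one option is to move the key columns into the index and then use combine_first, which aligns on content rather than on the original row labels; existing background values win and only its missing cells are filled. A sketch under that assumption, with a hypothetical 'KEEP_ME' value standing in for pre-existing data:

```python
import pandas as pd

check_cols = ['Name1', 'Name2', 'Id1', 'Id2']

# Cut-down stand-ins; 'KEEP_ME' simulates a value already present
# in background_df that should survive the overlay
background_df = pd.DataFrame({
    'Name1': ['A'] * 3, 'Name2': ['B'] * 3,
    'Id1': [0, 0, 1], 'Id2': [0, 1, 1],
    'Attr1': ['KEEP_ME', None, None],
    'Attr2': [0.5, None, None],
})
data_df = pd.DataFrame({
    'Name1': ['A', 'A'], 'Name2': ['B', 'B'],
    'Id1': [0, 1], 'Id2': [0, 1],
    'Attr1': ['LEVEL_A', 'LEVEL_A'], 'Attr2': [1.00, 1.00],
})

# Indexing both frames by the key columns makes combine_first align
# on content; background values take priority over data_df values
result = (background_df.set_index(check_cols)
          .combine_first(data_df.set_index(check_cols))
          .reset_index())
```

Row (0, 0) keeps 'KEEP_ME'/0.5 even though data_df has LEVEL_A there, row (0, 1) stays NaN, and row (1, 1) is filled from data_df.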
I like to use iterrows, but you specifically said no to that.
Upvotes: 1