Filip Allberg
Filip Allberg

Reputation: 5221

Overlaying two Pandas DataFrames when one is partial

I'm trying to either overlay two dataframes, overwriting one with values from the other if the value in several columns match. I don't care about the indexes of the dataframes, what is pertinent is that content in the columns match for the overwrite to occur. It'd also be acceptable to remove the rows from one dataframe when this match occurs and just concat the remaining rows.

Below are two mocked up dataframes with simplified column names for brevity. The first is the "background" dataframe, the one on which I wish to overlay content from the other,

   Name1 Name2  Id1  Id2 Attr1 Attr2
3      A     B    0    0  None  None
1      A     B    0    1  None  None
7      A     B    0    2  None  None
15     A     B    0    3  None  None
13     A     B    1    0  None  None
14     A     B    1    1  None  None
0      A     B    1    2  None  None
4      A     B    1    3  None  None
10     A     B    2    0  None  None
9      A     B    2    1  None  None
12     A     B    2    2  None  None
11     A     B    2    3  None  None
6      A     B    3    0  None  None
8      A     B    3    1  None  None
2      A     B    3    2  None  None
5      A     B    3    3  None  None

The second contains some data

  Name1 Name2  Id1  Id2     Attr1  Attr2
0     A     B    0    0   LEVEL_A   1.00
1     A     B    0    1   LEVEL_A   0.97
2     A     B    1    1   LEVEL_A   1.00
3     A     B    2    2   LEVEL_A   1.00
4     A     B    3    3   LEVEL_A   1.00

I included the indices to show that they do not match. The expected result would contain this data:

   Name1 Name2  Id1  Id2 Attr1 Attr2
3      A     B    0    0  Level_A  1.00 # Name1,Name2,Id1,Id2 matches
1      A     B    0    1  Level_A  0.97
7      A     B    0    2  None  None
15     A     B    0    3  None  None
13     A     B    1    0  None  None
14     A     B    1    1  Level_A  1.00
0      A     B    1    2  None  None
4      A     B    1    3  None  None
10     A     B    2    0  None  None
9      A     B    2    1  None  None
12     A     B    2    2  Level_A  1.00
11     A     B    2    3  None  None
6      A     B    3    0  None  None
8      A     B    3    1  None  None
2      A     B    3    2  None  None
5      A     B    3    3  Level_A  1.00

Note: it's fine if the rows are in a different order in the end result. That would be expected if one finds all rows in the first dataframe such that for any given row its values for the columns Name1,Name2,Id1,Id2 are also present in the second dataframe.

Bonus: I'd be very interested to know if it is possible to do this if one didn't use a sentinel value such as None and if possible, how?

I keep getting it wrong because combine_first, isin et al. all look at indices. Am dangerously close to doing this using some iterrow-approach but that feels hacky and I don't want to unit-test my unit-tests. Pandemic brain is making me not think as good.

Upvotes: 0

Views: 322

Answers (1)

Animismus
Animismus

Reputation: 345

check_cols = 'Name1 Name2 Id1 Id2'.split()
val_cols = 'Attr1 Attr2'.split()

Create unique ids

background_df['full_id'] = background_df[check_cols].apply(lambda row: '_'.join([str(_).strip() for _ in row]), axis=1)
data_df['full_id'] = data_df[check_cols].apply(lambda row: '_'.join([str(_).strip() for _ in row]), axis=1)

Merge on the unique id. You can change suffixes to help you drop the extra columns later

background_df.merge(data_df, how='left', on='full_id', suffixes=('_x', '_y'))

then you would drop whatever columns you don't want OR just start the merge with fewer columns.

background_df[check_cols+['full_id',]].merge(data_df[val_cols+['full_id',]], how='left', on='full_id', suffixes=('_x', '_y'))

There is the extra thing you do not talk about. Do you have info on cols Attr1 and Attr2 on background_df that you would wish to keep instead of what might be in data_df? If so, then this would need further checking on both _x and _y attr cols.

I like to use iterrows, but you specifically said no to that.

Upvotes: 1

Related Questions