S.Zhong
S.Zhong

Reputation: 101

Merging two dataframes with overlapping and non-overlapping column names

I have two dataframes of the same row length, but with a different number of columns (the first dataframe has ~57, while the second one has ~28).

The first few columns in each dataframe are identical to each other (same name/values), with there being only some overlap between the trailing "Category" columns in each dataframe (sometime same name, sometimes same value). In the "Category" columns in each dataframe, the values are either 1 or 0.

I want to merge both dataframes together on the first column (Document #); however, in the instance that a "Category" column name is the same between both dataframes, for each row's cell in the duplicate columns, I want to take the max number as the final value.

I've included a screenshot below of some simplified sample data that I'm working with (actual data has many more "Category" columns). In the data, you can see that df1 and df2 have two overlapping Category columns in red (all values are int). I want to merge both together (while keeping the non-overlapping columns), and have the values in the overlapping columns update to the max value between the two columns.

enter image description here

How would I do this? I've tried messing around with the combine_first function, but I want to be able to join in the non-overlapping columns as well. Also want to try to have this done programmatically, as the number of columns are too big to manually look through and slice, as well as the fact that the column names, position, and number dynamically change based on what data I'm feeding into it.

Upvotes: 2

Views: 1312

Answers (1)

You can merge both dataframes by "Document #", then perform the desided calculation across common categories using axis=1. Here's a scalable approach:

# Merge dataframes
df = df1.merge(df2, on='Document #')

# Get common category names across datasets
common_cat_cols = df1.columns[df1.columns.str.startswith('Category') & (df1.columns.isin(df2.columns))]
# Get common category names across datasets with suffixes for posterior removal
common_cat_cols2remove = [cat + suffix for cat in common_cat_cols for suffix in ['_x','_y']]

# Calculate max value per common category 
common_cat_max = {col: df[[f'{col}_x', f'{col}_y']].max(axis=1) for col in common_cat_cols}

# Insert the target-columns in the merged dataframe
df = df.assign(**common_cat_max).drop(columns=common_cat_cols2remove)

# Now, we just re-order the columns to get the expected output
new_col_order = sum([['Document #'], ['Document Text_' + s for s in ['x','y']] + sorted(df.columns[df.columns.str.startswith('Category')])], [])
df = df[new_col_order]
print(df)

Output:

   Document # Document Text_x Document Text_y  Category A  Category B   Category C  Category D  Category E  Category F  Category G  
0           1               a               a           1           0            1           0           1           0           1  
1           2              as              as           1           0            1           0           1           0           1  
2           3             asd             asd           1           1            1           0           1           0           1  
3           4            asdf            asdf           1           1            1           0           0           1           1  
4           5           asdfa           asdfa           1           0            1           1           1           1           1  
5           6          asdfas          asdfas           1           0            1           1           1           0           0  
6           7         asdfasd         asdfasd           0           1            1           1           1           0           0  
7           8        asdfasdf        asdfasdf           1           1            1           1           1           0           0 

As you can see, I kept "Document Text" untouched since I don't know what do you wanna do with them. Best!


Data:

df1 = pd.DataFrame({
    'Document #': range(1,9),
    'Document Text': ['a','as','asd','asdf','asdfa','asdfas','asdfasd','asdfasdf'],
    'Category A': [1,0,0,0,0,0,0,1],
    'Category B': [0,0,1,1,0,0,1,1],
    'Category C': [1,0,0,0,0,0,0,0],
    'Category D': [0,0,0,0,1,1,1,1],
    'Category E': [1,1,1,0,1,1,1,1]
})

df2 = pd.DataFrame({
    'Document #': range(1,9),
    'Document Text': ['a','as','asd','asdf','asdfa','asdfas','asdfasd','asdfasdf'],
    'Category A': [1,1,1,1,1,1,0,0],
    'Category C': [0,1,1,1,1,1,1,1],
    'Category F': [0,0,0,1,1,0,0,0],
    'Category G': [1,1,1,1,1,0,0,0]
})

Upvotes: 1

Related Questions