Merging two dataframes without creating suffix

Question

I want to create a new dataframe by merging two seperate dataframes. The data frames share a common key and some common columns. This common columns also contain some but not all of the same values. I would like to remove the duplicate values and keep both values are that different in each cell. My data looks something like this:

Left:

key1  valueZ  valueX  valueY
 A    bob     1       4
 B    jes     8       5
 C    joe     3       6

Right:

key1  valueZ  valueX  valueY  valueK
 A    sam     7       4       hill town
 B    beth    8       11      market
 C    joe     9       12      mall

The expected output would be:

key1    valueZ     valueX     valueY   valueK 
A       bob/sam      1/7       4       hill town  
B       jes/beth     8         5/11    market       
C       joe          3/9       6/12    mall

Nathan Clement · Accepted Answer

You will need to do this in a few steps.

Here is my setup for reference:

import pandas as pd
# define Data Frames
left = pd.DataFrame({
    'key1': ['A', 'B', 'C'],
    'valueZ': ['bob', 'jes', 'joe'],
    'valueX': [1, 8, 3],
    'valueY': [4, 5, 6]
})
right = pd.DataFrame({
    'key1': ['A', 'B', 'C'],
    'valueZ': ['sam', 'beth', 'joe'],
    'valueX': [7, 8, 9],
    'valueY': [4, 11, 12],
    'valueK': ['hill town', 'market', 'mall']
})

Now I have two DataFrame objects. They are left and right and match your example.

In order to combine how you want, I will need to know which columns are in common between the two Data Frames, as well as the final list of columns. I also define the key column here for ease of configuration. You can do that like this:

# determine important columns
keyCol = 'key1'
commonCols = list(set(left.columns & right.columns))
finalCols = list(set(left.columns | right.columns))
print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))

Which gives:

Common = ['valueZ', 'valueX', 'valueY', 'key1'], Final = ['valueZ', 'key1', 'valueK', 'valueX', 'valueY']

Next, you will join the two Data Frames as normal, but give the columns in both Data Frames a suffix (documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)

# join dataframes with suffixes
mergeDf = left.merge(right, how='left', on=keyCol, suffixes=('_left', '_right'))

Finally, you will combine the common columns using whatever logic you desire. Once they are combined, you can remove the suffixed columns from your Data Frame. Example below. You can do this in more efficient ways, but I wanted to break it down for clarity.

# combine the common columns
for col in commonCols:
    if col != keyCol:
        for i, row in mergeDf.iterrows():
            leftVal = str(row[col + '_left'])
            rightVal = str(row[col + '_right'])
            print(leftVal + ',' + rightVal)
            if leftVal == rightVal:
                mergeDf.loc[i, col] = leftVal
            else:
                mergeDf.loc[i, col] = leftVal + '/' + rightVal

# only use the finalCols
mergeDf = mergeDf[finalCols]

This gives:

     valueZ key1     valueK valueX valueY
0   bob/sam    A  hill town    1/7      4
1  jes/beth    B     market      8   5/11
2       joe    C       mall    3/9   6/12

Merging two dataframes without creating suffix

Answers (1)

Related Questions