Reputation: 69
I want to create a new dataframe by merging two seperate dataframes. The data frames share a common key and some common columns. This common columns also contain some but not all of the same values. I would like to remove the duplicate values and keep both values are that different in each cell. My data looks something like this:
Left:
key1 valueZ valueX valueY
A bob 1 4
B jes 8 5
C joe 3 6
Right:
key1 valueZ valueX valueY valueK
A sam 7 4 hill town
B beth 8 11 market
C joe 9 12 mall
The expected output would be:
key1 valueZ valueX valueY valueK
A bob/sam 1/7 4 hill town
B jes/beth 8 5/11 market
C joe 3/9 6/12 mall
Upvotes: 1
Views: 4321
Reputation: 1203
You will need to do this in a few steps.
Here is my setup for reference:
import pandas as pd
# define Data Frames
left = pd.DataFrame({
'key1': ['A', 'B', 'C'],
'valueZ': ['bob', 'jes', 'joe'],
'valueX': [1, 8, 3],
'valueY': [4, 5, 6]
})
right = pd.DataFrame({
'key1': ['A', 'B', 'C'],
'valueZ': ['sam', 'beth', 'joe'],
'valueX': [7, 8, 9],
'valueY': [4, 11, 12],
'valueK': ['hill town', 'market', 'mall']
})
Now I have two DataFrame
objects. They are left
and right
and match your example.
In order to combine how you want, I will need to know which columns are in common between the two Data Frames, as well as the final list of columns. I also define the key column here for ease of configuration. You can do that like this:
# determine important columns
keyCol = 'key1'
commonCols = list(set(left.columns & right.columns))
finalCols = list(set(left.columns | right.columns))
print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))
Which gives:
Common = ['valueZ', 'valueX', 'valueY', 'key1'], Final = ['valueZ', 'key1', 'valueK', 'valueX', 'valueY']
Next, you will join the two Data Frames as normal, but give the columns in both Data Frames a suffix (documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
# join dataframes with suffixes
mergeDf = left.merge(right, how='left', on=keyCol, suffixes=('_left', '_right'))
Finally, you will combine the common columns using whatever logic you desire. Once they are combined, you can remove the suffixed columns from your Data Frame. Example below. You can do this in more efficient ways, but I wanted to break it down for clarity.
# combine the common columns
for col in commonCols:
if col != keyCol:
for i, row in mergeDf.iterrows():
leftVal = str(row[col + '_left'])
rightVal = str(row[col + '_right'])
print(leftVal + ',' + rightVal)
if leftVal == rightVal:
mergeDf.loc[i, col] = leftVal
else:
mergeDf.loc[i, col] = leftVal + '/' + rightVal
# only use the finalCols
mergeDf = mergeDf[finalCols]
This gives:
valueZ key1 valueK valueX valueY
0 bob/sam A hill town 1/7 4
1 jes/beth B market 8 5/11
2 joe C mall 3/9 6/12
Upvotes: 1