Merge DataFrames on two columns

Question

This is a follow-up from this question

I have two pandas DataFrames, as follows:

print( a )

    foo   bar   let letval
9  foo1  bar1  let1      a
8  foo2  bar2  let1      b
7  foo3  bar3  let1      c
6  foo1  bar1  let2      z
5  foo2  bar2  let2      y
4  foo3  bar3  let2      x

print( b )

    foo   bar   num  numval
0  foo1  bar1  num1       1
1  foo2  bar2  num1       2
2  foo3  bar3  num1       3
3  foo1  bar1  num2       4
4  foo2  bar2  num2       5
5  foo3  bar3  num2       6

I want to merge the two of them on the columns [ 'foo', 'bar' ].

If I simply do c = pd.merge( a, b, on=['foo', 'bar'] ), I get:

prnint( c )

     foo   bar   let letval   num  numval
0   foo1  bar1  let1      a  num1       1
1   foo1  bar1  let1      a  num2       4
2   foo1  bar1  let2      z  num1       1
3   foo1  bar1  let2      z  num2       4
4   foo2  bar2  let1      b  num1       2
5   foo2  bar2  let1      b  num2       5
6   foo2  bar2  let2      y  num1       2
7   foo2  bar2  let2      y  num2       5
8   foo3  bar3  let1      c  num1       3
9   foo3  bar3  let1      c  num2       6
10  foo3  bar3  let2      x  num1       3
11  foo3  bar3  let2      x  num2       6

I would like:

print( c )

    foo   bar   let letval   num   numval
0  foo1  bar1  let1      a   num1       1
1  foo2  bar2  let1      b   num1       2
2  foo3  bar3  let1      c   num1       3
3  foo1  bar1  let2      z   num2       4
4  foo2  bar2  let2      y   num2       5
5  foo3  bar3  let2      x   num2       6

The closest I've got is:

c = pd.merge( a, b, left_index=['foo', 'bar'], right_index=['foo', 'bar'] )

What am I missing?

And why do I get c.shape = (12,6) in the first example?

Edit

Thanks to @piRSquared's answer I realized that the underlying problem is that there is not a single combination of columns to do that. Thus the merge problem, as posed before cannot be univocally solved. That said, the question is converted into a simpler one:

How to make a univocal relationship between the tables?

I solved that with a dictionary that maps the desired outputs that need to be aligned:

map_ab = { 'num1':'let1', 'num2':'let2' }
b['let'] = b.apply( lambda x: map_ab[x['num']], axis=1 )
c = pd.merge( a, b, on=['foo', 'bar', 'let'] )
print( c )

piRSquared · Accepted Answer

The reason you are getting that is because the columns you are merging on do not constitute unique combinations. For example, The first (index 0) row of a has foo1 and bar1, but so does the fourth row (index 3). Ok, that's fine, but b has the same issue. So, when you match up b's foo1 & bar1 for row indexed with 0 it matches twice. Same is true when you match foo1 & bar1 in row indexed with 3, it matches twice. So you end up with four matches for those 2 rows.

So you get

a row 0 matches with b row 0
a row 0 matches with b row 3
a row 3 matches with b row 0
a row 3 matches with b row 3

And THEN, your example does this 2 more times. 3 * 4 == 12

The only way to do this and be unambiguous is to decide on a rule on which match to take if there are more than one matches. I decided to groupby one of your other columns then take the first one. It still doesn't match your expected output but I'm proposing that you gave a bad example.

pd.merge( a, b, on=['foo', 'bar']).groupby(['foo', 'bar', 'let'], as_index=False).first()

Merge DataFrames on two columns

Answers (2)

Related Questions