Reputation: 3497
This is a follow-up from this question
I have two pandas DataFrames, as follows:
print( a )
foo bar let letval
9 foo1 bar1 let1 a
8 foo2 bar2 let1 b
7 foo3 bar3 let1 c
6 foo1 bar1 let2 z
5 foo2 bar2 let2 y
4 foo3 bar3 let2 x
print( b )
foo bar num numval
0 foo1 bar1 num1 1
1 foo2 bar2 num1 2
2 foo3 bar3 num1 3
3 foo1 bar1 num2 4
4 foo2 bar2 num2 5
5 foo3 bar3 num2 6
I want to merge
the two of them on the columns [ 'foo', 'bar' ]
.
If I simply do c = pd.merge( a, b, on=['foo', 'bar'] )
, I get:
prnint( c )
foo bar let letval num numval
0 foo1 bar1 let1 a num1 1
1 foo1 bar1 let1 a num2 4
2 foo1 bar1 let2 z num1 1
3 foo1 bar1 let2 z num2 4
4 foo2 bar2 let1 b num1 2
5 foo2 bar2 let1 b num2 5
6 foo2 bar2 let2 y num1 2
7 foo2 bar2 let2 y num2 5
8 foo3 bar3 let1 c num1 3
9 foo3 bar3 let1 c num2 6
10 foo3 bar3 let2 x num1 3
11 foo3 bar3 let2 x num2 6
I would like:
print( c )
foo bar let letval num numval
0 foo1 bar1 let1 a num1 1
1 foo2 bar2 let1 b num1 2
2 foo3 bar3 let1 c num1 3
3 foo1 bar1 let2 z num2 4
4 foo2 bar2 let2 y num2 5
5 foo3 bar3 let2 x num2 6
The closest I've got is:
c = pd.merge( a, b, left_index=['foo', 'bar'], right_index=['foo', 'bar'] )
What am I missing?
And why do I get c.shape = (12,6)
in the first example?
Edit
Thanks to @piRSquared's answer I realized that the underlying problem is that there is not a single combination of columns to do that. Thus the merge problem, as posed before cannot be univocally solved. That said, the question is converted into a simpler one:
How to make a univocal relationship between the tables?
I solved that with a dictionary that maps the desired outputs that need to be aligned:
map_ab = { 'num1':'let1', 'num2':'let2' }
b['let'] = b.apply( lambda x: map_ab[x['num']], axis=1 )
c = pd.merge( a, b, on=['foo', 'bar', 'let'] )
print( c )
Upvotes: 3
Views: 275
Reputation: 294506
The reason you are getting that is because the columns you are merging on do not constitute unique combinations. For example, The first (index 0) row of a
has foo1
and bar1
, but so does the fourth row (index 3). Ok, that's fine, but b
has the same issue. So, when you match up b
's foo1
& bar1
for row indexed with 0
it matches twice. Same is true when you match foo1
& bar1
in row indexed with 3
, it matches twice. So you end up with four matches for those 2 rows.
So you get
a
row 0 matches with b
row 0a
row 0 matches with b
row 3a
row 3 matches with b
row 0a
row 3 matches with b
row 3And THEN, your example does this 2 more times. 3 * 4 == 12
The only way to do this and be unambiguous is to decide on a rule on which match to take if there are more than one matches. I decided to groupby one of your other columns then take the first one. It still doesn't match your expected output but I'm proposing that you gave a bad example.
pd.merge( a, b, on=['foo', 'bar']).groupby(['foo', 'bar', 'let'], as_index=False).first()
Upvotes: 2
Reputation: 7848
you can use combine_first:
In[21]:a.combine_first(b)
Out[21]:
bar foo let letval num numval
0 bar1 foo1 let1 a num1 1
1 bar2 foo2 let1 b num1 2
2 bar3 foo3 let1 c num1 3
3 bar1 foo1 let2 z num2 4
4 bar2 foo2 let2 y num2 5
5 bar3 foo3 let2 x num2 6
In the first example you are doing inner join
which returns all rows if bar
& foo
are equal in a,b
.
Upvotes: 1