Jo T
Jo T

Reputation: 21

Merging dataframes with two key columns and repeated keys occurences (pandas)

I am trying to merge two datasets with two key columns, and repeated occurrences of the keys, however, I cannot obtain the result I want. to simplify I have here an example :

Dataframe A

   a  b    x
0  0  0  100
1  0  1  101
2  0  2  102
3  1  1  103

Dataframe B

   a  b  y
0  0  0  C
1  0  1  A
2  0  1  B
3  1  1  C
4  1  1  D

I'm trying a left join merge operation on the columns 'a' and 'b':

C=A.merge(B,how='left',on=['a','b'])

The output is the

Dataframe C

   a  b    x    y
0  0  0  100  NaN
1  0  1  101  NaN
2  0  2  102  NaN
3  1  1  103  NaN

What I really want is :

Dataframe C

   a  b    x    y
0  0  0  100  C
1  0  1  101  A
2  0  1  101  B
3  0  2  102  NaN
4  1  1  103  C
5  1  1  103  D

I tried the 4 different options of "how=" parameters, nothing gives me that result.

Notice that on Dataiku, using only a left join I obtain this result. I searched for a solution without success. I hope that you can help me to find it. Thanks in advance.

Jo

Edit1: The "outer" option :

C=pd.merge(A,B,how='outer',on=['a','b'])

C
Out[241]: 
     a    b      x    y
0  0.0  0.0  100.0  NaN
1  0.0  1.0  101.0  NaN
2  0.0  2.0  102.0  NaN
3  1.0  1.0  103.0  NaN
4  0.0  0.0    NaN    C
5  0.0  1.0    NaN    A
6  0.0  1.0    NaN    B
7  1.0  1.0    NaN    C
8  1.0  1.0    NaN    D

Edit2: Adding details about dataframes

A.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
a    4 non-null int32
b    4 non-null int32
x    4 non-null int32
dtypes: int32(3)
memory usage: 128.0 bytes

B.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
a    5 non-null object
b    5 non-null object
y    5 non-null object
dtypes: object(3)
memory usage: 200.0+ bytes

Upvotes: 2

Views: 3103

Answers (2)

DSM
DSM

Reputation: 353059

Your .info() output reveals the problem: your a and b columns aren't of the same types. In A, they're int32, but in B, they're object (almost certainly strings). For example, if they're both ints:

In [32]: A.merge(B, how='left')
Out[32]: 
   a  b    x    y
0  0  0  100    C
1  0  1  101    A
2  0  1  101    B
3  0  2  102  NaN
4  1  1  103    C
5  1  1  103    D

In [33]: A.merge(B, how='outer')
Out[33]: 
   a  b    x    y
0  0  0  100    C
1  0  1  101    A
2  0  1  101    B
3  0  2  102  NaN
4  1  1  103    C
5  1  1  103    D

but if B contains strings, we reproduce your problems:

In [35]: A.merge(B.astype(str), how='left')
Out[35]: 
   a  b    x    y
0  0  0  100  NaN
1  0  1  101  NaN
2  0  2  102  NaN
3  1  1  103  NaN

In [36]: A.merge(B.astype(str), how='outer')
Out[36]: 
     a    b      x    y
0  0.0  0.0  100.0  NaN
1  0.0  1.0  101.0  NaN
2  0.0  2.0  102.0  NaN
3  1.0  1.0  103.0  NaN
4  0.0  0.0    NaN    C
5  0.0  1.0    NaN    A
6  0.0  1.0    NaN    B
7  1.0  1.0    NaN    C
8  1.0  1.0    NaN    D

Figure out why your B contains strings, in case there's a problem upstream, and then either fix it there or coerce back to integers later.

Upvotes: 3

Zero
Zero

Reputation: 76917

Use outer join

In [1266]: df1.merge(df2, how='outer', on=['a','b'])
Out[1266]:
   a  b    x    y
0  0  0  100    C
1  0  1  101    A
2  0  1  101    B
3  0  2  102  NaN
4  1  1  103    C
5  1  1  103    D

Details

In [1267]: df1
Out[1267]:
   a  b    x
0  0  0  100
1  0  1  101
2  0  2  102
3  1  1  103

In [1268]: df2
Out[1268]:
   a  b  y
0  0  0  C
1  0  1  A
2  0  1  B
3  1  1  C
4  1  1  D

Upvotes: 2

Related Questions