Reputation: 21
I am trying to merge two datasets with two key columns, and repeated occurrences of the keys, however, I cannot obtain the result I want. to simplify I have here an example :
Dataframe A
a b x
0 0 0 100
1 0 1 101
2 0 2 102
3 1 1 103
Dataframe B
a b y
0 0 0 C
1 0 1 A
2 0 1 B
3 1 1 C
4 1 1 D
I'm trying a left join merge operation on the columns 'a' and 'b':
C=A.merge(B,how='left',on=['a','b'])
The output is the
Dataframe C
a b x y
0 0 0 100 NaN
1 0 1 101 NaN
2 0 2 102 NaN
3 1 1 103 NaN
What I really want is :
Dataframe C
a b x y
0 0 0 100 C
1 0 1 101 A
2 0 1 101 B
3 0 2 102 NaN
4 1 1 103 C
5 1 1 103 D
I tried the 4 different options of "how=" parameters, nothing gives me that result.
Notice that on Dataiku, using only a left join I obtain this result. I searched for a solution without success. I hope that you can help me to find it. Thanks in advance.
Jo
Edit1: The "outer" option :
C=pd.merge(A,B,how='outer',on=['a','b'])
C
Out[241]:
a b x y
0 0.0 0.0 100.0 NaN
1 0.0 1.0 101.0 NaN
2 0.0 2.0 102.0 NaN
3 1.0 1.0 103.0 NaN
4 0.0 0.0 NaN C
5 0.0 1.0 NaN A
6 0.0 1.0 NaN B
7 1.0 1.0 NaN C
8 1.0 1.0 NaN D
Edit2: Adding details about dataframes
A.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
a 4 non-null int32
b 4 non-null int32
x 4 non-null int32
dtypes: int32(3)
memory usage: 128.0 bytes
B.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
a 5 non-null object
b 5 non-null object
y 5 non-null object
dtypes: object(3)
memory usage: 200.0+ bytes
Upvotes: 2
Views: 3103
Reputation: 353059
Your .info()
output reveals the problem: your a and b columns aren't of the same types. In A, they're int32, but in B, they're object (almost certainly strings). For example, if they're both ints:
In [32]: A.merge(B, how='left')
Out[32]:
a b x y
0 0 0 100 C
1 0 1 101 A
2 0 1 101 B
3 0 2 102 NaN
4 1 1 103 C
5 1 1 103 D
In [33]: A.merge(B, how='outer')
Out[33]:
a b x y
0 0 0 100 C
1 0 1 101 A
2 0 1 101 B
3 0 2 102 NaN
4 1 1 103 C
5 1 1 103 D
but if B contains strings, we reproduce your problems:
In [35]: A.merge(B.astype(str), how='left')
Out[35]:
a b x y
0 0 0 100 NaN
1 0 1 101 NaN
2 0 2 102 NaN
3 1 1 103 NaN
In [36]: A.merge(B.astype(str), how='outer')
Out[36]:
a b x y
0 0.0 0.0 100.0 NaN
1 0.0 1.0 101.0 NaN
2 0.0 2.0 102.0 NaN
3 1.0 1.0 103.0 NaN
4 0.0 0.0 NaN C
5 0.0 1.0 NaN A
6 0.0 1.0 NaN B
7 1.0 1.0 NaN C
8 1.0 1.0 NaN D
Figure out why your B contains strings, in case there's a problem upstream, and then either fix it there or coerce back to integers later.
Upvotes: 3
Reputation: 76917
Use outer
join
In [1266]: df1.merge(df2, how='outer', on=['a','b'])
Out[1266]:
a b x y
0 0 0 100 C
1 0 1 101 A
2 0 1 101 B
3 0 2 102 NaN
4 1 1 103 C
5 1 1 103 D
Details
In [1267]: df1
Out[1267]:
a b x
0 0 0 100
1 0 1 101
2 0 2 102
3 1 1 103
In [1268]: df2
Out[1268]:
a b y
0 0 0 C
1 0 1 A
2 0 1 B
3 1 1 C
4 1 1 D
Upvotes: 2