Reputation: 695
So I may simply not be very informed on what a left join is, because I am getting tripped up... Here is my definition of a left join:
Includes matched records from the left and right tables and unmatched records from the LEFT table in the output table.
Here is my example:
In[87]: df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
In[88]: df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
In[89]: pd.merge(df1, df2, on='key', how='left')
Out[86]:
data1 key data2
0 0 b 1
1 0 b 3
2 1 b 1
3 1 b 3
4 2 a 0
5 2 a 2
6 3 c NaN
7 4 a 0
8 4 a 2
9 5 b 1
10 5 b 3
HOWEVER!!! I expect to get this:
data1 key data2
0 0 b 1
1 1 b 1
2 2 a 0
3 3 c NaN
4 4 a 0
5 5 b 1
My general thinking comes from Transactional data (such as accounting documents where I may be merging header and item detail, or merging lookup data).
What am I missing in my thinking or code to make this work?
PS - This comes from working through Wes McKinney's Python for Data Analysis book (page 179) - where he mentions the following:
Many-to-many merges have well-defined though not necessarily intuitive behavior. Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in the result.
I guess I am missing the point here?
Upvotes: 3
Views: 3483
Reputation: 21574
A way to get your expected output is to groupby by data1
and get the first value of each group:
g = df.groupby('data1').first().reset_index()
That returns:
data1 key data2
0 0 b 1
1 1 b 1
2 2 a 0
3 3 c NaN
4 4 a 0
5 5 b 1
Hope that helps.
Upvotes: 2