Reputation: 8628
When I merge two simple dataframes, then everything works fine. But when I apply the same code to my real dataframes, then the merging does not work correctly:
I want to merge df1
and df2
on column A
using left joining.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3','A4','A5'],
'C': ['C0', 'C1', 'C2', 'C3','C4','C5'],
'D': ['D0', 'D1', 'D2', 'D3','D4','A5']})
result = pd.merge(df1, df2[["A","C"]], how='left', on='A')
In this case the result is correct (the number of rows in result
is the same as df1
).
However when I run the same code on my real data, the number of rows in result
is much larger than df1
and is more similar to df2
.
result = pd.merge(df1, df2[["ID","EVENT"]], how='left', on='ID')
The field ID
is of type String (astype(str)
).
What might be the reason on this? I cannot post here the real dataset, but maybe some indications still might be done based on my explanation. Thanks.
UDPATE:
I checked the dataframe result
and I can see many duplicated rows having the same ID
. Why?
Upvotes: 2
Views: 136
Reputation: 16251
See this slightly modified example (I modified the last two values in column A
in df2
):
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3','A0','A0'],
'C': ['C0', 'C1', 'C2', 'C3','C4','C5'],
'D': ['D0', 'D1', 'D2', 'D3','D4','A5']})
result = pd.merge(df1, df2[["A","C"]], how='left', on='A')
Output:
A B C
0 A0 B0 C0
1 A0 B0 C4
2 A0 B0 C5
3 A1 B1 C1
4 A2 B2 C2
5 A3 B3 C3
There is one A0
row for each A0
in df2
. This is also what is happening with your data.
Upvotes: 1