Dinosaurius
Dinosaurius

Reputation: 8628

Left merging does not work

When I merge two simple dataframes, then everything works fine. But when I apply the same code to my real dataframes, then the merging does not work correctly:

I want to merge df1 and df2 on column A using left joining.

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3','A4','A5'],
                    'C': ['C0', 'C1', 'C2', 'C3','C4','C5'],
                    'D': ['D0', 'D1', 'D2', 'D3','D4','A5']})

result = pd.merge(df1, df2[["A","C"]], how='left', on='A')

In this case the result is correct (the number of rows in result is the same as df1).

However when I run the same code on my real data, the number of rows in result is much larger than df1 and is more similar to df2.

result = pd.merge(df1, df2[["ID","EVENT"]], how='left', on='ID')

The field ID is of type String (astype(str)).

What might be the reason on this? I cannot post here the real dataset, but maybe some indications still might be done based on my explanation. Thanks.

UDPATE:

I checked the dataframe result and I can see many duplicated rows having the same ID. Why?

Upvotes: 2

Views: 136

Answers (1)

IanS
IanS

Reputation: 16251

See this slightly modified example (I modified the last two values in column A in df2):

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3','A0','A0'],
                    'C': ['C0', 'C1', 'C2', 'C3','C4','C5'],
                    'D': ['D0', 'D1', 'D2', 'D3','D4','A5']})

result = pd.merge(df1, df2[["A","C"]], how='left', on='A')

Output:

    A   B   C
0  A0  B0  C0
1  A0  B0  C4
2  A0  B0  C5
3  A1  B1  C1
4  A2  B2  C2
5  A3  B3  C3

There is one A0 row for each A0 in df2. This is also what is happening with your data.

Upvotes: 1

Related Questions