astrochun
astrochun

Reputation: 1796

Merging two pandas data frames with duplicate ON entries

I'm working with two pandas DataFrame (two separate years), that contains information about individuals. I'm using pd.merge with inner join. I can set the 'on' based on the individual names. However, I noticed that there are duplicate entries with the same name (essentially two people with the same name). To separate these records I use other fields, such as their locations. My code looks like:

df_combine = pd.merge(df_current, df_old, on=['Name', 'Department', 'College Name'])

While that works to fix this specific case, there are cases where someone with a unique name moves department. As such, this is not caught.

Is there way to first merge by name for unique cases and then for duplicate records by name to consider other fields?

For clarity, consider the records having two "Doe,John" in two different departments and one "Doe,Jane" in one department for one year and another for the second year. I would like the code to recognize the two "Doe,John" and one "Doe,Jane"

Edit, example dataframes:

Latest dataframe:

Name, Department, Field1
"Doe,John","ABC",555
"Doe,John","DEF",145
"Doe,Jane","FGH",205

Previous dataframe:

Name, Department, Field1
"Doe,John","ABC",345
"Doe,John","DEF",140
"Doe,Jane","ABC",200

Note the change in Jane Doe's department.

Expected result:

Name, Department, Field1_A, Field1_B
"Doe,John","ABC",555,345
"Doe,John","DEF",145,140
"Doe,Jane","FGH",205,200

Upvotes: 0

Views: 329

Answers (1)

Carmoreno
Carmoreno

Reputation: 1319

You can use pd.merge with how param seted as 'left', it's important the first dataframe is the latest in order to conserve the data newest. Note also, I haven't used Field1 as key in on param because I want to conserve both columns in both dataframes.

#Sample data
df_lastest = pd.DataFrame(columns=['Name', 'Department', 'Field1'], data=[["Doe,John","ABC",555], ["Doe,John","DEF",145], ["Doe,Jane","FGH",205]])  
df_previous = pd.DataFrame(columns=['Name', 'Department', 'Field1'], data=[["Doe,John","ABC",345], ["Doe,John","DEF",140], ["Doe,Jane","ABC",200]])

df_merge = pd.merge(df_lastest, df_previous, how='left', on=['Name', 'Department'], suffixes=('_A', '_B'))
df_merge['Field1_B'].fillna(df_previous['Field1'], inplace=True)

print(df_merge)

Output:

     Name   Department  Field1_A    Field1_B
0   Doe,John    ABC     555           345
1   Doe,John    DEF     145           140
2   Doe,Jane    FGH     205           200

Screenshot using Google Colab

Upvotes: 1

Related Questions