Reputation: 1796
I'm working with two pandas
DataFrame (two separate years), that contains information about individuals. I'm using pd.merge
with inner join. I can set the 'on' based on the individual names. However, I noticed that there are duplicate entries with the same name (essentially two people with the same name). To separate these records I use other fields, such as their locations. My code looks like:
df_combine = pd.merge(df_current, df_old, on=['Name', 'Department', 'College Name'])
While that works to fix this specific case, there are cases where someone with a unique name moves department. As such, this is not caught.
Is there way to first merge by name for unique cases and then for duplicate records by name to consider other fields?
For clarity, consider the records having two "Doe,John" in two different departments and one "Doe,Jane" in one department for one year and another for the second year. I would like the code to recognize the two "Doe,John" and one "Doe,Jane"
Edit, example dataframes:
Latest dataframe:
Name, Department, Field1
"Doe,John","ABC",555
"Doe,John","DEF",145
"Doe,Jane","FGH",205
Previous dataframe:
Name, Department, Field1
"Doe,John","ABC",345
"Doe,John","DEF",140
"Doe,Jane","ABC",200
Note the change in Jane Doe's department.
Expected result:
Name, Department, Field1_A, Field1_B
"Doe,John","ABC",555,345
"Doe,John","DEF",145,140
"Doe,Jane","FGH",205,200
Upvotes: 0
Views: 329
Reputation: 1319
You can use pd.merge
with how
param seted as 'left'
, it's important the first dataframe is the latest in order to conserve the data newest. Note also, I haven't used Field1
as key in on
param because I want to conserve both columns in both dataframes.
#Sample data
df_lastest = pd.DataFrame(columns=['Name', 'Department', 'Field1'], data=[["Doe,John","ABC",555], ["Doe,John","DEF",145], ["Doe,Jane","FGH",205]])
df_previous = pd.DataFrame(columns=['Name', 'Department', 'Field1'], data=[["Doe,John","ABC",345], ["Doe,John","DEF",140], ["Doe,Jane","ABC",200]])
df_merge = pd.merge(df_lastest, df_previous, how='left', on=['Name', 'Department'], suffixes=('_A', '_B'))
df_merge['Field1_B'].fillna(df_previous['Field1'], inplace=True)
print(df_merge)
Output:
Name Department Field1_A Field1_B
0 Doe,John ABC 555 345
1 Doe,John DEF 145 140
2 Doe,Jane FGH 205 200
Upvotes: 1