Reputation: 39
I am attempting to compare two dataframes by their respective UniqueID column. The code for the following dataframes can be seen below.
import pandas as pd
import numpy as np

# Define first dataframe
list1 = {'UniqueID': [13579, 24680, 54678, 1169780, 1195847, 23572],
         'Name': ['Joe', 'Pete', 'Jessica', 'Jackson', 'Griffin', 'Katie'],
         'Level': ['Beginner', 'Beginner', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced']}
df1 = pd.DataFrame(list1, columns=['UniqueID','Name','Level'])
# Define second dataframe
list2 = {'UniqueID': (88922, 13579, 24680, 54678, 1169780, 1195847, 23572, 54895, 478952, 45921),
         'Name': ('Zain', 'Joe', 'Pete', 'Jessica', 'Griffin', 'Jackson', 'Katie', 'Gaby', 'Haley', 'Caden'),
         'Level': ('Beginner', 'Intermediate', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced', 'Advanced',
                   'Beginner', 'Intermediate', 'Novice')}
df2 = pd.DataFrame(list2, columns=['UniqueID','Name','Level'])
It can be seen above that the dataframes differ in length, and therefore in their index. This is what leads to my next problem. My process to find the duplicates goes as follows.
# Define a new column which displays 'Match' iff the UniqueID of the first dataframe equals that of the second
df1['UniqueMatch'] = np.where(df1.UniqueID == df2.UniqueID, 'Match', 'Ignore')
# Simplify the list to only display rows that are duplicates
df_match = df1[df1['UniqueMatch'] =='Match']
I run into an error whenever I try to find where the dataframes' UniqueIDs are equal to each other. The error I receive is 'ValueError: Can only compare identically-labeled Series objects', which, from my understanding, means that the comparison can only be performed if the indices of the two dataframes are equal to each other. I figure there has to be a way around this; if not, then how could you compare dataframes of different sizes?
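For reference, the index-alignment error can be sidestepped by comparing values instead of positions. This is a sketch using `Series.isin`, which tests membership by value and so does not care that the two frames have different indices:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'UniqueID': [13579, 24680, 54678, 1169780, 1195847, 23572],
    'Name': ['Joe', 'Pete', 'Jessica', 'Jackson', 'Griffin', 'Katie'],
    'Level': ['Beginner', 'Beginner', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced']})
df2 = pd.DataFrame({
    'UniqueID': [88922, 13579, 24680, 54678, 1169780, 1195847, 23572, 54895, 478952, 45921],
    'Name': ['Zain', 'Joe', 'Pete', 'Jessica', 'Griffin', 'Jackson', 'Katie', 'Gaby', 'Haley', 'Caden'],
    'Level': ['Beginner', 'Intermediate', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced', 'Advanced',
              'Beginner', 'Intermediate', 'Novice']})

# isin checks each df1 UniqueID against the set of df2 UniqueIDs,
# so the two Series need not be the same length or share an index
df1['UniqueMatch'] = np.where(df1['UniqueID'].isin(df2['UniqueID']), 'Match', 'Ignore')
df_match = df1[df1['UniqueMatch'] == 'Match']
```

With the sample data every UniqueID in df1 also appears in df2, so all six rows end up in df_match.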
Upvotes: 2
Views: 2147
Reputation: 120391
Update according to your comment:
After I find the duplicates, I would then like to iterate through each cell of Level and update df1 from the updated Level listed in df2. For example, Joe goes from Beginner to Intermediate from df1 to df2. I would like to auto-update those instances.
Concatenate your two dataframes and keep the last values (from df2) for duplicates:
df3 = pd.concat([df1, df2], ignore_index=True) \
.drop_duplicates(['UniqueID', 'Name'], keep='last')
>>> df3
UniqueID Name Level
3 1169780 Jackson Advanced
4 1195847 Griffin Intermediate
6 88922 Zain Beginner
7 13579 Joe Intermediate # Joe is now Intermediate
8 24680 Pete Intermediate # Pete is now Intermediate
9 54678 Jessica Advanced # Jessica is now Advanced
10 1169780 Griffin Intermediate
11 1195847 Jackson Advanced
12 23572 Katie Advanced
13 54895 Gaby Beginner
14 478952 Haley Intermediate
15 45921 Caden Novice
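If you only want to carry the updated Level values from df2 back into df1, rather than build a combined frame, another sketch is a lookup keyed on UniqueID via `Series.map` (`level_map` is a name introduced here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    'UniqueID': [13579, 24680, 54678, 1169780, 1195847, 23572],
    'Name': ['Joe', 'Pete', 'Jessica', 'Jackson', 'Griffin', 'Katie'],
    'Level': ['Beginner', 'Beginner', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced']})
df2 = pd.DataFrame({
    'UniqueID': [88922, 13579, 24680, 54678, 1169780, 1195847, 23572, 54895, 478952, 45921],
    'Name': ['Zain', 'Joe', 'Pete', 'Jessica', 'Griffin', 'Jackson', 'Katie', 'Gaby', 'Haley', 'Caden'],
    'Level': ['Beginner', 'Intermediate', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced', 'Advanced',
              'Beginner', 'Intermediate', 'Novice']})

# Build a UniqueID -> Level lookup from df2, then overwrite df1's Level
# wherever the ID is found; rows with no match keep their old Level
level_map = df2.set_index('UniqueID')['Level']
df1['Level'] = df1['UniqueID'].map(level_map).fillna(df1['Level'])
```

Note this matches on UniqueID alone: a df1 row whose ID exists in df2 takes df2's Level even if the Name attached to that ID differs between the frames.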
Old answer
Find duplicates with merge and query:
dup = pd.merge(df1, df2, on='UniqueID') \
.query("(Name_x == Name_y) & (Level_x == Level_y)")
>>> dup
UniqueID Name_x Level_x Name_y Level_y
5 23572 Katie Advanced Katie Advanced
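Flipping the Level condition in the same merge lists the rows whose Level changed between the two frames instead (a sketch built on the same sample frames):

```python
import pandas as pd

df1 = pd.DataFrame({
    'UniqueID': [13579, 24680, 54678, 1169780, 1195847, 23572],
    'Name': ['Joe', 'Pete', 'Jessica', 'Jackson', 'Griffin', 'Katie'],
    'Level': ['Beginner', 'Beginner', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced']})
df2 = pd.DataFrame({
    'UniqueID': [88922, 13579, 24680, 54678, 1169780, 1195847, 23572, 54895, 478952, 45921],
    'Name': ['Zain', 'Joe', 'Pete', 'Jessica', 'Griffin', 'Jackson', 'Katie', 'Gaby', 'Haley', 'Caden'],
    'Level': ['Beginner', 'Intermediate', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced', 'Advanced',
              'Beginner', 'Intermediate', 'Novice']})

# Same UniqueID and Name, but a different Level: these are the updated rows
changed = pd.merge(df1, df2, on='UniqueID') \
    .query("(Name_x == Name_y) & (Level_x != Level_y)")
```

With the sample data this returns Joe, Pete and Jessica, whose Levels were raised in df2.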
Upvotes: 1
Reputation: 2132
df1 (6 rows) and df2 (10 rows) have different shapes - that is the reason:
df1.equals(df2)
-> False
To print the duplicates, use this piece of code:
df_merge = pd.merge(df1, df2, on=['UniqueID','Name','Level'], how='inner')
In on you can define the columns on which you want to match duplicates, e.g. ['Name'] or ['Level']. If you want duplicates only by Name, then on should contain only that column name. Matching on fewer columns returns more duplicates.
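For instance, merging on Name alone (one possible choice of columns) matches every shared name regardless of UniqueID and Level, so it returns more rows than the three-column merge, which only matches exact duplicates:

```python
import pandas as pd

df1 = pd.DataFrame({
    'UniqueID': [13579, 24680, 54678, 1169780, 1195847, 23572],
    'Name': ['Joe', 'Pete', 'Jessica', 'Jackson', 'Griffin', 'Katie'],
    'Level': ['Beginner', 'Beginner', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced']})
df2 = pd.DataFrame({
    'UniqueID': [88922, 13579, 24680, 54678, 1169780, 1195847, 23572, 54895, 478952, 45921],
    'Name': ['Zain', 'Joe', 'Pete', 'Jessica', 'Griffin', 'Jackson', 'Katie', 'Gaby', 'Haley', 'Caden'],
    'Level': ['Beginner', 'Intermediate', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced', 'Advanced',
              'Beginner', 'Intermediate', 'Novice']})

# Exact duplicates across all three columns: only Katie's row matches
all_cols = pd.merge(df1, df2, on=['UniqueID', 'Name', 'Level'], how='inner')
# Matching on Name alone: every name shared by both frames matches
by_name = pd.merge(df1, df2, on=['Name'], how='inner')
```

With the sample data, all_cols contains one row (Katie) while by_name contains six, one per shared name.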
Upvotes: 1