Isaaclele

Reputation: 39

How to compare two dataframes with different indices, and print out the duplicate rows?

I am attempting to compare two dataframes by their respective UniqueID columns. The code that builds the two dataframes is shown below.

import pandas as pd
import numpy as np

# Define first dataframe
list1 = {'UniqueID': [13579, 24680, 54678, 1169780, 1195847, 23572],
        'Name': ['Joe', 'Pete', 'Jessica', 'Jackson', 'Griffin', 'Katie'],
        'Level': ['Beginner', 'Beginner', 'Intermediate', 'Advanced', 'Intermediate', 'Advanced']}
df1 = pd.DataFrame(list1, columns=['UniqueID','Name','Level'])

# Define second dataframe
list2 = {'UniqueID': (88922,13579, 24680, 54678, 1169780, 1195847, 23572, 54895, 478952, 45921),
        'Name': ('Zain','Joe', 'Pete', 'Jessica','Griffin','Jackson','Katie', 'Gaby', 'Haley', 'Caden'),
        'Level': ('Beginner', 'Intermediate', 'Intermediate', 'Advanced', 'Intermediate','Advanced','Advanced',
                  'Beginner', 'Intermediate', 'Novice')}
df2 = pd.DataFrame(list2, columns=['UniqueID','Name','Level'])

As you can see above, the two dataframes have different lengths and therefore different indices. This is what leads to my problem. My process for finding the duplicates is as follows.

# Add a column that shows 'Match' where the UniqueID in df1 equals the UniqueID in df2
df1['UniqueMatch'] = np.where(df1.UniqueID == df2.UniqueID, 'Match', 'Ignore')

# Simplify the list to only display rows that are duplicates
df_match = df1[df1['UniqueMatch'] =='Match']

I run into an error whenever I try to find where the dataframes' UniqueIDs are equal to each other. The error I receive is 'ValueError: Can only compare identically-labeled Series objects', which, from my understanding, means that this kind of comparison only works if the indices of the two dataframes are identical. I figure there has to be a way around this; if not, how could you compare dataframes of different sizes?
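For reference, this is a minimal sketch of the kind of index-independent comparison I am after (I believe Series.isin compares by value rather than by index label, but I may be wrong about that):

# Sketch of a possible workaround: isin checks membership value by value,
# so the two Series do not need identical indices
df1['UniqueMatch'] = np.where(df1['UniqueID'].isin(df2['UniqueID']), 'Match', 'Ignore')
df_match = df1[df1['UniqueMatch'] == 'Match']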

Upvotes: 2

Views: 2147

Answers (2)

Corralien

Reputation: 120391

Update according to your comment:

After I find the duplicates, I would then like to iterate through each cell of Level and update df1 with the updated level listed in df2. For example, Joe goes from Beginner to Intermediate between df1 and df2. I would like to automatically update those instances.

Concatenate your two dataframes and keep the last occurrence (i.e. the values from df2) of each duplicate:

df3 = pd.concat([df1, df2], ignore_index=True) \
        .drop_duplicates(['UniqueID', 'Name'], keep='last')
>>> df3
    UniqueID     Name         Level
3    1169780  Jackson      Advanced
4    1195847  Griffin  Intermediate
6      88922     Zain      Beginner
7      13579      Joe  Intermediate  # Joe is now Intermediate
8      24680     Pete  Intermediate  # Pete is now Intermediate
9      54678  Jessica      Advanced  # Jessica is now Advanced
10   1169780  Griffin  Intermediate
11   1195847  Jackson      Advanced
12     23572    Katie      Advanced
13     54895     Gaby      Beginner
14    478952    Haley  Intermediate
15     45921    Caden        Novice
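If you then want to keep only the people that were already present in df1 (my assumption about the goal), you can additionally restrict df3 to the IDs of df1, for example:

# Sketch: keep only the rows whose UniqueID already appears in df1
df1_updated = df3[df3['UniqueID'].isin(df1['UniqueID'])].reset_index(drop=True)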

Old answer

Find duplicates with merge and query:

dup = pd.merge(df1, df2, on='UniqueID') \
        .query("(Name_x == Name_y) & (Level_x == Level_y)")
>>> dup
   UniqueID Name_x   Level_x Name_y   Level_y
5     23572  Katie  Advanced  Katie  Advanced
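If you prefer a result with a single set of columns, you could also drop the duplicated _y columns afterwards (an optional cleanup step, not part of finding the duplicates):

# Optional cleanup: keep one copy of the matched columns and restore the original names
dup = dup[['UniqueID', 'Name_x', 'Level_x']].rename(columns={'Name_x': 'Name', 'Level_x': 'Level'})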

Upvotes: 1

Piotr Żak

Reputation: 2132

df1 (6 rows) & df2 (10 rows) have different shapes - that is the reason why:

df1.equals(df2)  # False




For printing the duplicates, use this piece of code:

df_merge = pd.merge(df1, df2, on=['UniqueID','Name','Level'], how='inner')

In on you can define the columns in which you want to look for duplicates.

e.g. ['Name'] or ['Level'] - if you want duplicates only in Name, then on should contain only that column name.

This returns more duplicates, as shown in the sketch below.

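A minimal sketch of such a narrower merge (the variable name df_merge_name is mine):

# Match rows that share only the Name, regardless of UniqueID and Level
df_merge_name = pd.merge(df1, df2, on=['Name'], how='inner')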

Upvotes: 1
