Drop rows that exist in other dataframe, using 2 different columns

Question

i got the following problem: Got 2 pandas dataframes, like :

importdf:
     System   Email
 1   Basic    testimail@yahoo.com
 2   Basic    anotheremail@yahoo.com
 3   Backend  newemail@yahoo.com

userdf:
     System   Email
 1   Basic    testimail@yahoo.com
 2   Backend   anotheremail@yahoo.com
 3   Basic    newemail@yahoo.com

What i need to do, is to drop every row in importdf, that exist in userdf. Because i got some extra columns with different data in the original DataFrames, i can't just tell pandas to remove dublicated rows. At this point, im using the following code to deal with that:

importdf_system = importdf['System'].tolist()
importdf_emails = importdf['Email'].tolist()

userdf_system = userdf['System'].tolist()
userdf_emails = userdf['Email'].tolist()

importdf.reset_index(drop=True)
userdf.reset_index(drop=True)

counter = len(importdf)
for i in range(len(importdf)):
    counter = counter - 1
    print(counter)
    for j in range(len(userdf)):
        if "@" in str(importdf_emails[i]) and "@" in str(userdf_emails[j]) and str(importdf_emails[i]).lower() == str(userdf_emails[j]).lower():
            importdf = importdf.drop([i])

Sometimes, that code works well, but needs hours to run, because the dataframes are huge. Also, sometimes, i am getting errors like KeyError: '[1782] not found in axis'

I've looked to find better way, but didn't find useful sollution. Found a way to use 1 column to look for existing data, but the problem is, that i need to remove the rows, only if the system and the emails are same. for example, if i got same email but with different system on the row, it have to stay.

Umar.H · Accepted Answer

IIUC, you can do an left merge and specify values that are only in the left data frame.

new_df = (
    pd.merge(df1, df2, on=["System", "Email"], how="left", indicator=True)
    .query('_merge == "left_only"')
    .drop("_merge", 1)
)


print(new_df)

    System                   Email
1    Basic  anotheremail@yahoo.com
2  Backend      newemail@yahoo.com

Details

pd.merge(df1,df2,on=['System','Email'],how='left',indicator=True)

    System                   Email     _merge
0    Basic     testimail@yahoo.com       both # < we will drop this.
1    Basic  anotheremail@yahoo.com  left_only
2  Backend      newemail@yahoo.com  left_only

Drop rows that exist in other dataframe, using 2 different columns

Answers (2)

Details

Related Questions