Reputation: 23
i got the following problem: Got 2 pandas dataframes, like :
importdf:
System Email
1 Basic [email protected]
2 Basic [email protected]
3 Backend [email protected]
userdf:
System Email
1 Basic [email protected]
2 Backend [email protected]
3 Basic [email protected]
What i need to do, is to drop every row in importdf, that exist in userdf. Because i got some extra columns with different data in the original DataFrames, i can't just tell pandas to remove dublicated rows. At this point, im using the following code to deal with that:
importdf_system = importdf['System'].tolist()
importdf_emails = importdf['Email'].tolist()
userdf_system = userdf['System'].tolist()
userdf_emails = userdf['Email'].tolist()
importdf.reset_index(drop=True)
userdf.reset_index(drop=True)
counter = len(importdf)
for i in range(len(importdf)):
counter = counter - 1
print(counter)
for j in range(len(userdf)):
if "@" in str(importdf_emails[i]) and "@" in str(userdf_emails[j]) and str(importdf_emails[i]).lower() == str(userdf_emails[j]).lower():
importdf = importdf.drop([i])
Sometimes, that code works well, but needs hours to run, because the dataframes are huge. Also, sometimes, i am getting errors like KeyError: '[1782] not found in axis'
I've looked to find better way, but didn't find useful sollution. Found a way to use 1 column to look for existing data, but the problem is, that i need to remove the rows, only if the system and the emails are same. for example, if i got same email but with different system on the row, it have to stay.
Upvotes: 2
Views: 69
Reputation: 26676
To keep with new row
#Check column which is a concat of System and Email
userdf['check']=userdf.System.str.cat(userdf.Email)
importdf['check']=importdf.System.str.cat(importdf.Email)
#Use np.where to check and attribu
te
res=userdf.assign(filter=np.where([x not in y for x, y in zip(userdf.check,importdf.check)],'drop','keep')).drop(columns=['check'])
print(res)
System Email filter
1 Basic [email protected] keep
2 Backend [email protected] drop
3 Basic [email protected] drop
Upvotes: 0
Reputation: 23099
IIUC, you can do an left merge and specify values that are only in the left data frame.
new_df = (
pd.merge(df1, df2, on=["System", "Email"], how="left", indicator=True)
.query('_merge == "left_only"')
.drop("_merge", 1)
)
print(new_df)
System Email
1 Basic [email protected]
2 Backend [email protected]
pd.merge(df1,df2,on=['System','Email'],how='left',indicator=True)
System Email _merge
0 Basic [email protected] both # < we will drop this.
1 Basic [email protected] left_only
2 Backend [email protected] left_only
Upvotes: 2