Reputation: 3477
I have two data frames that I obtained from a couple of csv files. For example:
df1
data1 data2 data3
0 cow cat 53
1 girl boy 12
2 monkey island 30
3 lucas arts 14
df2
data1 data2 data3
0 girl boy 50
1 cover disc 45
2 girl boy 47
3 pen pencil 15
4 book note 30
5 lucas arts 15
df2
data1 data2 data3
0 cover disc 45
1 pen pencil 15
2 book note 30
I would like to delete from df2 those rows that have the same values of data1 and data2 from df1, so that at the end I will end up with an updated df2. So far, I have done the following:
file1="fileA.csv"
file2="fileB.csv"
df1=pd.read_csv(file1)
df2=pd.read_csv(file2)
cond = df1[['data1','data2']].isin(df2[['data1','data2']])
df2.drop(df1[cond].index,inplace=True)
df2.to_csv("fileBUpdtated.csv",index=False)
But I am not getting the results that I need, what am I missing?
Upvotes: 3
Views: 736
Reputation: 323276
Let us try
out = df2[df2[['data1', 'data2']].apply(tuple,1).isin(df1[['data1', 'data2']].apply(tuple,1))]
Upvotes: 0
Reputation: 8508
The best option I found to do this is the following:
print (df2[~df2.isin(df1[['data1','data2']])])
Here, you are searching the values in each row of df2 (superset dataframe) against the values in df1 (2 specific columns in subset dataframe). If they are not found, then you want to print those records from df2.
This one does not do a merge. Instead, it is using boolean and index.
The output of this will be:
data1 data2 data3
1 cover disc 45
3 pen pencil 15
4 book note 30
Alternate, you can do the below:
You can find the common items in both, then exclude them and print the values.
If you want to remove them from df2, then reassign it back to df2.
common = df2.merge(df1,on=['data1','data2'])
df2 = df2[(~df2.data1.isin(common.data1))&(~df2.data2.isin(common.data2))]
print (df2)
The output of this will be:
data1 data2 data3
1 cover disc 45
3 pen pencil 15
4 book note 30
Upvotes: 1
Reputation: 5036
You can append df1
to df2
, then drop_duplicates
and keep the rows that are left and came from df2
df2.append(df1, ignore_index=True).drop_duplicates(['data1','data2'], keep=False).loc[:df2.index.max()]
Out:
data1 data2 data3
1 cover disc 45
3 pen pencil 15
4 book note 30
Upvotes: 2
Reputation: 30920
You need DataFrame.merge
cols = ['data1', 'data2']
df2 = df2.merge(df1[cols], on=cols, how='left', indicator=True)\
.loc[lambda x: x._merge.ne('both')].drop('_merge', axis=1)
print(df2)
data1 data2 data3
1 cover disc 45
3 pen pencil 15
4 book note 30
Upvotes: 6