Reputation: 579
I have two dataframes: 1 major df and 1 with rows that I want to delete in the major one (dfmatch). The major df has more columns than the dfmatch.
I only want to delete the rows in major df if column1, column2 AND column3 equals with the value in the corresponinding columns of dfmatch.
Column extra1 and extra2 should be available in dfnew as well.
My current script only shows the column headers instead of the remaining rows:
file = 'testdf.csv'
colnames=['column1', 'column2', 'column3', 'extra1', 'extra2']
df = pd.read_csv(file, names=colnames, header=None)
file = 'testdfmatch.csv'
colnames=['column1', 'column2', 'column3']
dfmatch = pd.read_csv(file, names=colnames, header=None)
dfnew = pd.concat([dfmatch,df,df], sort=False).drop_duplicates(['column1', 'column2', 'column3'], keep=False)
Upvotes: 1
Views: 508
Reputation: 11
The above code does what you want.
dfnew=df.append(dfmatch,ignore_index=True)
defnew.drop_duplicates(subset=['column1', 'column2', 'column3'],
keep = 'first', inplace = True)
It adds dfmatch below df creating dfnew. Then it removes the duplicate rows only using column1, 2 and 3 as a subset. It keeps only the first occurrence that corresponds to the initial rows from df which include extra1 and extra2.
I wouldn't suggest though using float values as a subset due to the float precision handling in python. Rows with NaN on extra1 and extra2 indicate that were originally on dfmatch.
Upvotes: 0
Reputation: 26676
Sample data would have been useful. Lets try pd.merge
, indicator=
dfnew = pd.merge(df, dfmatch, how='left', indicator='Exist')
dfnew = dfnew .loc[dfnew ['Exist'] != 'both']
dfnew.drop(columns=['Exist'], inplace=True)
print(dfnew)
Upvotes: 2