Reputation: 935
I have a dataframe like this:
ID1 ID2
0 foo bar
1 fizz buzz
And another like this:
ID1 ID2 Count Code
0 abc def 1 A
1 fizz buzz 5 A
2 fizz1 buzz2 3 C
3 foo bar 6 Z
4 foo bar 6 Z
What I would like to do is filter the second dataframe where ID1 and ID2 match a row in the first dataframe, and whenever there's a match I want to remove that row from the first dataframe to avoid duplicates. This would yield a dataframe that looks like this:
ID1 ID2 Count Code
1 fizz buzz 5 A
3 foo bar 6 Z
I know I can do this by nesting for loops, stepping through all the rows, and manually removing a row from the first frame whenever I get a match but I am wondering if there is a more pythonic way to do this. I am not experienced in pandas so there may be a much cleaner way to do that I do not know about. I was previously using .isin()
but had to scrap it. Each ID pair can exist in the dataframe up to N times and I need the filtered frame to contain between 0 and N instances of a pair.
Upvotes: 11
Views: 9600
Reputation: 935
Merge was almost what I wanted, but didn't quite do the job because I have an odd set of requirements where I need to filter out some duplicates but not all the duplicates. A regular merge doesn't work because that keeps all the duplicates and drop_duplicates()
doesn't work because I need to allow some duplicates. I ended up going with the method I described in the question and nested for loops.
temp_frame = pd.DataFrame(columns.df2.columns)
for i in xrange(len(df2)):
for ii in xrange(len(df1)):
if df2['ID1'].iloc[i] == df1['ID1'].iloc[ii] and df2['ID2'].iloc[i] == df1['ID2'].iloc[ii]:
df1.drop(df1.index[ii], inplace=True)
temp_frame = temp_frame.append(df2.iloc[i], ignore_index=True)
break
df1 = temp_frame.copy()
Upvotes: 2
Reputation: 862581
Use merge
with drop_duplicates
, if only same columns for join in both df
:
df = pd.merge(df1,df2.drop_duplicates())
print (df)
ID1 ID2 Count Code
0 foo bar 6 Z
1 fizz buzz 5 A
If need check dupes only in ID
columns:
df = pd.merge(df1,df2.drop_duplicates(subset=['ID1','ID2']))
print (df)
ID1 ID2 Count Code
0 foo bar 6 Z
1 fizz buzz 5 A
If more columns are overlaping add parameter on
:
df = pd.merge(df1, df2.drop_duplicates(), on=['ID1','ID2'])
If not remove dupe rows:
df = pd.merge(df1,df2)
print (df)
ID1 ID2 Count Code
0 foo bar 6 Z
1 foo bar 6 Z
2 fizz buzz 5 A
Upvotes: 9
Reputation: 294228
Using isin
on a list of tuples
df2[
pd.Series(
list(zip(df2.ID1.values, df2.ID2.values))
).isin(list(zip(df1.ID1.values, df1.ID2.values)))
]
ID1 ID2 Count Code
1 fizz buzz 5 A
3 foo bar 6 Z
4 foo bar 6 Z
Upvotes: 4
Reputation: 323226
Or maybe try this ?
df.loc[(df.ID1.isin(df1.ID1))&(df.ID2.isin(df1.ID2)),:].drop_duplicates()
Out[224]:
ID1 ID2 Count Code
1 fizz buzz 5 A
3 foo bar 6 Z
Upvotes: 3