Ricky Kim
Ricky Kim

Reputation: 2022

Remove rows that are in another dataframe when there are duplicates

I want to remove rows that are in one dataframe, if another dataframe has the same rows. However, I don't want to remove all the rows, only the number of rows that are in the other dataframe. Refer to this example:

df1

   col1  col2
0     1    10
1     1    10
2     2    11
3     3    12
4     1    10

df2

   col1  col2
0     1    10
1     2    11
2     1    10
3     3    12
4     3    12

Desired output:

df1

   col1  col2
      1    10

Because df1 has 3 rows of 1,10 while df2 has 2 rows of 1,10 so you remove 2 from each, leaving 1 for df1. If there were 4 rows in df1, I would want two rows of 1,10 in df1 as a result. Same with df2 below:

df2

   col1  col2
      3    12

My attempt:

I was maybe thinking of counting how many duplicates are in each of the dataframe and creating new df1 and df2 by subtracting the dupe_count but wondering if there's a more efficient way.

df1g=df1.groupby(df1.columns.tolist(),as_index=False).size().reset_index().rename(columns={0:'dupe_count'})
df2g=df2.groupby(df2.columns.tolist(),as_index=False).size().reset_index().rename(columns={0:'dupe_count'})

Upvotes: 3

Views: 502

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150725

Here's another approach which use repeat:

# count of the rows
c1 = df1.groupby(['col1', 'col2']).size()
c2 = df2.groupby(['col1', 'col2']).size()

# repeat the rows by values
(c1.repeat((c1-c2).clip(0))
   .reset_index()
   .drop(0, axis=1)
)
#   col1    col2
# 0 1   10

(c2.repeat((c2-c1).clip(0))
   .reset_index()
   .drop(0, axis=1)
)
#   col1    col2
# 0 3   12

Upvotes: 1

cs95
cs95

Reputation: 402263

This is a non-trivial problem, but merge is your friend:

a, b = (df.assign(count=df.groupby([*df]).cumcount()) for df in (df1, df2))    
df1[a.merge(b, on=[*a], indicator=True, how='left').eval('_merge == "left_only"')]

   col1  col2
4     1    10

The idea here is to add a cumcount column to de-duplicate the columns (assign a unique identifier to each). We can then see which rows aren't matched on a subsequent merge.

a
   col1  col2  count
0     1    10      0
1     1    10      1
2     2    11      0
3     3    12      0
4     1    10      2

b
   col1  col2  count
0     1    10      0
1     2    11      0
2     1    10      1
3     3    12      0
4     3    12      1

a.merge(b, on=[*a], indicator=True, how='left')
   col1  col2  count     _merge
0     1    10      0       both
1     1    10      1       both
2     2    11      0       both
3     3    12      0       both
4     1    10      2  left_only

_.eval('_merge == "left_only"')
0    False
1    False
2    False
3    False
4     True
dtype: bool

If you need to get unmatched rows from both df1 and df2, use an outer merge:

out = a.merge(b, on=[*a], indicator=True, how='outer')
df1_filter = (
    out.query('_merge == "left_only"').drop(['count','_merge'], axis=1))
df2_filter = (
    out.query('_merge == "right_only"').drop(['count','_merge'], axis=1))

df1_filter
   col1  col2
4     1    10

df2_filter
   col1  col2
5     3    12

Upvotes: 3

Related Questions