Reputation: 2022
I want to remove rows that are in one dataframe, if another dataframe has the same rows. However, I don't want to remove all the rows, only the number of rows that are in the other dataframe. Refer to this example:
df1
col1 col2
0 1 10
1 1 10
2 2 11
3 3 12
4 1 10
df2
col1 col2
0 1 10
1 2 11
2 1 10
3 3 12
4 3 12
Desired output:
df1
col1 col2
1 10
Because df1
has 3 rows of 1,10 while df2
has 2 rows of 1,10 so you remove 2 from each, leaving 1 for df1
. If there were 4 rows in df1
, I would want two rows of 1,10 in df1
as a result. Same with df2
below:
df2
col1 col2
3 12
My attempt:
I was maybe thinking of counting how many duplicates are in each of the dataframe and creating new df1
and df2
by subtracting the dupe_count
but wondering if there's a more efficient way.
df1g=df1.groupby(df1.columns.tolist(),as_index=False).size().reset_index().rename(columns={0:'dupe_count'})
df2g=df2.groupby(df2.columns.tolist(),as_index=False).size().reset_index().rename(columns={0:'dupe_count'})
Upvotes: 3
Views: 502
Reputation: 150725
Here's another approach which use repeat
:
# count of the rows
c1 = df1.groupby(['col1', 'col2']).size()
c2 = df2.groupby(['col1', 'col2']).size()
# repeat the rows by values
(c1.repeat((c1-c2).clip(0))
.reset_index()
.drop(0, axis=1)
)
# col1 col2
# 0 1 10
(c2.repeat((c2-c1).clip(0))
.reset_index()
.drop(0, axis=1)
)
# col1 col2
# 0 3 12
Upvotes: 1
Reputation: 402263
This is a non-trivial problem, but merge
is your friend:
a, b = (df.assign(count=df.groupby([*df]).cumcount()) for df in (df1, df2))
df1[a.merge(b, on=[*a], indicator=True, how='left').eval('_merge == "left_only"')]
col1 col2
4 1 10
The idea here is to add a cumcount
column to de-duplicate the columns (assign a unique identifier to each). We can then see which rows aren't matched on a subsequent merge.
a
col1 col2 count
0 1 10 0
1 1 10 1
2 2 11 0
3 3 12 0
4 1 10 2
b
col1 col2 count
0 1 10 0
1 2 11 0
2 1 10 1
3 3 12 0
4 3 12 1
a.merge(b, on=[*a], indicator=True, how='left')
col1 col2 count _merge
0 1 10 0 both
1 1 10 1 both
2 2 11 0 both
3 3 12 0 both
4 1 10 2 left_only
_.eval('_merge == "left_only"')
0 False
1 False
2 False
3 False
4 True
dtype: bool
If you need to get unmatched rows from both df1
and df2
, use an outer merge:
out = a.merge(b, on=[*a], indicator=True, how='outer')
df1_filter = (
out.query('_merge == "left_only"').drop(['count','_merge'], axis=1))
df2_filter = (
out.query('_merge == "right_only"').drop(['count','_merge'], axis=1))
df1_filter
col1 col2
4 1 10
df2_filter
col1 col2
5 3 12
Upvotes: 3