mortysporty
mortysporty

Reputation: 2889

Remove duplicates according to multiple interpendant conditions in Python

I want to remove duplicates from a dataframe given some rather elaborate coditions. Consider the dataframe

df = pd.DataFrame({'id' : [1, 1, 1, 1, 2, 2, 2, 3, 3],
                   'dup' : [1, 2, 2, 2, 3, 4, 4, 5, 5],
                   'a' : [False, True, False, True, False, True, False, False, False],
                   'b' : [False, False, True, True, False, False, False, False, False],
                   'data' : [0, 1, 2, 3, 0, 1, 2, 0, 0]})


       a      b  data  dup  id
0  False  False     0    1   1
1   True  False     1    2   1
2  False   True     2    2   1
3   True   True     3    2   1
4  False  False     0    3   2
5   True  False     1    4   2
6  False  False     2    4   2
7  False  False     0    5   3
8  False  False     0    5   3

id indicates which rows belong together and dup is the variable on which the data is considered duplicated. So, rows 0 to 3 belong together and rows 1 to 3 are duplicates. And similarily rows 4 to 6 belong together, row 5 and 6 are duplicates etc.

I want to remove the duplicates according to the following rules :

  1. if there exists a row such that BOTH column a AND column b are True, keep that row and drop the rest.
  2. if condition 1 does not happen, if there exists a row such that EITHER column a OR column b is True, keep that row and drop the rest
  3. if neither 1 or 2 happens, keep one of the duplicated rows. It does not matter which.

The resulting frame should look like this

       a      b  data  dup  id
0  False  False     0    1   1
3   True   True     3    2   1
4  False  False     0    3   2
5   True  False     1    4   2
7  False  False     0    5   3

Regarding cases where more than two duplicated rows obey the same rule like this

       a      b  data  dup  id
0  False  False     0    3   2
1   True  False     1    4   2
2  False   True     2    4   2

where both row 1 and 2 are true according to rule 2. Those cases are not possible according to the logic leading up to the dataframe.

Upvotes: 0

Views: 1741

Answers (1)

Nicola Pellicanò
Nicola Pellicanò

Reputation: 469

Here you are:

df=df.sort_values(by=['dup','a','b'],ascending=[True,False,False])
df=df.drop_duplicates(subset='dup',keep='first')

First I sorted the rows with respect to your rule: by dup, and then put first the ones with True on 'a' (and possibly on 'b' too). Then drop duplicate rows with respect to 'dup' keeping the first encountered.

Upvotes: 3

Related Questions