Remove duplicates according to multiple interpendant conditions in Python

Question

I want to remove duplicates from a dataframe given some rather elaborate coditions. Consider the dataframe

df = pd.DataFrame({'id' : [1, 1, 1, 1, 2, 2, 2, 3, 3],
                   'dup' : [1, 2, 2, 2, 3, 4, 4, 5, 5],
                   'a' : [False, True, False, True, False, True, False, False, False],
                   'b' : [False, False, True, True, False, False, False, False, False],
                   'data' : [0, 1, 2, 3, 0, 1, 2, 0, 0]})


       a      b  data  dup  id
0  False  False     0    1   1
1   True  False     1    2   1
2  False   True     2    2   1
3   True   True     3    2   1
4  False  False     0    3   2
5   True  False     1    4   2
6  False  False     2    4   2
7  False  False     0    5   3
8  False  False     0    5   3

id indicates which rows belong together and dup is the variable on which the data is considered duplicated. So, rows 0 to 3 belong together and rows 1 to 3 are duplicates. And similarily rows 4 to 6 belong together, row 5 and 6 are duplicates etc.

I want to remove the duplicates according to the following rules :

if there exists a row such that BOTH column a AND column b are True, keep that row and drop the rest.
if condition 1 does not happen, if there exists a row such that EITHER column a OR column b is True, keep that row and drop the rest
if neither 1 or 2 happens, keep one of the duplicated rows. It does not matter which.

The resulting frame should look like this

       a      b  data  dup  id
0  False  False     0    1   1
3   True   True     3    2   1
4  False  False     0    3   2
5   True  False     1    4   2
7  False  False     0    5   3

Regarding cases where more than two duplicated rows obey the same rule like this

       a      b  data  dup  id
0  False  False     0    3   2
1   True  False     1    4   2
2  False   True     2    4   2

where both row 1 and 2 are true according to rule 2. Those cases are not possible according to the logic leading up to the dataframe.

Nicola Pellican&#242; · Accepted Answer

Here you are:

df=df.sort_values(by=['dup','a','b'],ascending=[True,False,False])
df=df.drop_duplicates(subset='dup',keep='first')

First I sorted the rows with respect to your rule: by dup, and then put first the ones with True on 'a' (and possibly on 'b' too). Then drop duplicate rows with respect to 'dup' keeping the first encountered.

Remove duplicates according to multiple interpendant conditions in Python

Answers (1)

Related Questions