Reputation: 2889
I want to remove duplicates from a dataframe given some rather elaborate coditions. Consider the dataframe
df = pd.DataFrame({'id' : [1, 1, 1, 1, 2, 2, 2, 3, 3],
'dup' : [1, 2, 2, 2, 3, 4, 4, 5, 5],
'a' : [False, True, False, True, False, True, False, False, False],
'b' : [False, False, True, True, False, False, False, False, False],
'data' : [0, 1, 2, 3, 0, 1, 2, 0, 0]})
a b data dup id
0 False False 0 1 1
1 True False 1 2 1
2 False True 2 2 1
3 True True 3 2 1
4 False False 0 3 2
5 True False 1 4 2
6 False False 2 4 2
7 False False 0 5 3
8 False False 0 5 3
id
indicates which rows belong together and dup
is the variable on which the data is considered duplicated. So, rows 0 to 3 belong together and rows 1 to 3 are duplicates. And similarily rows 4 to 6 belong together, row 5 and 6 are duplicates etc.
I want to remove the duplicates according to the following rules :
a
AND column b
are True
, keep that row and drop the rest.a
OR column b
is True
, keep that row and drop the restThe resulting frame should look like this
a b data dup id
0 False False 0 1 1
3 True True 3 2 1
4 False False 0 3 2
5 True False 1 4 2
7 False False 0 5 3
Regarding cases where more than two duplicated rows obey the same rule like this
a b data dup id
0 False False 0 3 2
1 True False 1 4 2
2 False True 2 4 2
where both row 1 and 2 are true according to rule 2. Those cases are not possible according to the logic leading up to the dataframe.
Upvotes: 0
Views: 1741
Reputation: 469
Here you are:
df=df.sort_values(by=['dup','a','b'],ascending=[True,False,False])
df=df.drop_duplicates(subset='dup',keep='first')
First I sorted the rows with respect to your rule: by dup, and then put first the ones with True on 'a' (and possibly on 'b' too). Then drop duplicate rows with respect to 'dup' keeping the first encountered.
Upvotes: 3