BrownBatman

Reputation: 199

Removing rows with a duplicate value in one column, giving priority to a value in another column

I am having trouble both framing my question precisely and coming up with a solution. The problem is described below.

I have a data frame with two columns, and I want to remove rows that have a duplicate value in col2, depending on the value of col1.

col1 | col2
A       1
A       1    
B       1
A       2
B       3
A       4
B       4

Result
 
col1 | col2
A       1
A       1
A       2
B       3
A       4

So I want to remove the duplicate rows with this condition: if a value occurs more than once in col2, I give priority to the rows where the col1 value is 'A' and drop the duplicates where the col1 value is 'B'. I am really confused about how to implement this condition in Python, and I would be really grateful if someone could rephrase the question and the headline in a simpler way.

Upvotes: 0

Views: 40

Answers (1)

H_Boofer

Reputation: 453

Edit:

Based on the additional detail added to the question, the solution now builds a data frame of all rows whose col2 value occurs more than once. It then finds the rows among those duplicates that have "B" in col1 and removes them from the original data frame.

For your example case:

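import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "B", "A", "B", "A", "B"], "col2": [1,1,1,2,3,4,4]})

# keep=False marks every row whose col2 value occurs more than once
duplicates = df[df.duplicated(subset=['col2'], keep=False)]
# index labels of the duplicated rows that have "B" in col1
duplicates_b_index = duplicates.index[duplicates['col1'] == "B"].tolist()

# drop those rows from the original data frame
df = df.drop(duplicates_b_index)

print(df)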

This then gives you

  col1  col2
0    A     1
1    A     1
3    A     2
4    B     3
5    A     4
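
The same steps can also be written as a single boolean mask, which avoids building the intermediate index list. The sketch below is only an illustration: the names is_dup, a_values, result and result_strict are made up here, and the second variant rests on an assumption about what you want (drop a "B" row only when some "A" row shares its col2 value), which the question does not confirm.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A", "A", "B", "A", "B", "A", "B"],
    "col2": [1, 1, 1, 2, 3, 4, 4],
})

# True for every row whose col2 value occurs more than once
is_dup = df.duplicated(subset="col2", keep=False)

# Variant 1: same rule as the code above, as one mask:
# drop rows that are duplicated in col2 AND have "B" in col1
result = df[~(is_dup & (df["col1"] == "B"))]

# Variant 2 (assumption, not confirmed by the question):
# drop a "B" row only if at least one "A" row has the same col2 value
a_values = df.loc[df["col1"] == "A", "col2"]
result_strict = df[~((df["col1"] == "B") & df["col2"].isin(a_values))]

print(result)
print(result_strict)
```

For the example data both variants keep the same rows as above. They differ only when a col2 value is repeated among "B" rows alone (for example two rows B 5): the first variant drops both of them, while the second keeps them because no "A" row competes for that value.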

Upvotes: 1
