Reputation: 199
I am stuck: I can neither frame my question precisely nor come up with a solution. I describe the problem below.
I have a data frame with two columns, and I want to remove rows that have a duplicate value in col2, depending on the value of col1.
col1 | col2
A 1
A 1
B 1
A 2
B 3
A 4
B 4
Result
col1 | col2
A 1
A 1
A 2
B 3
A 4
So I want to remove duplicate rows under this condition: if a value occurs more than once in col2, I keep the rows where col1 is 'A' and drop the rows where col1 is 'B'. I am really confused about how to implement this condition in Python, and I would be grateful if someone could rephrase the question and the headline in a simpler way.
Upvotes: 0
Views: 40
Reputation: 453
Edit:
Based on the additional detail provided in the question after editing, the solution now first collects all rows whose col2 value is duplicated. We then find the rows among those duplicates that have "B" in col1 and remove them from our original data frame.
For your example case:
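import pandas as pd
df = pd.DataFrame({"col1": ["A", "A", "B", "A", "B", "A", "B"], "col2": [1,1,1,2,3,4,4]})
# keep=False marks every row whose col2 value occurs more than once
duplicates = df[df.duplicated(subset=['col2'], keep=False)]
# index labels of the duplicated rows that have "B" in col1
duplicates_b_index = duplicates.index[duplicates['col1'] == "B"].tolist()
# drop those rows from the original data frame
df = df.drop(duplicates_b_index)
print(df)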
This then gives you
col1 col2
0 A 1
1 A 1
3 A 2
4 B 3
5 A 4
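The same filter can also be written as a single boolean mask, which avoids building the intermediate data frame. The sketch below is only one possible reading of the question. It also includes a stricter variant, which rests on my own assumption that a "B" row should be dropped only when an "A" row with the same col2 value actually exists. The names `is_dup`, `is_b`, `has_a`, `result`, and `result_strict` are illustrative and not from the original code.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "B", "A", "B", "A", "B"],
                   "col2": [1, 1, 1, 2, 3, 4, 4]})

# True for every row whose col2 value occurs more than once
is_dup = df.duplicated(subset="col2", keep=False)
# True for every row that has "B" in col1
is_b = df["col1"].eq("B")

# Same rule as above: drop "B" rows whose col2 value is duplicated
result = df[~(is_dup & is_b)]
print(result)

# Stricter variant (assumption): drop a "B" row only if some "A" row
# shares its col2 value, so a value that appears only with "B" is kept
has_a = df["col1"].eq("A").groupby(df["col2"]).transform("any")
result_strict = df[~(is_b & has_a)]
print(result_strict)
```

For the example data both versions keep the same rows. They differ only when a col2 value is repeated exclusively in "B" rows: the first version removes all of those rows, while the stricter one keeps them.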
Upvotes: 1