Reputation: 191
I have a dataframe and I want to drop duplicates based on different conditions:
A B
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
8 - 5.1
9 - 5.3
I want to drop all duplicates in column A, except for rows where A is "-". Then, among the rows where A is "-", I want to drop duplicates based on their column B value. Given the input dataframe, this should return the following:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
I have the following code, but it is not very efficient for very large amounts of data. How can I improve it?
import pandas as pd

def generate(df):
    # Rows whose A value is "-", deduplicated on B
    str_col = df[df["A"] == "-"].drop_duplicates(subset="B")
    # Remaining rows, deduplicated on A
    df = df[df["A"] != "-"].drop_duplicates(subset="A")
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    bigdata = pd.concat([df, str_col], ignore_index=True)
    return bigdata.sort_values("B")
Upvotes: 3
Views: 1081
Reputation: 323226
Use groupby + head:
df.groupby(['A','B']).head(1)
Out[7]:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
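For a self-contained check, the sketch below rebuilds the sample frame and applies the same call. Storing column A as strings is an assumption, since the question does not state the dtype. Note that this groups on the (A, B) pair, so it reproduces the requested output only when rows sharing a non-"-" value in A also share their B value, as they do in the sample.

```python
import pandas as pd

# Sample data from the question; A is stored as strings here (an assumption).
df = pd.DataFrame({
    "A": ["1", "1", "2", "2", "3", "4", "5", "-", "-", "-"],
    "B": [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.1, 5.1, 5.3],
})

# head(1) keeps the first row of every (A, B) group, in original row order
# and with the original index labels.
result = df.groupby(["A", "B"]).head(1)
print(result)
```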
Upvotes: 2
Reputation: 13868
df.drop_duplicates(subset=['A', 'B'])
Given a full set of data:
A B C
0 1 1.0 0
1 1 1.0 1
2 2 2.0 2
3 2 2.0 3
4 3 3.0 4
5 4 4.0 5
6 5 5.0 6
7 - 5.1 7
8 - 5.1 8
9 - 5.3 9
Result:
A B C
0 1 1.0 0
2 2 2.0 2
4 3 3.0 4
5 4 4.0 5
6 5 5.0 6
7 - 5.1 7
9 - 5.3 9
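To reproduce this, here is a minimal sketch that builds the three-column frame and deduplicates on the (A, B) pair. The string dtype for A is an assumption. The default keep="first" retains the first row of each pair, and columns outside the subset (here C) are carried along untouched.

```python
import pandas as pd

# Sample data from the answer; A is stored as strings here (an assumption).
df = pd.DataFrame({
    "A": ["1", "1", "2", "2", "3", "4", "5", "-", "-", "-"],
    "B": [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.1, 5.1, 5.3],
    "C": list(range(10)),
})

# Rows are compared on A and B only; keep="first" is the default.
deduped = df.drop_duplicates(subset=["A", "B"])
print(deduped)
```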
Upvotes: 2
Reputation: 150735
Use duplicated and eq:
df[~df.duplicated('A')            # keep rows that are not duplicates in A
   | (df['A'].eq('-')             # or rows with '-' in A
      & ~df['B'].duplicated())]   # that are not duplicates in B
Output:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
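The mask above and a plain deduplication on the (A, B) pair agree on the sample data, but they can diverge on other inputs. The sketch below illustrates this with made-up data in which one non-"-" value of A appears with two different B values; the helper name dedupe_keep_dash and the data are hypothetical, not part of the question.

```python
import pandas as pd

def dedupe_keep_dash(df, marker="-"):
    """Hypothetical helper wrapping the boolean mask shown above."""
    first_in_a = ~df.duplicated("A")                           # first row per A value
    dash_new_b = df["A"].eq(marker) & ~df["B"].duplicated()    # marker rows with unseen B
    return df[first_in_a | dash_new_b]

# Made-up data: A == "1" occurs with two different B values.
df = pd.DataFrame({
    "A": ["1", "1", "-", "-", "-"],
    "B": [1.0, 1.5, 5.1, 5.1, 5.3],
})

by_mask = dedupe_keep_dash(df)                    # keeps rows 0, 2, 4
by_pair = df.drop_duplicates(subset=["A", "B"])   # keeps rows 0, 1, 2, 4

print(by_mask)
print(by_pair)
```

Row 1 survives the pair-based call because (1, 1.5) is a new pair, while the mask drops it because A was already seen. Which behavior is wanted depends on whether non-"-" rows should be unique on A alone, as the question describes.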
Upvotes: 7