Reputation: 6159
I have a df,
Sr.No Name Class Data
0 1 Sri 1 sri is a good player
1 '' Sri 2 sri is good in cricket
2 '' Sri 3 sri went out
3 2 Ram 1 Ram is a good player
4 '' Ram 2 sri is good in cricket
5 '' Ram 3 Ram went out
6 3 Sri 1 sri is a good player
7 '' Sri 2 sri is good in cricket
8 '' Sri 3 sri went out
9 4 Sri 1 sri is a good player
10 '' Sri 2 sri is good in cricket
11 '' Sri 3 sri went out
12 '' Sri 4 sri came back
I am trying to drop duplicates based on ["Name","Class","Data"]. The goal is to drop duplicates based on all sentences per Sr No.
My expected output is,
out_df
Sr.No Name Class Data
0 1 Sri 1 sri is a good player
1 Sri 2 sri is good in cricket
2 Sri 3 sri went out
3 2 Ram 1 Ram is a good player
4 Ram 2 sri is good in cricket
5 Ram 3 Ram went out
9 4 Sri 1 sri is a good player
10 Sri 2 sri is good in cricket
11 Sri 3 sri went out
12 Sri 4 sri came back
Upvotes: 2
Views: 372
Reputation: 402363
Create a dummy column with a groupby
+ transform
operation.
v = df.groupby(df['Class'].diff().le(0).cumsum())['Data'].transform(' '.join)
Or,
v = df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join)
This dummy column becomes a factor when deciding what rows are to be dropped.
m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])
df[~m]
Class Data Name Sr.No
0 1 sri is a good player Sri 1
1 2 sri is good in cricket Sri
2 3 sri went out Sri
3 1 Ram is a good player Ram 2
4 2 sri is good in cricket Ram
5 3 Ram went out Ram
9 1 sri is a good player Sri 4
10 2 sri is good in cricket Sri
11 3 sri went out Sri
12 4 sri came back Sri
Details
Form groups from the monotonically increasing Class
values -
i = df['Class'].diff().le(0).cumsum()
i
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
11 3
12 3
Name: Class, dtype: int64
Use this to group, and transform Data
with a str.join
operation -
v = df.groupby(i)['Data'].transform(' '.join)
Which is simply a column of joined strings. Finally, assign the dummy column and call duplicated
-
m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])
m
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 True
9 False
10 False
11 False
12 False
dtype: bool
Upvotes: 2