Reputation: 214
I've got a df below with numerous duplicate values. Using the code below, I'm aiming to drop rows where Value is unique compared to the previous rows and Group is equal to C. Further, where this occurs I want to remove all previous duplicate rows.
import pandas as pd

d = {'Item': ["Red", "Red", "Red", "Green", "Green", "Red", "Red", "Red", "Green", "Green", "Green", "Green", "Red", "Red", "Red", "Green"],
     'Value': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6],
     'Group': ["A", "B", "B", "C", "D", "D", "A", "B", "C", "D", "E", "E", "B", "B", "D", "D"],
     }
df = pd.DataFrame(data=d)

mask = (df['Item'].isin(['Green'])) & (df.Value.eq(df.Value.shift(-1)))
df = df[~mask]
The input is:
Item Value Group
0 Red 1 A
1 Red 1 B
2 Red 1 B
3 Green 2 C
4 Green 2 D
5 Red 3 D
6 Red 3 A
7 Red 3 B
8 Green 4 C
9 Green 4 D
10 Green 4 E
11 Green 4 E
12 Red 5 B
13 Red 5 B
14 Red 5 D
15 Green 6 D
Intended output:
Item Value Group
0 Red 1 A
4 Green 2 D
5 Red 3 D
6 Red 3 A
9 Green 4 D
10 Green 4 E
11 Green 4 E
12 Red 5 B
13 Red 5 B
14 Red 5 D
15 Green 6 D
Current output:
Item Value Group
0 Red 1 A
1 Red 1 B
2 Red 1 B
4 Green 2 D
5 Red 3 D
6 Red 3 A
7 Red 3 B
11 Green 4 E
12 Red 5 B
13 Red 5 B
14 Red 5 D
15 Green 6 D
Upvotes: 1
Views: 242
Reputation: 24304
IIUC, another possible way:
out = df.shift(-1)
# shift the whole dataframe up by one row
cond = df['Value'].ne(out['Value']) & out['Group'].eq('C')
# your initial condition: the next row starts a new Value and its Group is "C"
df = df[~cond].drop(df[cond].index + 1)
# drop the matching rows along with the row that follows each of them
Output of df:
Item Value Group
0 Red 1 A
1 Red 1 B
4 Green 2 D
5 Red 3 D
6 Red 3 A
9 Green 4 D
10 Green 4 E
11 Green 4 E
12 Red 5 B
13 Red 5 B
14 Red 5 D
15 Green 6 D
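The shift(-1)/drop(index + 1) pattern above can be sketched on a tiny made-up frame (illustrative values, not the asker's df): the shifted copy lets each row see its successor, and `index + 1` then points at the row that follows each match.

```python
import pandas as pd

# Made-up data: row 1 is followed by a row that starts a new Value in Group C.
df = pd.DataFrame({'Value': [1, 1, 2, 3], 'Group': ['A', 'B', 'C', 'D']})

nxt = df.shift(-1)  # row i now holds row i+1's values
cond = df['Value'].ne(nxt['Value']) & nxt['Group'].eq('C')
# cond is True on the row *before* each offending "C" row
out = df[~cond].drop(df[cond].index + 1)  # drop that row and its successor
print(out.index.tolist())
```

Here cond flags only row 1, and `df[cond].index + 1` resolves to row 2 (the "C" row itself), so both disappear and rows 0 and 3 survive.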
Update:
IIUC, try creating a custom function that computes the conditions and returns the indices of the rows to drop:
def getindicies():
    out = df.shift(-1)
    cond = df['Value'].ne(out['Value']) & out['Group'].eq('C')
    idx = df.groupby('Group').filter(lambda x: x.duplicated(keep=False).any()).index
    to_drop = df.loc[:(df[cond].index + 1)[-1]]
    return to_drop[to_drop.index.isin(idx)].index.tolist() + (df[cond].index + 1).tolist()

# Finally, call the function and drop those rows:
df = df.drop(getindicies())
If you print df you will get your desired output.
OR, extending the logic of this answer:
cond = df.Value.diff().ne(0) & df.Group.eq("C")
duplicates = df.groupby('Group').transform(lambda x: x.duplicated(keep=False).any()).all(1)
to_drop = (cond | cond.shift(-1)) | duplicates
df = df[~to_drop].loc[:df[cond].index[-1]].append(df.loc[df[cond].index[-1] + 1:])
# OR, since DataFrame.append is deprecated (removed in pandas 2.0):
df = pd.concat([df[~to_drop].loc[:df[cond].index[-1]], df.loc[df[cond].index[-1] + 1:]])
If you print df you will get your desired output.
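The split-and-rejoin step can be sketched on a made-up frame. DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, which is why the pd.concat form above is the one to prefer:

```python
import pandas as pd

# Made-up data, purely to show label-based slicing and re-joining.
df = pd.DataFrame({'Value': [1, 2, 3, 4]})

head = df.loc[:1]   # rows up to and including label 1 (.loc slices are end-inclusive)
tail = df.loc[2:]   # the remaining rows
rejoined = pd.concat([head, tail])  # glue the two slices back together
print(rejoined.equals(df))
```

In the answer above the first slice is the filtered part (up to the last "C" row) and the second slice is the untouched remainder.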
Upvotes: 1
Reputation: 26676
Another way:
m = df['Value'].duplicated(keep='last') & df['Group'].ne('C')
n = (df['Group'].ne('C') & df['Group'].shift(-1).eq('C')) | (df['Group'].eq('C'))
df[~(m & n)]
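A quick sketch of the keep='last' building block used in m (made-up values, not the asker's data): every occurrence of a value except its last one is flagged True.

```python
import pandas as pd

# Series.duplicated(keep='last') marks all but the final occurrence of each value.
s = pd.Series([1, 1, 1, 2, 2, 3])
print(s.duplicated(keep='last').tolist())
```

So the last 1 (index 2), the last 2 (index 4) and the unique 3 come back False, and everything before them True.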
Upvotes: 1
Reputation: 18296
# form the condition
cond = df.Value.diff().ne(0) & df.Group.eq("C")
# also consider the previous row
to_drop = cond | cond.shift(-1)
# index with inverse of the mask
new_df = df[~to_drop]
Locations where Value is not equal to the previous one can be found by checking that the difference is not 0:
df.Value.diff().ne(0)
and Group being "C" is found with:
df.Group.eq("C")
Combining them with "and":
cond = df.Value.diff().ne(0) & df.Group.eq("C")
Since you want to drop the previous row too, we can "or" this with its shifted version:
to_drop = cond | cond.shift(-1)
which gives
>>> to_drop
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 True
8 True
9 False
10 False
11 False
12 False
13 False
14 False
15 False
Therefore final operation is indexing with inverse of this:
>>> new_df = df[~to_drop]
>>> new_df
Item Value Group
0 Red 1 A
1 Red 1 B
4 Green 2 D
5 Red 3 D
6 Red 3 A
9 Green 4 D
10 Green 4 E
11 Green 4 E
12 Red 5 B
13 Red 5 B
14 Red 5 D
15 Green 6 D
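The two mask-building steps above can be sketched on a small made-up pair of Series: diff().ne(0) flags the positions where the value changes (the first position always counts, since NaN != 0), and or-ing the mask with its shift(-1) additionally flags the row just before each match.

```python
import pandas as pd

# Made-up values, chosen so that two positions hit Group "C".
v = pd.Series([1, 1, 2, 2, 3])
g = pd.Series(['A', 'B', 'C', 'D', 'C'])

cond = v.diff().ne(0) & g.eq('C')   # value changed AND group is "C"
to_drop = cond | cond.shift(-1)     # also flag the row preceding each match
print(cond.tolist())
print(to_drop.tolist())
```

Here cond is True at positions 2 and 4, and to_drop additionally picks up positions 1 and 3, their immediate predecessors.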
Upvotes: 3