Reputation: 689
I tried looking up for something similar but could find it. So I have below structure dataframe. I looking to delete rows that have same score for 5 days or more. So in below case PeronID AB-123's records from 2/1 to 2/6 needs to be deleted, also for DG-3465 from 2/3 to 2/10. But nothing for TY-9456. I was thinking for using rolling() but that will delete only 2/1-2/5 for AB-123 but not 2/6.
PersonID Date Score
AB-123 2/1/2016 0
AB-123 2/2/2016 0
AB-123 2/3/2016 0
AB-123 2/4/2016 0
AB-123 2/5/2016 0
AB-123 2/6/2016 0
AB-123 2/7/2016 67.5
AB-123 2/8/2016 73.4
AB-123 2/9/2016 70.5
AB-123 2/10/2016 68
DG-3465 2/1/2016 22.5
DG-3465 2/2/2016 25.6
DG-3465 2/3/2016 36.4
DG-3465 2/4/2016 36.4
DG-3465 2/5/2016 36.4
DG-3465 2/6/2016 36.4
DG-3465 2/7/2016 36.4
DG-3465 2/8/2016 36.4
DG-3465 2/9/2016 36.4
DG-3465 2/10/2016 36.4
TY-9456 2/1/2016 0
TY-9456 2/2/2016 0
TY-9456 2/3/2016 5.23
TY-9456 2/4/2016 4.12
TY-9456 2/5/2016 5.95
TY-9456 2/6/2016 6.97
TY-9456 2/7/2016 12.45
TY-9456 2/8/2016 15.61
TY-9456 2/9/2016 15.61
TY-9456 2/10/2016 15.61
Tried few different things but I am kinda stuck with nothing poping up in my head. What would you suggest? Using python pandas by the way ;)
Upvotes: 3
Views: 765
Reputation: 214927
You can roll
on Score column, calculate the running standard deviation, and then drop rows where the standard deviations are zero along with five rows before them (this assumes you want to delete rows with same scores on consecutive days):
df.drop(np.unique(df.Score.rolling(5).std()[lambda x: x == 0].index.values - pd.np.arange(5)[:, None]))
Upvotes: 3
Reputation: 38415
You groupby shift and cumsum(). Edited to include @Scott Boston's suggestion
df.groupby(['PersonID',(df.Score != df.Score.shift()).cumsum()]).filter(lambda x: x.Score.size < 5)
PersonID Date Score
6 AB-123 2/7/2016 67.50
7 AB-123 2/8/2016 73.40
8 AB-123 2/9/2016 70.50
9 AB-123 2/10/2016 68.00
10 DG-3465 2/1/2016 22.50
11 DG-3465 2/2/2016 25.60
20 TY-9456 2/1/2016 0.00
21 TY-9456 2/2/2016 0.00
22 TY-9456 2/3/2016 5.23
23 TY-9456 2/4/2016 4.12
24 TY-9456 2/5/2016 5.95
25 TY-9456 2/6/2016 6.97
26 TY-9456 2/7/2016 12.45
27 TY-9456 2/8/2016 15.61
28 TY-9456 2/9/2016 15.61
29 TY-9456 2/10/2016 15.61
Upvotes: 3
Reputation: 375405
You can exclude those rows who are different by 0 and offset by 1 day:
In [11]: df[(df.Score.diff() != 0) | (df.Date.diff() != pd.offsets.Day().delta)]
Out[11]:
PersonID Date Score
0 AB-123 2016-02-01 0.00
6 AB-123 2016-02-07 67.50
7 AB-123 2016-02-08 73.40
8 AB-123 2016-02-09 70.50
9 AB-123 2016-02-10 68.00
10 DG-3465 2016-02-01 22.50
11 DG-3465 2016-02-02 25.60
12 DG-3465 2016-02-03 36.40
20 TY-9456 2016-02-01 0.00
22 TY-9456 2016-02-03 5.23
23 TY-9456 2016-02-04 4.12
24 TY-9456 2016-02-05 5.95
25 TY-9456 2016-02-06 6.97
26 TY-9456 2016-02-07 12.45
27 TY-9456 2016-02-08 15.61
Upvotes: 0