PyRaider
PyRaider

Reputation: 689

Pandas delete rows if value is same for certain date range

I tried looking up for something similar but could find it. So I have below structure dataframe. I looking to delete rows that have same score for 5 days or more. So in below case PeronID AB-123's records from 2/1 to 2/6 needs to be deleted, also for DG-3465 from 2/3 to 2/10. But nothing for TY-9456. I was thinking for using rolling() but that will delete only 2/1-2/5 for AB-123 but not 2/6.

PersonID    Date    Score
AB-123  2/1/2016    0
AB-123  2/2/2016    0
AB-123  2/3/2016    0
AB-123  2/4/2016    0
AB-123  2/5/2016    0
AB-123  2/6/2016    0
AB-123  2/7/2016    67.5
AB-123  2/8/2016    73.4
AB-123  2/9/2016    70.5
AB-123  2/10/2016   68
DG-3465 2/1/2016    22.5
DG-3465 2/2/2016    25.6
DG-3465 2/3/2016    36.4
DG-3465 2/4/2016    36.4
DG-3465 2/5/2016    36.4
DG-3465 2/6/2016    36.4
DG-3465 2/7/2016    36.4
DG-3465 2/8/2016    36.4
DG-3465 2/9/2016    36.4
DG-3465 2/10/2016   36.4
TY-9456 2/1/2016    0
TY-9456 2/2/2016    0
TY-9456 2/3/2016    5.23
TY-9456 2/4/2016    4.12
TY-9456 2/5/2016    5.95
TY-9456 2/6/2016    6.97
TY-9456 2/7/2016    12.45
TY-9456 2/8/2016    15.61
TY-9456 2/9/2016    15.61
TY-9456 2/10/2016   15.61

Tried few different things but I am kinda stuck with nothing poping up in my head. What would you suggest? Using python pandas by the way ;)

Upvotes: 3

Views: 765

Answers (3)

akuiper
akuiper

Reputation: 214927

You can roll on Score column, calculate the running standard deviation, and then drop rows where the standard deviations are zero along with five rows before them (this assumes you want to delete rows with same scores on consecutive days):

df.drop(np.unique(df.Score.rolling(5).std()[lambda x: x == 0].index.values - pd.np.arange(5)[:, None]))

enter image description here

Upvotes: 3

Vaishali
Vaishali

Reputation: 38415

You groupby shift and cumsum(). Edited to include @Scott Boston's suggestion

df.groupby(['PersonID',(df.Score != df.Score.shift()).cumsum()]).filter(lambda x: x.Score.size < 5)


    PersonID    Date    Score
6   AB-123  2/7/2016    67.50
7   AB-123  2/8/2016    73.40
8   AB-123  2/9/2016    70.50
9   AB-123  2/10/2016   68.00
10  DG-3465 2/1/2016    22.50
11  DG-3465 2/2/2016    25.60
20  TY-9456 2/1/2016    0.00
21  TY-9456 2/2/2016    0.00
22  TY-9456 2/3/2016    5.23
23  TY-9456 2/4/2016    4.12
24  TY-9456 2/5/2016    5.95
25  TY-9456 2/6/2016    6.97
26  TY-9456 2/7/2016    12.45
27  TY-9456 2/8/2016    15.61
28  TY-9456 2/9/2016    15.61
29  TY-9456 2/10/2016   15.61

Upvotes: 3

Andy Hayden
Andy Hayden

Reputation: 375405

You can exclude those rows who are different by 0 and offset by 1 day:

In [11]: df[(df.Score.diff() != 0) | (df.Date.diff() != pd.offsets.Day().delta)]
Out[11]:
   PersonID       Date  Score
0    AB-123 2016-02-01   0.00
6    AB-123 2016-02-07  67.50
7    AB-123 2016-02-08  73.40
8    AB-123 2016-02-09  70.50
9    AB-123 2016-02-10  68.00
10  DG-3465 2016-02-01  22.50
11  DG-3465 2016-02-02  25.60
12  DG-3465 2016-02-03  36.40
20  TY-9456 2016-02-01   0.00
22  TY-9456 2016-02-03   5.23
23  TY-9456 2016-02-04   4.12
24  TY-9456 2016-02-05   5.95
25  TY-9456 2016-02-06   6.97
26  TY-9456 2016-02-07  12.45
27  TY-9456 2016-02-08  15.61

Upvotes: 0

Related Questions