Roy
Roy

Reputation: 1044

Filter Dataframe Based on Local Minima with Increasing Timeline

EDITED:

I have the following dataframe of students with their exam scores in different dates (sorted):

df = pd.DataFrame({'student': 'A A A B B B B C C'.split(),
                  'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
                               datetime.datetime(2013,7,1),datetime.datetime(2013,9,2),
                               datetime.datetime(2013,10,1),datetime.datetime(2013,11,2),
                               datetime.datetime(2014,2,2),datetime.datetime(2013,7,1),
                               datetime.datetime(2013,9,2),],
                   'score': [15, 17, 32, 22, 28, 24, 33, 33, 15]})

print(df)

  student  exam_date  score
0       A 2013-04-01     15
1       A 2013-06-01     17
2       A 2013-07-01     32
3       B 2013-09-02     22
4       B 2013-10-01     28
5       B 2013-11-02     24
6       B 2014-02-02     33
7       C 2013-07-01     33
8       C 2013-09-02     15

I need to keep only those rows where the score is increased by more than 10 from the local minima.

For example, for the student A, the local minima is 15 and the score is increased to 32 in the next-to-to date, so we're gonna keep that.

For the student B, no score is increased by more than 10 from local minima. 28-22 and 33-24 both are less than 10.

For the student C, the local minima is 15, but the score isn't increased after that, so we're gonna drop that.

I'm trying the following script:

out = df[df['score'] - df.groupby('student', as_index=False)['score'].cummin()['score']>= 10]

print(out)
2   A   2013-07-01  32
6   B   2014-02-02  33 #--Shouldn't capture this as it's increased by `9` from local minima of `24`

Desired output:

   student  exam_date  score
2        A  2013-07-01  32

# For A, score of 32 is increased by 17 from local minima of 15  

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

Upvotes: 0

Views: 83

Answers (3)

Corralien
Corralien

Reputation: 120409

Assuming your dataframe is already sorted by date:

highest_score = lambda x: x['score'] - x['score'].mask(x['score'].gt(x['score'].shift())).ffill() > 10
out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)

# Output
  student  exam_date  score
2       A 2013-07-01     32

Focus on lambda function

Let's modify your dataframe and extract one student to avoid groupby:

>>> df = df[df['student'] == 'B']
  student  exam_date  score
3       B 2013-09-02     22
4       B 2013-10-01     28
5       B 2013-11-02     24
6       B 2014-02-02     33

# Step-1: find row where value is not a local minima
>>> df['score'].gt(df['score'].shift())
3    False
4     True
5    False
6     True
Name: score, dtype: bool

# Step-2: hide non local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift()))
3    22.0
4     NaN
5    24.0
6     NaN
Name: score, dtype: float64

# Step-3: fill forward local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift()))
3    22.0
4    22.0
5    24.0
6    24.0
Name: score, dtype: float64

# Step-4: check if the condition is True
>>> df['score'] - df['score'].mask(df['score'].gt(df['score'].shift())) > 10
3    False
4    False
5    False
6    False
Name: score, dtype: bool

Upvotes: 1

user7864386
user7864386

Reputation:

We could try the following:

  1. Find the difference between consecutive scores for each student using groupby + diff.

  2. using where, assign NaN values to all rows where the score difference is less than 10

  3. use groupby + first to get the first score differences greater than 10 for each student.

msk = (diff>10) | (diff.groupby([diff[::-1].shift().lt(0).cumsum()[::-1], df['student']]).cumsum()>10)
out = df.where(msk).groupby('student').first().reset_index()

Output:

  student  exam_date  score
0       A 2013-06-01   27.0
1       B 2013-10-01   43.0

Upvotes: 1

user17242583
user17242583

Reputation:

Incorporating @Corralien's solution you posted, I've come up with a one-liner that works nicely:

filtered = df.groupby('student', as_index=False).apply(lambda x: None if (v := (x['score'].cummax() * (x['score'] > x['score'].shift()) - (x['score'].cummin()) >= 10)).sum() == 0 else x.loc[v.idxmax()] ).dropna()

Output:

>>> filtered
  student  exam_date  score
0       A 2013-06-01   27.0
1       B 2013-10-01   43.0

Upvotes: 0

Related Questions