Roy
Roy

Reputation: 1044

Filter Rows Based on Highest and Lowest Column Values

I have a unique dataframe:

df = pd.DataFrame({'student': 'A B C D'.split(),
                  'score1':[20, 15, 30, 22],
                   'score2': [15, 22, 35, 18],
                   'score3': [24, 32, 38, 25],
                   'score4': [20, 20, 26, 30]})

print(df)

  student  score1  score2  score3  score4
0       A      20      15      24      20
1       B      15      22      32      20
2       C      30      35      38      26
3       D      22      18      25      30

I need to keep only those rows where the highest score is increased by more than 10 from the lowest score, otherwise drop them.

For example, for the student A, the lowest score is 15 and after that the score is increased to 24 (by 9), so we're gonna drop that.

For the student B, the lowest score is 15 and the score is increased to 32, so we're gonna keep that.

For the student C, the lowest score is 26, but there no score is increased after that. It is basically decreased, so we're gonna drop that.

I know that diff() and ge() would be helpful here, but not sure how they would work when the lowest score (has to be on the left of highest score) and highest score (has to be on the right side of lowest score) are many columns apart.

Desired output:

name

B #--highest score of 32 (score3) increased by 17 from lowest score of 15 (score1)  
D #--highest score of 30 (score4) increased by 12 from lowest score of 18 (score2) 

Any suggestions would be appreciated. Thanks!

Upvotes: 0

Views: 223

Answers (2)

sophocles
sophocles

Reputation: 13821

You can start by sorting your dataframe along the columns, so that your score columns are in the right order (score1 -> score4) using sort_index. Then, you can get the min(1) score of each student and the corresponding column where the minimum occured, using idxmin(1) (same approach for the maximum):

# Sort Index
df.sort_index(axis=1,inplace=True) 
sc = df.filter(like='score').columns

# Max score with corresponding column
ma = pd.concat([df[sc].idxmax(1),df[sc].max(1)],axis=1)
mi = pd.concat([df[sc].idxmin(1),df[sc].min(1)],axis=1)

Lastly, you can use boolean indexing to compare the first column of max with the first column of min which will show whether the max score happened after the min score, and also compare whether the difference between those scores is greater than 10:

df.loc[(ma[0] > mi[0]) & (ma[1]-mi[1] > 10)]

Which will return:

   score1  score2  score3  score4 student
1      15      22      32      20       B
3      22      18      25      30       D

Upvotes: 3

Corralien
Corralien

Reputation: 120391

Try:

select_student = lambda x: x.sub(x.cummin()).gt(10).any()
out = df[df.filter(like='score').apply(select_student, axis=1)]
print(out)

# Output:
  student  score1  score2  score3  score4
1       B      15      22      32      20
3       D      22      18      25      30

Upvotes: 1

Related Questions