pandas - drop row based on first occurrence of condition

Question

Below is a sample of a larger dictionary:

    data = {'StudentId': ['AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams',
                          'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks'],
            'year': [2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016],
            'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
            'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
            'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}

I want to drop the first occurrence of the max value from the first month of each student. If there is a second occurrence of that value, I want to keep it.

I used groupby on StudentId, year, and month and calculated the max value of BookLevel.

I then built a second dataframe holding the max value per month per student and merged the two dataframes.

The merged dataframe looks as follows:

       StudentId year   month   BookLevel   BookLevelMax
    0   AAdams  2015    11      1.5         1.5
    1   AAdams  2015    11      1.5         1.5
    2   AAdams  2015    11      1.2         1.5
    3   AAdams  2015    11      1.4         1.5
    4   AAdams  2015    12      1.7         1.8
    5   AAdams  2015    12      1.8         1.8
    6   BBrooks 2015    12      3.2         3.8
    7   BBrooks 2015    12      3.1         3.8
    8   BBrooks 2015    12      3.8         3.8
    9   BBrooks 2016    1       3.3         3.4
   10   BBrooks 2016    1       3.4         3.4
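For reference, a column like BookLevelMax can also be produced in one step, without building and merging a second dataframe. This is only a minimal sketch: it swaps in groupby(...).transform('max') for the groupby-and-merge described in the question, and the name df is an assumption.

```python
import pandas as pd

data = {'StudentId': ['AAdams'] * 6 + ['BBrooks'] * 5,
        'year': [2015] * 9 + [2016] * 2,
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
        'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
        'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
df = pd.DataFrame(data)  # 'df' is an assumed name

# Max BookLevel per student per month, broadcast back onto every row of that group
df['BookLevelMax'] = (df.groupby(['StudentId', 'year', 'month'])['BookLevel']
                        .transform('max'))
print(df)
```

transform returns a result aligned with the original rows, which is why no merge is needed here.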

I want to drop the first occurrence of 1.5 in month 11 for AAdams and the 3.8 row for BBrooks in month 12.

I can use df.drop to remove rows where BookLevelMax is equal to BookLevel, but that drops the max in every month for each student, and it removes both the first occurrence and the second if there are two:

       StudentId    year    month   BookLevel   BookLevelMax
  2     AAdams      2015    11        1.2        1.5
  3     AAdams      2015    11        1.4        1.5
  4     AAdams      2015    12        1.7        1.8
  6     BBrooks     2015    12        3.2        3.8
  7     BBrooks     2015    12        3.1        3.8
  9     BBrooks     2016    1         3.3        3.4
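The exact df.drop call is not shown in the question, so the following is only a guess at what it might have looked like. It assumes a dataframe named df with a BookLevelMax column computed per student, year, and month, and it shows why every max row disappears.

```python
import pandas as pd

data = {'StudentId': ['AAdams'] * 6 + ['BBrooks'] * 5,
        'year': [2015] * 9 + [2016] * 2,
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
        'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
        'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
df = pd.DataFrame(data)  # assumed name
df['BookLevelMax'] = (df.groupby(['StudentId', 'year', 'month'])['BookLevel']
                        .transform('max'))

# Assumed form of the drop: remove every row whose BookLevel equals its group max.
# The comparison is True for ALL tied max rows in EVERY month, not just the first one.
is_max = df['BookLevel'] == df['BookLevelMax']
dropped = df.drop(df[is_max].index)
print(dropped.index.tolist())
```

The mask has no notion of "first occurrence" or "first month", so both 1.5 rows for AAdams and the max row of every later month are removed as well.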

I can't seem to find a way to drop only the first occurrence, and only in the first month. To be specific, AAdams has a max of 1.5 in month 11, and it occurs twice. I would like to keep one of those rows but drop the other. I also want to keep the row with 1.8 for AAdams, since it occurs in the second month (and likewise keep any max rows that other students have in months after their first). Hope this makes sense.

D Hagen · Accepted Answer

Original Question (as stated above):

Below is a sample of a larger dictionary, which became the dataframe dfdata:

    import pandas as pd

    data = {'StudentId': ['AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams',
                          'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks'],
            'year': [2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016],
            'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
            'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
            'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
    dfdata = pd.DataFrame(data)

I want to drop the first occurrence of the max value from the first month of each student. If there is a second occurrence, I want to keep it. I did not want to change any month other than the first month for each student.

This is my new solution:

    df = dfdata.sort_values(by=['StudentId', 'year', 'month', 'BookLevel'],
                            ascending=[True, True, True, False])

This sorts the rows by StudentId, year, and month, with BookLevel descending inside each month. As a result, the first row for each student is the highest BookLevel of that student's first month.

Then I did a groupby on StudentId with cumcount, which numbers the rows within each student starting at 0. This placed a zero next to the highest BookLevel of each student's first month.

    df1 = df.groupby('StudentId').cumcount()

Then I used the same cumcount as a boolean mask to keep only the rows without a zero:

    df2 = df[df.groupby('StudentId').cumcount() != 0]
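Put together, the steps above can be sketched as one self-contained script. It assumes dfdata is built from the dictionary with pd.DataFrame; the final sort_index() and the name result are additions for illustration, not part of the steps above, and only restore the original row order.

```python
import pandas as pd

data = {'StudentId': ['AAdams'] * 6 + ['BBrooks'] * 5,
        'year': [2015] * 9 + [2016] * 2,
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
        'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
        'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
dfdata = pd.DataFrame(data)

# Earliest month first for each student, highest BookLevel first within a month
df = dfdata.sort_values(by=['StudentId', 'year', 'month', 'BookLevel'],
                        ascending=[True, True, True, False])

# cumcount() is 0 only on each student's first sorted row,
# i.e. one occurrence of the max of that student's first month
df2 = df[df.groupby('StudentId').cumcount() != 0]

# Added for illustration: put the surviving rows back in their original order
result = df2.sort_index()
print(result)
```

Exactly one row per student is removed, so a tied second occurrence (the other 1.5 for AAdams) and the max rows of later months stay in place.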

Mission accomplished! My original attempt is a great example of making incremental progress while losing sight of the original goal.
