Reputation: 153
Below is a sample of a larger dictionary:
data = {'StudentId': ['AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams',
                      'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks'],
        'year': [2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016],
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
        'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
        'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
I want to drop the first occurrence of the max value from the first month of each student. If there is a second occurrence, I want to keep it.
I grouped by StudentId, year, and month and calculated the max value of BookLevel for each group.
I put that per-student, per-month max in a second dataframe and merged the two dataframes.
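A minimal sketch of that groupby-and-merge step, assuming the sample dictionary is loaded into a dataframe called df. The names monthly_max and merged are only illustrative, not from my actual code:

```python
import pandas as pd

data = {'StudentId': ['AAdams'] * 6 + ['BBrooks'] * 5,
        'year': [2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016],
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
        'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
        'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
df = pd.DataFrame(data)

# Max BookLevel per student per month, one row per group
monthly_max = (
    df.groupby(['StudentId', 'year', 'month'], as_index=False)['BookLevel']
      .max()
      .rename(columns={'BookLevel': 'BookLevelMax'})
)

# Attach the monthly max to every original row (a left merge keeps the row order of df)
merged = df.merge(monthly_max, on=['StudentId', 'year', 'month'], how='left')
```

Note that merged also keeps the day column, which I left out of the printed tables.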
I then had a dataframe that looks as follows:
    StudentId  year  month  BookLevel  BookLevelMax
0      AAdams  2015     11        1.5           1.5
1      AAdams  2015     11        1.5           1.5
2      AAdams  2015     11        1.2           1.5
3      AAdams  2015     11        1.4           1.5
4      AAdams  2015     12        1.7           1.8
5      AAdams  2015     12        1.8           1.8
6     BBrooks  2015     12        3.2           3.8
7     BBrooks  2015     12        3.1           3.8
8     BBrooks  2015     12        3.8           3.8
9     BBrooks  2016      1        3.3           3.4
10    BBrooks  2016      1        3.4           3.4
I want to drop the first occurrence of 1.5 in month 11 for AAdams and the 3.8 row for BBrooks in month 12.
I can use df.drop to remove rows where BookLevelMax is equal to BookLevel, but that drops the max in every month for each student, and it removes both the first occurrence and the second if there are two:
    StudentId  year  month  BookLevel  BookLevelMax
2      AAdams  2015     11        1.2           1.5
3      AAdams  2015     11        1.4           1.5
4      AAdams  2015     12        1.7           1.8
6     BBrooks  2015     12        3.2           3.8
7     BBrooks  2015     12        3.1           3.8
9     BBrooks  2016      1        3.3           3.4
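For reference, a sketch of the kind of drop that gives the output above. It assumes the same sample data and uses transform('max') as a shortcut to build BookLevelMax (this is an assumption for brevity; the merge gives the same column). The name dropped is illustrative:

```python
import pandas as pd

data = {'StudentId': ['AAdams'] * 6 + ['BBrooks'] * 5,
        'year': [2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016],
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
        'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
        'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
df = pd.DataFrame(data)

# Broadcast each (student, year, month) max back onto its rows
df['BookLevelMax'] = (
    df.groupby(['StudentId', 'year', 'month'])['BookLevel'].transform('max')
)

# Drop every row that equals its monthly max: this hits every month
# and every tied occurrence, which is more than I want
dropped = df.drop(df[df['BookLevel'] == df['BookLevelMax']].index)
```

Both 1.5 rows for AAdams in month 11 disappear, and so do the 1.8 and 3.4 rows from later months.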
I can't seem to find a way to drop only the first occurrence and to do it only in the first month. To be specific, AAdams has a max of 1.5 in month 11, and it occurs twice. I would like to keep one of those rows but drop the other. I also want to keep the row with 1.8 under AAdams, since it occurs in the second month (and likewise keep the max rows for other students in any month after their first). Hope this makes sense.
Upvotes: 2
Views: 3860
Reputation: 153
Original Question (as stated above):
Below is a sample of a larger dictionary, which became dfdata:
data = {'StudentId': ['AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams', 'AAdams',
                      'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks', 'BBrooks'],
        'year': [2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016],
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
        'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
        'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
I want to drop the first occurrence of the max value from the first month of each student. If there is a second occurrence, I want to keep it. I do not want to change any month other than the first month for each student.
This is my new solution:
df = dfdata.sort_values(by=['StudentId', 'year', 'month', 'BookLevel'], ascending=[True, True, True, False])
This brought the highest BookLevel of each month to the top of its StudentId, year, month group, so the first row for each student is the max of that student's first month.
Then I did a groupby on StudentId using cumcount, which numbers the rows within each student starting from zero. This placed a zero next to the highest BookLevel in each student's first month.
df1 = df.groupby('StudentId').cumcount()
Then I selected all the rows without a zero:
df2 = df[df.groupby('StudentId').cumcount() != 0]
Mission accomplished! Original try is a great example of looking at incremental progress but losing sight of the original goal.
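Putting the steps together, here is a self-contained sketch on the sample data. One thing in it is an assumption and not part of my original sort: I added 'day' as a last sort key so that, when the max is tied, the earlier date is the one that lands on top and gets dropped. The names position and result are illustrative.

```python
import pandas as pd

data = {'StudentId': ['AAdams'] * 6 + ['BBrooks'] * 5,
        'year': [2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016],
        'month': [11, 11, 11, 11, 12, 12, 12, 12, 12, 1, 1],
        'day': [15, 16, 21, 23, 1, 2, 15, 18, 30, 5, 7],
        'BookLevel': [1.5, 1.5, 1.2, 1.4, 1.7, 1.8, 3.2, 3.1, 3.8, 3.3, 3.4]}
dfdata = pd.DataFrame(data)

# Earliest month first for each student, highest BookLevel first within a month.
# 'day' (ascending) is an added tie-breaker, not in the original sort.
df = dfdata.sort_values(
    by=['StudentId', 'year', 'month', 'BookLevel', 'day'],
    ascending=[True, True, True, False, True],
)

# Number the rows within each student: 0 marks the first-month max to drop
position = df.groupby('StudentId').cumcount()

# Keep everything else, then restore the original row order
result = df[position != 0].sort_index()
```

On the sample this drops only row 0 (AAdams, 1.5 on day 15) and row 8 (BBrooks, 3.8), and leaves the second 1.5 and all later months alone.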
Upvotes: 2
Reputation: 2847
Try this:
# sort and reindex (DataFrame.sort was removed from pandas; use sort_values)
df = df.sort_values('col').reset_index()
# slice up to and including the first occurrence of your value
df.loc[: df[df['col'] == 'row'].index[0], :]
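A small sketch of what that slice does on a toy frame. The column name 'col', the value 'row', and the data are placeholders, not taken from the question, and drop=True is used here only to keep the toy output tidy:

```python
import pandas as pd

# Toy data: 'row' is the placeholder value being searched for
df = pd.DataFrame({'col': ['b', 'row', 'a', 'row'], 'n': [1, 2, 3, 4]})

# Sort, then rebuild a clean 0..n-1 index so positions and labels agree
df = df.sort_values('col').reset_index(drop=True)

# Label of the first row where col equals the value
first_hit = df[df['col'] == 'row'].index[0]

# .loc slicing is label-based and inclusive, so the matching row is kept
first_part = df.loc[:first_hit, :]
```

Because the slice is inclusive, first_part ends with the first matching row. It keeps the rows up to that point, so it does not by itself drop a row.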
Upvotes: 0