Reputation: 1352
I have the data in the .csv file uploaded in the following link
In this file, I have the following columns:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
There will be duplicates in the columns Team. Another column is SimStage. Simstage has a series containing data from 0 to N (in this case 0 to 4)
I would like to keep row for each team at each Simstage value (i.e. the rest will be removed). when we remove, the duplicates row with lower value in the column Points will be removed for each Team and SimStage. Since it is slightly difficult to explain using words alone, I attached a picture here.
In this picture, the row with highlighted in red boxes will be be removed.
I used df.duplicates()
but it does not work.
Upvotes: 1
Views: 997
Reputation: 999
It looks like you want to only keep the highest value from the 'Points' column. Therefore, use the first
aggregation function in pandas
Create the dataframe and call it df
data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil', 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},
'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E', 5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO', 5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
'3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
# Sort the values by 'Points' descending and 'SimStage' ascending
df = df.sort_values('Points', ascending=False)
df = df.sort_values('SimStage')
# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the orginal order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
Upvotes: 2
Reputation: 6669
I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.
import pandas as pd
namesDf = pd.DataFrame()
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]
Now, for each Sim Stage, you want the highest Point value. So, I first group them by Team and Sim Stage and then Sort them by Points.
namesDf = namesDf.groupby(['Team', 'SimStage'], as_index = False).apply(lambda x: x.sort_values(['Points'], ascending = False)).reset_index(drop = True)
This will make my dataframe look like this, notice the change in Sim Stage with value 3:
Team SimStage Points
0 Brazil 0 4
1 Brazil 0 4
2 Brazil 1 4
3 Brazil 1 4
4 Brazil 2 4
5 Brazil 2 1
6 Brazil 3 4
7 Brazil 3 2
8 Brazil 4 4
9 Brazil 4 1
And now I remove the duplicates by keeping the first instance of every team and sim stage.
namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')
Final result:
Team SimStage Points
0 Brazil 0 4
2 Brazil 1 4
4 Brazil 2 4
6 Brazil 3 4
8 Brazil 4 4
Upvotes: 1