using duplicates values from one column to remove entire row in pandas dataframe

Question

I have the data in the .csv file uploaded in the following link

In this file, I have the following columns:

Team    Group    Model   SimStage  Points  GpWinner  GpRunnerup 3rd   4th

There will be duplicates in the columns Team. Another column is SimStage. Simstage has a series containing data from 0 to N (in this case 0 to 4)

I would like to keep row for each team at each Simstage value (i.e. the rest will be removed). when we remove, the duplicates row with lower value in the column Points will be removed for each Team and SimStage. Since it is slightly difficult to explain using words alone, I attached a picture here.

In this picture, the row with highlighted in red boxes will be be removed.

I used df.duplicates() but it does not work.

Dillon · Accepted Answer

It looks like you want to only keep the highest value from the 'Points' column. Therefore, use the first aggregation function in pandas

Create the dataframe and call it df

data = {'Team': {0: 'Brazil',  1: 'Brazil',  2: 'Brazil',  3: 'Brazil',  4: 'Brazil',  5: 'Brazil',  6: 'Brazil',  7: 'Brazil',  8: 'Brazil',  9: 'Brazil'},
 'Group': {0: 'Group E',  1: 'Group E',  2: 'Group E',  3: 'Group E',  4: 'Group E',  5: 'Group E',  6: 'Group E',  7: 'Group E',  8: 'Group E',  9: 'Group E'},
 'Model': {0: 'ELO',  1: 'ELO',  2: 'ELO',  3: 'ELO',  4: 'ELO',  5: 'ELO',  6: 'ELO',  7: 'ELO',  8: 'ELO',  9: 'ELO'},
 'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
 'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
 'GpWinner': {0: 0.2,  1: 0.2,  2: 0.2,  3: 0.2,  4: 0.2,  5: 0.0,  6: 0.2,  7: 0.2,  8: 0.2,  9: 0.0},
 'GpRunnerup': {0: 0.0,  1: 0.0,  2: 0.0,  3: 0.0,  4: 0.0,  5: 0.2,  6: 0.0,  7: 0.0,  8: 0.0,  9: 0.2},
 '3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
 '4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}

df = pd.DataFrame(data)

# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']

Method 1

# Sort the values by 'Points' descending and 'SimStage' ascending
df = df.sort_values('Points', ascending=False)
df = df.sort_values('SimStage')

# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')

# Output the dataframe in the orginal order
df[columns_order]

Out[]: 
     Team    Group Model  SimStage  Points  GpWinner  GpRunnerup  3rd  4th
0  Brazil  Group E   ELO         0       4       0.2         0.0    0    0
1  Brazil  Group E   ELO         1       4       0.2         0.0    0    0
2  Brazil  Group E   ELO         2       4       0.2         0.0    0    0
3  Brazil  Group E   ELO         3       4       0.2         0.0    0    0
4  Brazil  Group E   ELO         4       4       0.2         0.0    0    0

Method 2

df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]: 
     Team    Group Model  SimStage  Points  GpWinner  GpRunnerup  3rd  4th
0  Brazil  Group E   ELO         0       4       0.2         0.0    0    0
2  Brazil  Group E   ELO         1       4       0.2         0.0    0    0
4  Brazil  Group E   ELO         2       4       0.2         0.0    0    0
7  Brazil  Group E   ELO         3       4       0.2         0.0    0    0
8  Brazil  Group E   ELO         4       4       0.2         0.0    0    0

using duplicates values from one column to remove entire row in pandas dataframe

Answers (2)

Method 1

Method 2

Related Questions