Zephyr
Zephyr

Reputation: 1352

using duplicates values from one column to remove entire row in pandas dataframe

I have the data in the .csv file uploaded in the following link

Click here for the data

In this file, I have the following columns:

Team    Group    Model   SimStage  Points  GpWinner  GpRunnerup 3rd   4th

There will be duplicates in the columns Team. Another column is SimStage. Simstage has a series containing data from 0 to N (in this case 0 to 4)

I would like to keep row for each team at each Simstage value (i.e. the rest will be removed). when we remove, the duplicates row with lower value in the column Points will be removed for each Team and SimStage. Since it is slightly difficult to explain using words alone, I attached a picture here.

Sample pic that shows which rows to remove

In this picture, the row with highlighted in red boxes will be be removed.

I used df.duplicates() but it does not work.

Upvotes: 1

Views: 997

Answers (2)

Dillon
Dillon

Reputation: 999

It looks like you want to only keep the highest value from the 'Points' column. Therefore, use the first aggregation function in pandas

Create the dataframe and call it df

data = {'Team': {0: 'Brazil',  1: 'Brazil',  2: 'Brazil',  3: 'Brazil',  4: 'Brazil',  5: 'Brazil',  6: 'Brazil',  7: 'Brazil',  8: 'Brazil',  9: 'Brazil'},
 'Group': {0: 'Group E',  1: 'Group E',  2: 'Group E',  3: 'Group E',  4: 'Group E',  5: 'Group E',  6: 'Group E',  7: 'Group E',  8: 'Group E',  9: 'Group E'},
 'Model': {0: 'ELO',  1: 'ELO',  2: 'ELO',  3: 'ELO',  4: 'ELO',  5: 'ELO',  6: 'ELO',  7: 'ELO',  8: 'ELO',  9: 'ELO'},
 'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
 'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
 'GpWinner': {0: 0.2,  1: 0.2,  2: 0.2,  3: 0.2,  4: 0.2,  5: 0.0,  6: 0.2,  7: 0.2,  8: 0.2,  9: 0.0},
 'GpRunnerup': {0: 0.0,  1: 0.0,  2: 0.0,  3: 0.0,  4: 0.0,  5: 0.2,  6: 0.0,  7: 0.0,  8: 0.0,  9: 0.2},
 '3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
 '4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}

df = pd.DataFrame(data)

# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']

Method 1

# Sort the values by 'Points' descending and 'SimStage' ascending
df = df.sort_values('Points', ascending=False)
df = df.sort_values('SimStage')

# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')

# Output the dataframe in the orginal order
df[columns_order]

Out[]: 
     Team    Group Model  SimStage  Points  GpWinner  GpRunnerup  3rd  4th
0  Brazil  Group E   ELO         0       4       0.2         0.0    0    0
1  Brazil  Group E   ELO         1       4       0.2         0.0    0    0
2  Brazil  Group E   ELO         2       4       0.2         0.0    0    0
3  Brazil  Group E   ELO         3       4       0.2         0.0    0    0
4  Brazil  Group E   ELO         4       4       0.2         0.0    0    0

Method 2

df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]: 
     Team    Group Model  SimStage  Points  GpWinner  GpRunnerup  3rd  4th
0  Brazil  Group E   ELO         0       4       0.2         0.0    0    0
2  Brazil  Group E   ELO         1       4       0.2         0.0    0    0
4  Brazil  Group E   ELO         2       4       0.2         0.0    0    0
7  Brazil  Group E   ELO         3       4       0.2         0.0    0    0
8  Brazil  Group E   ELO         4       4       0.2         0.0    0    0

Upvotes: 2

Ankur Sinha
Ankur Sinha

Reputation: 6669

I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.

import pandas as pd

namesDf = pd.DataFrame() 
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]

Now, for each Sim Stage, you want the highest Point value. So, I first group them by Team and Sim Stage and then Sort them by Points.

namesDf = namesDf.groupby(['Team', 'SimStage'], as_index = False).apply(lambda x: x.sort_values(['Points'], ascending = False)).reset_index(drop = True)

This will make my dataframe look like this, notice the change in Sim Stage with value 3:

     Team  SimStage  Points
0  Brazil         0       4
1  Brazil         0       4
2  Brazil         1       4
3  Brazil         1       4
4  Brazil         2       4
5  Brazil         2       1
6  Brazil         3       4
7  Brazil         3       2
8  Brazil         4       4
9  Brazil         4       1

And now I remove the duplicates by keeping the first instance of every team and sim stage.

namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')

Final result:

     Team  SimStage  Points
0  Brazil         0       4
2  Brazil         1       4
4  Brazil         2       4
6  Brazil         3       4
8  Brazil         4       4

Upvotes: 1

Related Questions