Reputation: 43
Here is an example of my dataframe:
df = pd.DataFrame([['In', 'Age', 'Nat.'],
['Jakub Kiwior', 22, 'Poland'],
['Leandro Trossard', 28, 'Belgium'],
['Jorginho', 31, 'Italy'],
['Out', 'Age', 'Nat.'],
['Jhon Durán', 19, 'Colombia'],
['In', 'Age', 'Nat.'],
['Jhon Durán', 19, 'Colombia'],
['Álex Moreno', 29, 'Spain'],
['Out', 'Age', 'Nat.'],
['Leandro Trossard', 28, 'Belgium'],
['Jorginho', 31, 'Italy'],
['In', 'Age', 'Nat.'],
['Out', 'Age', 'Nat.'],
['In', 'Age', 'Nat.'],
], columns=['Player', 'Age', 'Nat.'])
My desired output is a dataframe that removes duplicate rows if the row above (not necessarily directly above) has the value 'Out' in the 'Player' column.
For example, the desired output would remove the first "Jhon Durán" row, and the second "Leandro Trossard" and "Jorginho" rows, since these are the rows with "Out" above them and not "In".
Is this possible to achieve with pandas?
Upvotes: 1
Views: 73
Reputation: 8768
I believe this is what youre looking for:
df.loc[df['Player'].isin(['In','Out']).cumsum().mod(2).eq(1) | df['Player'].eq('Out')]
or
s = df['Player'].eq('In')
e = df['Player'].eq('Out')
df.loc[s.where(s|e).ffill().fillna(False) | e]
or
s = df['Player'].eq('In')
e = df['Player'].eq('Out')
df.loc[e.iloc[::-1].groupby(s.cumsum()).cummax()]
Output:
Player Age Nat.
0 In Age Nat.
1 Jakub Kiwior 22 Poland
2 Leandro Trossard 28 Belgium
3 Jorginho 31 Italy
4 Out Age Nat.
6 In Age Nat.
7 Jhon Durán 19 Colombia
8 Álex Moreno 29 Spain
9 Out Age Nat.
12 In Age Nat.
13 Out Age Nat.
14 In Age Nat.
Upvotes: 0
Reputation: 261860
Use boolean indexing:
# keep rows that are after In and drop those after Out
m1 = df['Player'].map({'In': True, 'Out': False}).ffill()
# keep rows with In/Out
m2 = df['Player'].isin(['In', 'Out'])
out = df.loc[m1|m2]
Output:
Player Age Nat.
0 In Age Nat.
1 Jakub Kiwior 22 Poland
2 Leandro Trossard 28 Belgium
3 Jorginho 31 Italy
4 Out Age Nat.
6 In Age Nat.
7 Jhon Durán 19 Colombia
8 Álex Moreno 29 Spain
9 Out Age Nat.
12 In Age Nat.
13 Out Age Nat.
14 In Age Nat.
Upvotes: 2
Reputation: 34
It would be more beneficial to add an additional column "out/in" or another way to store your data. If we look at the header "player" it should only have players in it; "out" or "in" is not a relevant data entry for that column. It's really important to make sure we are creating clean and easily manipulated data.
Once we add a new column "out/in" we can easily change the player's status and drop players that are out.
Here is the above applied. Notice the addition of the status
column:
df = pd.DataFrame([['Jakub Kiwior', 22, 'Poland','in'],
['Leandro Trossard', 28, 'Belgium','in'],
['Jorginho', 31, 'Italy','in'],
['Jhon Durán', 19, 'Colombia', 'out'],
['Álex Moreno', 29, 'Spain','in'],
['Leandro Trossard', 28, 'Belgium', 'out'],
['Jorginho', 31, 'Italy','in'],
], columns=['Player', 'Age', 'Nat.', 'status'])
Here is the original data frame:
if we want to change a player from in
to out
:
player_name = 'Leandro Trossard'
df.loc[df['Player'] == player_name, 'status'] = 'out'
Which gives us:
if we want to drop all players that are out from the dataframe:
in_players_df = df[df['status']!='out']
Which gives us:
This creates a new df with only 'in' players. This allows us to have access to a filtered view (
in_players_df
), and a view of all players regardless of in/out (df
). If we used df=...
it would change the original dataframe. Just depends what you're after :)
Upvotes: 0
Reputation: 16
You could use Pandas shift method to help achieve this.
df['previousPlayer'] = df['Player'].shift(1)
df
Player Age Nat. previousPlayer
0 In Age Nat. NaN
1 Jakub Kiwior 22 Poland In
2 Leandro Trossard 28 Belgium Jakub Kiwior
3 Jorginho 31 Italy Leandro Trossard
4 Out Age Nat. Jorginho
5 Jhon Durán 19 Colombia Out
6 In Age Nat. Jhon Durán
7 Jhon Durán 19 Colombia In
8 Álex Moreno 29 Spain Jhon Durán
9 Out Age Nat. Álex Moreno
10 Leandro Trossard 28 Belgium Out
11 Jorginho 31 Italy Leandro Trossard
12 In Age Nat. Jorginho
13 Out Age Nat. In
14 In Age Nat. Out
Then simply filter out any values in the new column with the word of your choice:
df = df[df.previousPlayer != 'Out'].drop('previousPlayer', axis=1)
print(df)
Player Age Nat.
0 In Age Nat.
1 Jakub Kiwior 22 Poland
2 Leandro Trossard 28 Belgium
3 Jorginho 31 Italy
4 Out Age Nat.
6 In Age Nat.
7 Jhon Durán 19 Colombia
8 Álex Moreno 29 Spain
9 Out Age Nat.
11 Jorginho 31 Italy
12 In Age Nat.
13 Out Age Nat.
Upvotes: 0