Reputation: 43
Here is an example of my dataframe:
df = pd.DataFrame([['Arsenal FC', '', ''],
['In', 'Age', 'Nat.'],
['Leandro Trossard', 28, 'Belgium'],
['Jakub Kiwior', 22, 'Poland'],
['Jorginho', 32, 'Italy'],
['Chelsea FC', '', ''],
['In', 'Age', 'Nat.'],
['Enzo Fernández ', 22, 'Argentina'],
['Mykhaylo Mudryk', 22, 'Ukraine'],
], columns=['Player', 'Age', 'Nat.'])
I want to create a new column "Club"
which takes the string value of the cell in the "Player"
and attaches it to the player below.
The tricky part is getting the right clubs
assigned to the right players
This is my desired output:
df = pd.DataFrame([['In', 'Age', 'Nat.'],
['Leandro Trossard', 28, 'Belgium', 'Arsenal FC'],
['Jakub Kiwior', 22, 'Poland', 'Arsenal FC'],
['Jorginho', 32, 'Italy', 'Arsenal FC'],
['In', 'Age', 'Nat.'],
['Enzo Fernández ', 22, 'Argentina', 'Chelsea FC'],
['Mykhaylo Mudryk', 22, 'Ukraine', 'Chelsea FC'],
], columns=['Player', 'Age', 'Nat.', 'Club'])
I can't find another question that relates to this problem. Is this possible in python?
Upvotes: 1
Views: 74
Reputation: 260580
One option using boolean masks with mask
and ffill
:
# which rows are empty string on Age?
m1 = df['Age'].ne('')
# which row are not internal headers?
m2 = df['Player'].ne('Player')
out = df[m1&m2].assign(Club=df.loc[m2, 'Player'].mask(m1).ffill())
Output:
Player Age Nat. Club
2 Leandro Trossard 28 Belgium Arsenal FC
3 Jakub Kiwior 22 Poland Arsenal FC
4 Jorginho 32 Italy Arsenal FC
7 Enzo Fernández 22 Argentina Chelsea FC
8 Mykhaylo Mudryk 22 Ukraine Chelsea FC
Intermediates:
Player m1 mask ffill
0 Arsenal FC False Arsenal FC Arsenal FC
2 Leandro Trossard True NaN Arsenal FC
3 Jakub Kiwior True NaN Arsenal FC
4 Jorginho True NaN Arsenal FC
5 Chelsea FC False Chelsea FC Chelsea FC
7 Enzo Fernández True NaN Chelsea FC
8 Mykhaylo Mudryk True NaN Chelsea FC
# which rows are empty string on Age?
m1 = df['Age'].ne('')
# which row are not internal headers?
m2 = df['Player'].ne('In')
out = df[m1].assign(Club=df.loc[m2, 'Player'].mask(m1).ffill())
Output:
Player Age Nat. Club
1 In Age Nat. NaN
2 Leandro Trossard 28 Belgium Arsenal FC
3 Jakub Kiwior 22 Poland Arsenal FC
4 Jorginho 32 Italy Arsenal FC
6 In Age Nat. NaN
7 Enzo Fernández 22 Argentina Chelsea FC
8 Mykhaylo Mudryk 22 Ukraine Chelsea FC
Upvotes: 2
Reputation: 87
Edited:
df = pd.DataFrame([['Arsenal FC', '', ''],
['In', 'Age', 'Nat.'],
['Leandro Trossard', 28, 'Belgium'],
['Jakub Kiwior', 22, 'Poland'],
['Jorginho', 32, 'Italy'],
['Chelsea FC', '', ''],
['In', 'Age', 'Nat.'],
['Enzo Fernández ', 22, 'Argentina'],
['Mykhaylo Mudryk', 22, 'Ukraine'],
], columns=['Player', 'Age', 'Nat.'])
clubs = []
current_club = None
for i, row in df.iterrows():
if row['Player'] in ['Arsenal FC', 'Chelsea FC']:
current_club = row['Player']
elif row['Player'] == 'In':
continue
else:
clubs.append(current_club)
df['Club'] = clubs
print(df)
Output:
Player Age Nat. Club
0 Arsenal FC Arsenal FC
1 In Age Nat. NaN
2 Leandro Trossard 28 Belgium Arsenal FC
3 Jakub Kiwior 22 Poland Arsenal FC
4 Jorginho 32 Italy Arsenal FC
5 Chelsea FC Chelsea FC
6 In Age Nat. NaN
7 Enzo Fernández 22 Argentina Chelsea FC
8 Mykhaylo Mudryk 22 Ukraine Chelsea FC
Edit 2: Multiple club names
clubs = ['Arsenal FC', 'Chelsea FC', 'Other Club 1', 'Other Club 2', ..., 'Other Club n']
def get_club(row, clubs):
if row['Player'] in clubs:
return row['Player']
else:
return ''
df['Club'] = ''
club = ''
for index, row in df.iterrows():
if row['Player'] in clubs:
club = row['Player']
else:
df.at[index, 'Club'] = club
df = df[df['Club'] != ''].reset_index(drop=True)
df['Club'] = df.apply(lambda x: get_club(x, clubs), axis=1)
Upvotes: 1