Reputation: 1792
I am playing around with some NFL data and I have a column in a dataframe that looks like:
0 Lamar JacksonL. Jackson BAL
1 Patrick Mahomes IIP. Mahomes KC
2 Dak PrescottD. Prescott DAL
3 Josh AllenJ. Allen BUF
4 Russell WilsonR. Wilson SEA
There are 3 bits of information in each cell - FullName
, ShortName
and Team
whihc i am hoping to create new columns for.
Expected output:
FullName ShortName Team
0 Lamar Jackson L. Jackson BAL
1 Patrick Mahomes II P. Mahomes KC
2 Dak Prescott D. Prescott DAL
3 Josh Allen J. Allen BUF
4 Russell Wilson R. Wilson SEA
Ive managed to get the Team
but I'm not quite sure how to do all three in the one line.
I was thinking of splitting the string by finding the previous character from the fullstop
however there are some names that appear such as:
Anthony McFarland Jr.A. McFarland PIT
which have multiple full stops.
Anyone have an idea of the best way to approach this? Thanks!
Upvotes: 0
Views: 997
Reputation: 13417
The pandas Series str.extract method is what you're looking for. This regex works for all of the cases you've presented, though there may be some other edge cases.
df = pd.DataFrame({
"bad_col": ["Lamar JacksonL. Jackson BAL", "Patrick Mahomes IIP. Mahomes KC",
"Dak PrescottD. Prescott DAL", "Josh AllenJ. Allen BUF",
"Josh AllenJ. Allen SEA", "Anthony McFarland Jr.A. McFarland PIT"],
})
print(df)
bad_col
0 Lamar JacksonL. Jackson BAL
1 Patrick Mahomes IIP. Mahomes KC
2 Dak PrescottD. Prescott DAL
3 Josh AllenJ. Allen BUF
4 Josh AllenJ. Allen SEA
5 Anthony McFarland Jr.A. McFarland PIT
pattern = r"(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)"
new_df = df["bad_col"].str.extract(pattern, expand=True)
print(new_df)
full_name short_name team
0 Lamar Jackson L. Jackson BAL
1 Patrick Mahomes II P. Mahomes KC
2 Dak Prescott D. Prescott DAL
3 Josh Allen J. Allen BUF
4 Josh Allen J. Allen SEA
5 Anthony McFarland Jr. A. McFarland PIT
Breaking down that regex:
(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)
(?P<full_name>.+)(?=[A-Z]\.)
captures any letter UNTIL we see a capital letter followed by a fullstop/period we use a lookahead (?=...) to not consume the capital letter and fullstop because this part of the string belongs to the short name
(?P<short_name>[A-Z]\.\s.*.)\s
captures a capital letter (the players first initial), then a fullstop (the period that comes after their first initial), then a space (between first initial and last name), then all characters until we hit a space (the players last name). The space is not included in the capture group.
(?P<team>[A-Z]+)
capture all of the remaining capital letters in the string (ends up being the players team)
You've probably noticed that I've used named capture groups as denoted by the (?Ppattern) structure. In pandas, the name of the capture group becomes the name of the column and whatever is captured in that group becomes the values in that column.
Now to join the new dataframe back to our original one to come full circle:
df = df.join(new_df)
print(df)
bad_col full_name short_name \
0 Lamar JacksonL. Jackson BAL Lamar Jackson L. Jackson
1 Patrick Mahomes IIP. Mahomes KC Patrick Mahomes II P. Mahomes
2 Dak PrescottD. Prescott DAL Dak Prescott D. Prescott
3 Josh AllenJ. Allen BUF Josh Allen J. Allen
4 Josh AllenJ. Allen SEA Josh Allen J. Allen
5 Anthony McFarland Jr.A. McFarland PIT Anthony McFarland Jr. A. McFarland
team
0 BAL
1 KC
2 DAL
3 BUF
4 SEA
5 PIT
Upvotes: 2
Reputation: 1000
import pandas as pd
import numpy as np
df = pd.DataFrame({'players':['Lamar JacksonL. Jackson BAL', 'Patrick Mahomes IIP. Mahomes KC',
'Anthony McFarland Jr.A. McFarland PIT']})
def splitName(name):
last_period_pos = np.max(np.where(np.array(list(name)) == '.'))
full_name = name[:(last_period_pos - 1)]
short_name_team = name[(last_period_pos - 1):]
team_pos = np.max(np.where(np.array(list(short_name_team)) == ' '))
short_name = short_name_team[:team_pos]
team = short_name_team[(team_pos + 1):]
return full_name, short_name, team
df['full_name'], df['short_name'], df['team'] = zip(*df.players.apply(splitName))
Upvotes: 0
Reputation: 293
This might help.
import re
name = 'Anthony McFarland Jr.A. McFarland PIT'
short_name = re.findall(r'(\w\.\s[\w]+)\s[\w]{3}', name)[0]
full_name = name.replace(short_name, "")[:-4]
team = name[-3:]
print(short_name)
print(full_name)
print(team)
Ouput:
A. McFarland
Anthony McFarland Jr.
PIT
Upvotes: 0
Reputation: 107
My guess is that short names would not contain fullstops. So, you can search for the first fullstop from the end of the line. So, from one character before that fullstop until the first space is your short name. Anything coming before one letter before that fullstop is going to be FullName.
Upvotes: 0