SOK
SOK

Reputation: 1792

How to split a string in a pandas dataframe with capital letters

I am playing around with some NFL data and I have a column in a dataframe that looks like:

0         Lamar JacksonL. Jackson BAL
1     Patrick Mahomes IIP. Mahomes KC
2         Dak PrescottD. Prescott DAL
3              Josh AllenJ. Allen BUF
4         Russell WilsonR. Wilson SEA

There are 3 bits of information in each cell - FullName, ShortName and Team whihc i am hoping to create new columns for.

Expected output:

         FullName                ShortName        Team
0         Lamar Jackson           L. Jackson        BAL
1         Patrick Mahomes II      P. Mahomes        KC
2         Dak Prescott            D. Prescott       DAL
3         Josh Allen              J. Allen          BUF
4         Russell Wilson          R. Wilson         SEA

Ive managed to get the Team but I'm not quite sure how to do all three in the one line.

I was thinking of splitting the string by finding the previous character from the fullstop however there are some names that appear such as:

Anthony McFarland Jr.A. McFarland PIT

which have multiple full stops.

Anyone have an idea of the best way to approach this? Thanks!

Upvotes: 0

Views: 997

Answers (4)

Cameron Riddell
Cameron Riddell

Reputation: 13417

The pandas Series str.extract method is what you're looking for. This regex works for all of the cases you've presented, though there may be some other edge cases.

df = pd.DataFrame({
    "bad_col": ["Lamar JacksonL. Jackson BAL", "Patrick Mahomes IIP. Mahomes KC", 
                "Dak PrescottD. Prescott DAL", "Josh AllenJ. Allen BUF", 
                "Josh AllenJ. Allen SEA", "Anthony McFarland Jr.A. McFarland PIT"],
})

print(df)
                                 bad_col
0            Lamar JacksonL. Jackson BAL
1        Patrick Mahomes IIP. Mahomes KC
2            Dak PrescottD. Prescott DAL
3                 Josh AllenJ. Allen BUF
4                 Josh AllenJ. Allen SEA
5  Anthony McFarland Jr.A. McFarland PIT


pattern = r"(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)"
new_df = df["bad_col"].str.extract(pattern, expand=True)
print(new_df)
               full_name    short_name team
0          Lamar Jackson    L. Jackson  BAL
1     Patrick Mahomes II    P. Mahomes   KC
2           Dak Prescott   D. Prescott  DAL
3             Josh Allen      J. Allen  BUF
4             Josh Allen      J. Allen  SEA
5  Anthony McFarland Jr.  A. McFarland  PIT

Breaking down that regex:

(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)
  • (?P<full_name>.+)(?=[A-Z]\.) captures any letter UNTIL we see a capital letter followed by a fullstop/period we use a lookahead (?=...) to not consume the capital letter and fullstop because this part of the string belongs to the short name

  • (?P<short_name>[A-Z]\.\s.*.)\s captures a capital letter (the players first initial), then a fullstop (the period that comes after their first initial), then a space (between first initial and last name), then all characters until we hit a space (the players last name). The space is not included in the capture group.

  • (?P<team>[A-Z]+) capture all of the remaining capital letters in the string (ends up being the players team)

You've probably noticed that I've used named capture groups as denoted by the (?Ppattern) structure. In pandas, the name of the capture group becomes the name of the column and whatever is captured in that group becomes the values in that column.

Now to join the new dataframe back to our original one to come full circle:

df = df.join(new_df)
print(df)
                                 bad_col              full_name    short_name  \
0            Lamar JacksonL. Jackson BAL          Lamar Jackson    L. Jackson   
1        Patrick Mahomes IIP. Mahomes KC     Patrick Mahomes II    P. Mahomes   
2            Dak PrescottD. Prescott DAL           Dak Prescott   D. Prescott   
3                 Josh AllenJ. Allen BUF             Josh Allen      J. Allen   
4                 Josh AllenJ. Allen SEA             Josh Allen      J. Allen   
5  Anthony McFarland Jr.A. McFarland PIT  Anthony McFarland Jr.  A. McFarland   

  team  
0  BAL  
1   KC  
2  DAL  
3  BUF  
4  SEA  
5  PIT  

Upvotes: 2

Rajesh Bhat
Rajesh Bhat

Reputation: 1000

import pandas as pd
import numpy as np

df = pd.DataFrame({'players':['Lamar JacksonL. Jackson BAL', 'Patrick Mahomes IIP. Mahomes KC', 
                         'Anthony McFarland Jr.A. McFarland PIT']})

def splitName(name):
    last_period_pos = np.max(np.where(np.array(list(name)) == '.'))
    full_name = name[:(last_period_pos - 1)]
    short_name_team = name[(last_period_pos - 1):]
    team_pos = np.max(np.where(np.array(list(short_name_team)) == ' '))
    short_name = short_name_team[:team_pos]
    team = short_name_team[(team_pos + 1):]
    return full_name, short_name, team

df['full_name'], df['short_name'], df['team'] = zip(*df.players.apply(splitName))

Upvotes: 0

Md Zafar Hassan
Md Zafar Hassan

Reputation: 293

This might help.

import re

name = 'Anthony McFarland Jr.A. McFarland PIT'

short_name = re.findall(r'(\w\.\s[\w]+)\s[\w]{3}', name)[0]
full_name = name.replace(short_name, "")[:-4]
team = name[-3:]

print(short_name)
print(full_name)
print(team)

Ouput:

A. McFarland
Anthony McFarland Jr.
PIT

Upvotes: 0

user3314148
user3314148

Reputation: 107

My guess is that short names would not contain fullstops. So, you can search for the first fullstop from the end of the line. So, from one character before that fullstop until the first space is your short name. Anything coming before one letter before that fullstop is going to be FullName.

Upvotes: 0

Related Questions