Bluetail
Bluetail

Reputation: 1291

How to separate a string with 2 uppercases and a space with regex in pandas dataframe?

I have a dataframe column, teams, where I am trying to split the team name, 'CubsWhite Sox', into two parts, 'Cubs' and 'White Sox'.

import pandas as pd
import re
data = [{'teams':'CubsWhite Sox','area':'Chicago','league': 'MLB'}, {'teams': 'Red Sox','area':'Boston', 'league': 'MLB'}, {'teams': 'Blue Jay','area':'Toronto', 'league': 'MLB'}] 

df = pd.DataFrame(data) 
df

so far I could only achieve this result.

df["team"] = df.apply(lambda x: re.findall(r"[A-Z][^A-Z]*(?:\s[A-Z][^A-Z]*)", x["teams"]), axis=1)
df
    teams           area    league   team
0   CubsWhite Sox   Chicago MLB      [White Sox]
1   Red Sox         Boston  MLB      [Red Sox]
2   Blue Jay        Toronto MLB      [Blue Jay]

Also after white, red and blue there are two spaces as I have discovered from here.

df["team"] = df.apply(lambda x: re.findall(r"[A-Z0-9][^A-Z]*", x["teams"]), axis=1)
df
    teams           area    league  team
0   CubsWhite Sox   Chicago MLB     [Cubs, White , Sox]
1   Red Sox         Boston  MLB     [Red , Sox]
2   Blue Jay        Toronto MLB     [Blue , Jay]

which I can easily remove with

df['teams'] = df['teams'].str.replace(r' +', '')

Can you help me to split these team names like this, please using re.findall?

df
    teams           area    league  team
0   CubsWhite Sox   Chicago MLB     [Cubs, White Sox]
1   Red Sox         Boston  MLB     [Red Sox]
2   Blue Jay        Toronto MLB     [Blue Jay]

thank you!

Upvotes: 3

Views: 86

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627082

You can use

df['team'] = df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')

See the regex demo. Details:

  • [A-Z][a-z]* - an uppercase letter followed with any zero or more lowercase letters
  • (?:\s+[A-Z][a-z]*)? - an optional non-capturing group that matches
    • \s+ - one or more whitespaces
    • [A-Z][a-z]* - an uppercase letter followed with any zero or more lowercase letters.

Pandas test:

>>> df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
0    [Cubs, White Sox]
1            [Red Sox]
2           [Blue Jay]
Name: teams, dtype: object

Upvotes: 3

Quang Hoang
Quang Hoang

Reputation: 150785

You can try:

df['new_teams'] = (df.teams.str.extract('([A-Z][a-z]+)?([A-Z]\w+\s+\w+)')
                     .apply(lambda x: list(x.dropna()), axis=1)
                  )

Output:

           teams     area league          new_teams
0  CubsWhite Sox  Chicago    MLB  [Cubs, White Sox]
1        Red Sox   Boston    MLB          [Red Sox]
2       Blue Jay  Toronto    MLB         [Blue Jay]

Upvotes: 1

Related Questions