Reputation: 1291
I have a dataframe column, teams, where I am trying to split the team name, 'CubsWhite Sox', into two parts, 'Cubs' and 'White Sox'.
import pandas as pd
import re
data = [{'teams':'CubsWhite Sox','area':'Chicago','league': 'MLB'}, {'teams': 'Red Sox','area':'Boston', 'league': 'MLB'}, {'teams': 'Blue Jay','area':'Toronto', 'league': 'MLB'}]
df = pd.DataFrame(data)
df
so far I could only achieve this result.
df["team"] = df.apply(lambda x: re.findall(r"[A-Z][^A-Z]*(?:\s[A-Z][^A-Z]*)", x["teams"]), axis=1)
df
teams area league team
0 CubsWhite Sox Chicago MLB [White Sox]
1 Red Sox Boston MLB [Red Sox]
2 Blue Jay Toronto MLB [Blue Jay]
Also after white, red and blue there are two spaces as I have discovered from here.
df["team"] = df.apply(lambda x: re.findall(r"[A-Z0-9][^A-Z]*", x["teams"]), axis=1)
df
teams area league team
0 CubsWhite Sox Chicago MLB [Cubs, White , Sox]
1 Red Sox Boston MLB [Red , Sox]
2 Blue Jay Toronto MLB [Blue , Jay]
which I can easily remove with
df['teams'] = df['teams'].str.replace(r' +', '')
Can you help me to split these team names like this, please using re.findall?
df
teams area league team
0 CubsWhite Sox Chicago MLB [Cubs, White Sox]
1 Red Sox Boston MLB [Red Sox]
2 Blue Jay Toronto MLB [Blue Jay]
thank you!
Upvotes: 3
Views: 86
Reputation: 627082
You can use
df['team'] = df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
See the regex demo. Details:
[A-Z][a-z]*
- an uppercase letter followed with any zero or more lowercase letters(?:\s+[A-Z][a-z]*)?
- an optional non-capturing group that matches
\s+
- one or more whitespaces[A-Z][a-z]*
- an uppercase letter followed with any zero or more lowercase letters.Pandas test:
>>> df['teams'].str.findall(r'[A-Z][a-z]*(?:\s+[A-Z][a-z]*)?')
0 [Cubs, White Sox]
1 [Red Sox]
2 [Blue Jay]
Name: teams, dtype: object
Upvotes: 3
Reputation: 150785
You can try:
df['new_teams'] = (df.teams.str.extract('([A-Z][a-z]+)?([A-Z]\w+\s+\w+)')
.apply(lambda x: list(x.dropna()), axis=1)
)
Output:
teams area league new_teams
0 CubsWhite Sox Chicago MLB [Cubs, White Sox]
1 Red Sox Boston MLB [Red Sox]
2 Blue Jay Toronto MLB [Blue Jay]
Upvotes: 1