Reputation: 255
I'm scraping some data on college basketball teams from ESPN's BPI page (http://www.espn.com/mens-college-basketball/bpi/_/view/resume) to store in a pandas dataframe. When I read the HTML table into a dataframe, the abbreviated school name is appended to the full school name. E.g., I have several strings that look like this: "North CarolinaUNC".
How can I remove the UNC from the end of the string? I tried the below regex to match characters at the end of strings:
name = "North CarolinaUNC"
name = re.sub(r"\z[A-Z]","", name)
but it won't work for schools whose names are made up of two words. Is there a way to write a rule that removes uppercase characters from a string when they are preceded by a lowercase character?
Upvotes: 1
Views: 622
Reputation: 140256
Use $ to match the end of the string, and a positive lookbehind (zero-width, so it consumes no characters) to check that the uppercase letters come directly after a lowercase letter:
import re
name = "North CarolinaUNC"
name = re.sub(r"(?<=[a-z])[A-Z]+$","", name)
This results in "North Carolina", as expected.
And with that expression, "North Carolina UNC" stays unmodified because the uppercase letters, even though they are at the end of the string, do not come directly after a lowercase letter.
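As a minimal sketch of how the same pattern behaves across a few inputs: the helper name strip_abbrev and the sample strings below are made up for illustration and are not taken from the ESPN page.

```python
import re

# A run of uppercase letters at the very end of the string,
# directly preceded by a lowercase letter (checked, not consumed).
TRAILING_ABBREV = re.compile(r"(?<=[a-z])[A-Z]+$")

def strip_abbrev(name):
    """Drop a trailing all-caps abbreviation glued onto the school name."""
    return TRAILING_ABBREV.sub("", name)

# Hypothetical sample inputs for illustration only.
samples = ["North CarolinaUNC", "DukeDUKE", "North Carolina UNC", "UCLA"]
cleaned = [strip_abbrev(s) for s in samples]
print(cleaned)
```

The last two samples come back unchanged: in "North Carolina UNC" a space precedes the uppercase run, and in "UCLA" there is no lowercase letter before it. For a dataframe column, the same pattern can be passed to pandas' Series.str.replace with regex=True.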
Upvotes: 1