Reputation: 1030
[Answered first part, please scroll for second question edit]
I'm currently coding a web scraper in Python. I have the following example string:
Columbus Blue Jackets at Buffalo Sabres - 10/09/2014
I want to split it so that I have [Columbus Blue Jackets, Buffalo Sabres, 10/09/2014]
I read up on regular expressions including a few answers on this site but can't figure out how to format my particular example. The best I could come up with was something like this, although it doesn't work.
re.split('\w+\s\w\w\s\w+\s\.\s\w+', teams)
My second try is:
re.split("\w+\s'at'\s\w+\s'-'\s\w+", teams)
However, I'm not sure whether you can even put exact strings like 'at' and '-' inside a regex.
Please let me know where I'm going wrong with the regex function or if there's another way to delimit my particular example in python.
(Also note that each team name can be either 2 or 3 words, e.g. Montreal Canadiens at Buffalo Sabres.)
EDIT:
re.split(r"\s+at\s+|\s+-\s+", teams)
seems to do the trick. However, I now have a second problem: this works when I test it in its own file, but for some reason it doesn't work in my program.
Code:
def getTable(url):
    currentMatchup = Crawl.setup(url)
    teams = currentMatchup.title.string
    print(teams)
    re.split(r"\s+at\s+|\s+-\s+", teams)
    print(teams)
The output is:
Columbus Blue Jackets at Buffalo Sabres - 10/09/2014
Columbus Blue Jackets at Buffalo Sabres - 10/09/2014
Any ideas?
Upvotes: 1
Views: 147
Reputation: 67998
print(re.split(r"\s+at\s+|\s+-\s+", teams))
Output: ['Columbus Blue Jackets', 'Buffalo Sabres', '10/09/2014']
Try this. You can do it in one line. Here teams is your string. This will give you the desired result.
Edit: assign the result of re.split to a variable and print that instead of teams:
def getTable(url):
    currentMatchup = Crawl.setup(url)
    teams = currentMatchup.title.string
    print(teams)
    y = re.split(r"\s+at\s+|\s+-\s+", teams)
    print(y)
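The version in the question prints the same string twice because Python strings are immutable: re.split returns a new list and leaves teams untouched. Here is a minimal sketch of that behaviour. The hard-coded title is a stand-in for currentMatchup.title.string, since Crawl.setup is not shown in the question, and the names unchanged, parts, away, home and date are only for illustration.

```python
import re

# Hard-coded stand-in for currentMatchup.title.string (assumption for illustration).
teams = "Columbus Blue Jackets at Buffalo Sabres - 10/09/2014"

# Calling re.split and discarding its return value changes nothing.
re.split(r"\s+at\s+|\s+-\s+", teams)
unchanged = teams

# Assigning the result keeps the list of parts.
parts = re.split(r"\s+at\s+|\s+-\s+", teams)
away, home, date = parts

print(unchanged)
print(parts)
```

Unpacking into three names like this assumes the title always splits into exactly three parts; a title in a different shape would raise a ValueError.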
Upvotes: 1
Reputation: 174874
You could split the input string on <space>at<space> or <space>-<space>:
>>> s = "Columbus Blue Jackets at Buffalo Sabres - 10/09/2014"
>>> re.split(r'\s+(?:at|-)\s+', s)
['Columbus Blue Jackets', 'Buffalo Sabres', '10/09/2014']
>>> s = 'Montreal Canadiens at Buffalo Sabres - 10/09/2014'
>>> re.split(r'\s+(?:at|-)\s+', s)
['Montreal Canadiens', 'Buffalo Sabres', '10/09/2014']
Or, using the re.findall function:
>>> s = "Columbus Blue Jackets at Buffalo Sabres - 10/09/2014"
>>> re.findall(r'\b[A-Z]\S+(?:\s+[A-Z]\S+){1,}|(?<=-\s)\S+', s)
['Columbus Blue Jackets', 'Buffalo Sabres', '10/09/2014']
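A note on why the split pattern uses (?:...) rather than plain parentheses: when the pattern contains a capturing group, re.split also puts the captured delimiter text into the result list. A small sketch of the difference, reusing the example string (the variable names are just for illustration):

```python
import re

s = "Columbus Blue Jackets at Buffalo Sabres - 10/09/2014"

# Non-capturing group: the delimiters are dropped from the result.
without_delims = re.split(r"\s+(?:at|-)\s+", s)

# Capturing group: re.split also returns the text matched by the group.
with_delims = re.split(r"\s+(at|-)\s+", s)

print(without_delims)  # ['Columbus Blue Jackets', 'Buffalo Sabres', '10/09/2014']
print(with_delims)     # ['Columbus Blue Jackets', 'at', 'Buffalo Sabres', '-', '10/09/2014']
```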
Upvotes: 1
Reputation: 20486
Capture the three parts into groups, using lazy (non-greedy) .*? repetition for the team names.
(.*?)\s+at\s+(.*?)\s+-\s+(\d{2}/\d{2}/\d{4})
import re
match = re.search(r"(.*?)\s+at\s+(.*?)\s+-\s+(\d{2}/\d{2}/\d{4})", "Columbus Blue Jackets at Buffalo Sabres - 10/09/2014")
print(match.groups())
# ('Columbus Blue Jackets', 'Buffalo Sabres', '10/09/2014')
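In a scraper, some page titles may not follow this format, and re.search returns None in that case, so calling .groups() directly would fail. Below is a rough sketch of one way to guard against that, using the same pattern with named groups. The function name parse_matchup and the group names away, home and date are my own choices, not anything from the question.

```python
import re

# Same pattern as above, with named groups (the group names are my own choice).
MATCHUP = re.compile(
    r"(?P<away>.*?)\s+at\s+(?P<home>.*?)\s+-\s+(?P<date>\d{2}/\d{2}/\d{4})"
)

def parse_matchup(title):
    """Return a dict with away, home and date, or None if the title does not match."""
    match = MATCHUP.search(title)
    if match is None:
        return None
    return match.groupdict()

result = parse_matchup("Montreal Canadiens at Buffalo Sabres - 10/09/2014")
print(result)
print(parse_matchup("not a matchup title"))
```

The first call should give a dict with the two team names and the date string; the second returns None, which the caller can check before using the result.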
Upvotes: 0