Reputation: 1
I'm trying to use re patterns within Scrapy to parse a string in the format below. I want to retrieve the numbers within the font tags (e.g. 08:00). That is easy enough to do as one list with (\d+:\d+)+ but I need two separate lists, one for AM and one for PM. Can this only be done by creating two substrings, AM and PM, and then running the pattern against each of them? The (AM - and (PM - markers are unique. It feels like it should be possible to do this directly, but I'm out of ideas. Thanks.
Example input:
(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)
Upvotes: 0
Views: 116
Reputation: 391
If your string is always going to look like the example then you can do this using the following regex:
import re
# Raw string so that \d is not treated as a string escape.
capture = re.compile(r"(?<=>)[\d:]*(?=<)")
res = capture.findall("(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)")
for match in res:
    print(match)
This won't work if you have other types of tags in there though, as it just matches any run of digits and colons sitting directly between > and < with no spaces.
Result:
08:00
09:00
10:100
190:00
175:00
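To get separate AM and PM lists rather than one combined list, one option is to capture each parenthesised group first and then apply the same tag-based pattern to each group body. This is only a minimal sketch, assuming every group is wrapped in parentheses, starts with AM or PM, and contains no nested closing parenthesis, as in the example input. The names group, tagged and times are illustrative, not from the question.

```python
import re

data = ("(AM – 07:00 <font color=#0002fe>08:00</font> "
        "<font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> "
        "(PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)")

# Capture the label (AM or PM) and the body of each parenthesised group.
group = re.compile(r"\((AM|PM)\b([^)]*)\)")
# Same idea as the pattern above: digits and colons directly between > and <.
tagged = re.compile(r"(?<=>)[\d:]+(?=<)")

# Map each label to the list of tagged values found inside its group.
times = {label: tagged.findall(body) for label, body in group.findall(data)}

print(times["AM"])
print(times["PM"])
```

The untagged values (07:00 and 18:00) are skipped because they are not directly enclosed by > and <.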
Upvotes: 1
Reputation: 474161
I would first eliminate the HTML tags and get the plain text to work with. For that, you can use an HTML parser, like BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> data = '(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)'
>>> soup = BeautifulSoup(data, "html.parser")
>>> data = soup.get_text()
>>> AM, PM = data.split("  ")  # split on the two consecutive spaces left where <br> was
>>> AM
u'(AM \u2013 07:00 08:00 09:00 10:100)'
>>> PM
u'(PM \u2013 18:00 190:00 175:00)'
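From there, each half can be searched on its own. The following is a rough sketch that starts from the two plain-text strings shown above, hard-coded here so it runs without bs4. The names time_pattern, am_times and pm_times are illustrative. Note that once the tags are stripped, the untagged values (07:00 and 18:00) are matched too, so this may only suit cases where those are wanted or can be dropped afterwards.

```python
import re

# The two plain-text halves as produced by the split above.
AM = "(AM \u2013 07:00 08:00 09:00 10:100)"
PM = "(PM \u2013 18:00 190:00 175:00)"

# One or more digits, a colon, one or more digits.
time_pattern = re.compile(r"\d+:\d+")

am_times = time_pattern.findall(AM)
pm_times = time_pattern.findall(PM)

print(am_times)
print(pm_times)
```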
Upvotes: 3