Reputation: 1

Python re string parsing

I'm trying to use re patterns within Scrapy to parse a string of the format below. I want to retrieve the times inside the font tags (e.g. 08:00). Getting them all in one list is easy enough with (\d+:\d+)+, but I need two separate lists, one for AM and one for PM. Is the only way to do this to create two substrings, AM and PM, and then run the pattern against each of them? The "(AM –" and "(PM –" markers are unique. It feels like this should be possible directly, but I'm out of ideas. Thanks.

example input:

(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)

Upvotes: 0

Answers (2)

Trev Davies

Reputation: 391

If your string is always going to look like the example then you can do this using the following regex:

import re
# Raw string so that \d reaches the regex engine unchanged.
capture = re.compile(r"(?<=>)[\d:]*(?=<)")
res = capture.findall("(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)")
for match in res:
    print(match)

This won't work if you have other types of tags in there though, as it simply matches any run of digits and colons that sits directly between a > and a <, with no spaces.

Result:

08:00
09:00
10:100
190:00
175:00
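The pattern above returns every tagged time in one flat list, so the AM/PM split the question asks for still has to be done separately. A minimal sketch of one way to do it, assuming each group sits in its own parenthesised block that starts with (AM or (PM and contains no nested closing parenthesis: capture each block first, then search inside it. The names block_re, time_re and times are illustrative only.

```python
import re

data = ("(AM – 07:00 <font color=#0002fe>08:00</font> "
        "<font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) "
        "<br> (PM – 18:00 <font color=#0000fe>190:00</font> "
        "<font color=#0000fe>175:00</font>)")

# Capture each parenthesised block together with its AM/PM label.
# Assumption: the block body never contains a ')' of its own.
block_re = re.compile(r"\((AM|PM)\b([^)]*)\)")

# Times that sit directly between '>' and '<', i.e. inside a font tag.
time_re = re.compile(r"(?<=>)\d+:\d+(?=<)")

# Map each label to the list of tagged times found in its block.
times = {label: time_re.findall(body) for label, body in block_re.findall(data)}

am_times = times.get("AM", [])
pm_times = times.get("PM", [])
print(am_times)
print(pm_times)
```

The untagged leading times (07:00 and 18:00) are left out because they are not directly enclosed by a tag.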

Upvotes: 1

alecxe

Reputation: 474221

I would first eliminate the HTML tags and get the plain text to work with. For that, you can use an HTML parser, like BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> data = '(AM – 07:00 <font color=#0002fe>08:00</font> <font color=#0000dd>09:00</font> <font color=#0001fe>10:100</font>) <br> (PM – 18:00 <font color=#0000fe>190:00</font> <font color=#0000fe>175:00</font>)'
>>> soup = BeautifulSoup(data, "html.parser")
>>> data = soup.get_text()
>>> AM, PM = data.split("  ")
>>> AM
u'(AM \u2013 07:00 08:00 09:00 10:100)'
>>> PM
u'(PM \u2013 18:00 190:00 175:00)'
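Once the two plain-text halves are available, a simple pattern applied to each one gives the two lists. This is a sketch that starts from the strings shown in the session above rather than re-running the parser; time_re, am_times and pm_times are illustrative names.

```python
import re

# Plain-text halves as shown above (tags already stripped).
AM = "(AM \u2013 07:00 08:00 09:00 10:100)"
PM = "(PM \u2013 18:00 190:00 175:00)"

# Any run of digits, a colon, and another run of digits.
time_re = re.compile(r"\d+:\d+")

am_times = time_re.findall(AM)
pm_times = time_re.findall(PM)
print(am_times)
print(pm_times)
```

Because the tags are gone at this point, the leading untagged time in each half (07:00 and 18:00) is included too. If only the values that were inside font tags are wanted, and assuming the untagged time always comes first, slicing with am_times[1:] and pm_times[1:] would drop it.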

Upvotes: 3
