anroots

Reputation: 2079

Python regex help needed

I need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.

I can get it when I use

start = '<font color="red">'
end = '</font>'
expression = start + '(.*?)' + end
match = re.compile(expression).search(web_source_code)
needed_info = match.group(1)

but then I have to pick either <font> or <span> to fetch, and the search fails when the site uses the other tag.

How do I modify the regular expression so it would always succeed?

Upvotes: 1

Views: 191

Answers (6)

PaulMcG

Reputation: 63792

Regex and HTML are not a good match: HTML has too many potential variations that will trip up your regex. BeautifulSoup is the standard tool to employ here, but I find pyparsing can be just as effective, and sometimes even simpler to construct when trying to locate a particular tag relative to a particular previous tag.

Here is how to address your question using pyparsing:

html = """ need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
<font color="white">but not this info</font> and 
<span style="font-weight:normal;">dont want this either</span>
"""

from pyparsing import *

font,fontEnd = makeHTMLTags("FONT")
# only match <font> tags with color="red"
font.setParseAction(withAttribute(color="red"))
# only match <span> tags with given style
span,spanEnd = makeHTMLTags("SPAN")
span.setParseAction(withAttribute(style="font-weight:bold;"))

# define full match patterns, define "body" results name for easy access
fontpattern = font + SkipTo(fontEnd)("body") + fontEnd
spanpattern = span + SkipTo(spanEnd)("body") + spanEnd

# now create a single pattern, matching either of the other patterns
searchpattern = fontpattern | spanpattern

# call searchString, and extract body element from each match
for text in searchpattern.searchString(html):
    print(text.body)

Prints:

needed-info-here
needed-info-here

Upvotes: 1

Ed.

Reputation: 1952

expression = '(<font color="red">(.*?)</font>|<span style="font-weight:bold;">(.*?)</span>)'
match = re.compile(expression).search(web_source_code)
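# group 2 is filled when <font> matches, group 3 when <span> matches
needed_info = match.group(2) if match.group(2) is not None else match.group(3)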

This would get the job done, but you shouldn't really be using regex to parse HTML.

Upvotes: 1

Alex Martelli

Reputation: 882751

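You can join two alternatives with a vertical bar, wrapping each pair in a non-capturing group so the alternation does not swallow the rest of the expression:

start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'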

since you know that a font tag will always be closed by </font>, a span tag always by </span>.

However, consider using a solid HTML parser such as BeautifulSoup rather than rolling your own regular expressions; HTML in general is particularly unsuitable for parsing with regular expressions.
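
As a minimal sketch of the alternation approach, assuming the page uses exactly the two opening tags from the question: the helper name extract_info is hypothetical, and the non-capturing groups (?:...) are there so each vertical bar only chooses between the two tags rather than splitting the whole expression.

```python
import re

# Non-capturing groups keep each alternation local to its own part of the pattern.
start = r'(?:<font color="red">|<span style="font-weight:bold;">)'
end = r'(?:</font>|</span>)'
pattern = re.compile(start + r'(.*?)' + end)


def extract_info(source):
    """Return the text between the first matching tag pair, or None if absent."""
    match = pattern.search(source)
    return match.group(1) if match else None


print(extract_info('<font color="red">needed-info-here</font>'))
print(extract_info('<span style="font-weight:bold;">needed-info-here</span>'))
```

Note that this pattern does not check that the closing tag matches the opening one, so it relies on the page being well-formed.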

Upvotes: 3

TCCV

Reputation: 3182

I haven't used Python, but if you set expression to the following, it should work:

/(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close><\/(font|span)>)/gi

Then just access your needed info with the name "info".
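
A rough Python translation of that expression, as a sketch only: Python has no /.../gi delimiters, so this assumes re.IGNORECASE for the "i" flag and re.finditer for the "g" flag. The helper name find_info and the sample string are made up for illustration, and the inner (font|span) groups are made non-capturing here.

```python
import re

# re.IGNORECASE stands in for the "i" flag; finditer stands in for "g".
pattern = re.compile(
    r'(?P<open><(?:font|span)[^>]*>)(?P<info>[^<]+)(?P<close></(?:font|span)>)',
    re.IGNORECASE,
)


def find_info(source):
    """Return the text of every <font> or <span> element found in source."""
    return [m.group('info') for m in pattern.finditer(source)]


sample = '<FONT color="red">first</FONT> and <span style="font-weight:bold;">second</span>'
print(find_info(sample))
```

Unlike the pattern in the question, this one accepts any font or span tag regardless of its attributes.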

PS - I also agree about the "not parsing HTML with regex" rule, but if you know that it will appear in either font or span tags, then so be it...

Also, why use the font tag? I haven't used a font tag since I learned CSS.

Upvotes: 0

Wolph

Reputation: 80111

Regular expressions are not your best choice for parsing HTML, but for the sake of education, here is a possible answer to your question:

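# [^>]* allows any attributes, since <span> carries style=... rather than color=...
start = '<(?P<tag>font|span)[^>]*>'
end = '</(?P=tag)>'
expression = start + '(.*?)' + end
# the named group is group 1, so the needed info is match.group(2)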
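
A runnable sketch of the backreference idea, assuming the opening tag is either font or span with arbitrary attributes: (?P=tag) forces the closing tag to repeat whatever name the opening tag captured. The helper name extract_info and the extra named group "info" are additions for illustration.

```python
import re

# (?P<tag>...) names the opening tag; (?P=tag) requires the same name when closing.
pattern = re.compile(r'<(?P<tag>font|span)[^>]*>(?P<info>.*?)</(?P=tag)>')


def extract_info(source):
    """Return the text of the first font/span element whose tags pair up, else None."""
    match = pattern.search(source)
    return match.group('info') if match else None


print(extract_info('<font color="red">needed-info-here</font>'))
print(extract_info('<span style="font-weight:bold;">needed-info-here</span>'))
print(extract_info('<font color="red">mismatched</span>'))
```

The last call finds no match, because the closing tag name differs from the opening one.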

Upvotes: 1

Katriel

Reputation: 123782

Don't parse HTML with regex.

Regex is not the right tool to use for this problem. Look up BeautifulSoup or lxml.
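
For a feel of the parser-based approach without installing anything, here is a sketch using the standard library's html.parser as a stand-in; it is not BeautifulSoup or lxml, which offer a much friendlier API. The class InfoParser, the helper extract_all, and the WANTED set are hypothetical names, and the tag/attribute pairs are taken from the question.

```python
from html.parser import HTMLParser

# (tag, attribute, value) triples that mark the elements we care about.
WANTED = {("font", "color", "red"), ("span", "style", "font-weight:bold;")}


class InfoParser(HTMLParser):
    """Collect the text inside every element that matches WANTED."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._open_tag = None  # name of the wanted tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None and any((tag, k, v) in WANTED for k, v in attrs):
            self._open_tag = tag
            self.results.append("")

    def handle_data(self, data):
        if self._open_tag is not None:
            self.results[-1] += data

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None


def extract_all(source):
    """Return the text of all wanted elements in document order."""
    parser = InfoParser()
    parser.feed(source)
    parser.close()
    return parser.results


page = ('<font color="red">first</font> <font color="white">skip</font> '
        '<span style="font-weight:bold;">second</span>')
print(extract_all(page))
```

This sketch does not handle a wanted tag nested inside another tag of the same name; a full parser library deals with such cases for you.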

Upvotes: 7
