Reputation: 2079
I need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
I can get it when I use
import re
start = '<font color="red">'
end = '</font>'
expression = start + '(.*?)' + end
match = re.compile(expression).search(web_source_code)
needed_info = match.group(1)
but then I have to choose between <font> and <span>, and the match fails when the site uses the other tag.
How do I modify the regular expression so it would always succeed?
Upvotes: 1
Views: 191
Reputation: 63792
Regex and HTML are not such a good match: HTML has too many potential variations that will trip up your regex. BeautifulSoup is the standard tool to employ here, but I find pyparsing can be just as effective, and sometimes even simpler to construct when trying to locate a particular tag relative to a particular previous tag.
Here is how to address your question using pyparsing:
html = """ need to get info from a website that outputs it between <font color="red">needed-info-here</font> OR <span style="font-weight:bold;">needed-info-here</span>, randomly.
<font color="white">but not this info</font> and
<span style="font-weight:normal;">dont want this either</span>
"""
from pyparsing import makeHTMLTags, withAttribute, SkipTo
font,fontEnd = makeHTMLTags("FONT")
# only match <font> tags with color="red"
font.setParseAction(withAttribute(color="red"))
# only match <span> tags with given style
span,spanEnd = makeHTMLTags("SPAN")
span.setParseAction(withAttribute(style="font-weight:bold;"))
# define full match patterns, define "body" results name for easy access
fontpattern = font + SkipTo(fontEnd)("body") + fontEnd
spanpattern = span + SkipTo(spanEnd)("body") + spanEnd
# now create a single pattern, matching either of the other patterns
searchpattern = fontpattern | spanpattern
# call searchString, and extract body element from each match
for text in searchpattern.searchString(html):
    print(text.body)
Prints:
needed-info-here
needed-info-here
Upvotes: 1
Reputation: 1952
expression = '(<font color="red">(.*?)</font>|<span style="font-weight:bold;">(.*?)</span>)'
match = re.compile(expression).search(web_source_code)
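# group 2 is filled by the <font> branch, group 3 by the <span> branch
needed_info = match.group(2) if match.group(2) is not None else match.group(3)
This would get the job done, but you shouldn't really be using regex to parse HTML.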
Upvotes: 1
Reputation: 882751
You can join two alternatives with a vertical bar:
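start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'
since you know that a font tag will always be closed by </font>, and a span tag always by </span>. The (?:...) wrappers are needed because the vertical bar would otherwise split the whole concatenated expression rather than just the tags.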
However, consider using a solid HTML parser such as BeautifulSoup rather than rolling your own regular expressions: HTML is, in general, particularly unsuitable for parsing with regular expressions.
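For completeness, here is a minimal sketch of the vertical-bar approach as a full script. The sample strings are made up for illustration. It assumes each alternation is wrapped in a non-capturing group, since | binds more loosely than concatenation, so that the wanted text stays in group 1:

```python
import re

# Made-up sample inputs, one per tag variant.
samples = [
    'x <font color="red">first</font> y',
    'x <span style="font-weight:bold;">second</span> y',
]

# (?:...) groups the alternatives without creating a capture group,
# so the text between the tags is still group 1.
start = '(?:<font color="red">|<span style="font-weight:bold;">)'
end = '(?:</font>|</span>)'
pattern = re.compile(start + '(.*?)' + end)

results = []
for source in samples:
    match = pattern.search(source)
    # search() returns None when neither tag is present
    results.append(match.group(1) if match else None)

print(results)
```

Note that this still accepts a <font> opened and closed by </span>; it only relies on the site producing well-formed tags.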
Upvotes: 3
Reputation: 3182
I haven't used Python, but if you set expression to the following, it should work:
/(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close><\/(font|span)>)/gi
Then just access your needed info with the name "info".
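In Python the /.../gi wrapper is not part of the pattern, so a rough translation might look like the sketch below. The sample string is made up, and this is only one way to map the flags: re.IGNORECASE stands in for "i" and finditer() for "g". Be aware that this pattern accepts any font or span tag regardless of its attributes.

```python
import re

# Same named groups as above, written as a Python raw string.
pattern = re.compile(
    r'(?P<open><(font|span)[^>]*>)(?P<info>[^<]+)(?P<close></(font|span)>)',
    re.IGNORECASE,
)

# Made-up sample input with mixed-case tags.
html = ('<FONT color="red">alpha</FONT> and '
        '<span style="font-weight:bold;">beta</span>')

# finditer() walks over every match, like the "g" flag would.
infos = [m.group('info') for m in pattern.finditer(html)]
print(infos)
```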
PS - I also agree about the "not parsing HTML with regex" rule, but if you know that it will appear in either font or span tags, then so be it...
Also, why use the font tag? I haven't used a font tag since I learned CSS.
Upvotes: 0
Reputation: 80111
Regular expressions are not your best choice for parsing HTML, but for the sake of education, here is a possible answer to your question:
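start = '<(?P<tag>font|span) (?:color="red"|style="font-weight:bold;")>'
end = '</(?P=tag)>'
# the tag name is now group 1, so name the wanted text and read match.group('info')
expression = start + '(?P<info>.*?)' + end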
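To see what the (?P=tag) backreference buys you, here is a small sketch. The extract helper and the sample strings are hypothetical, and the attribute alternation is an assumption about what the site emits. The closing tag must repeat whatever name the opening tag captured, so a mismatched pair does not match:

```python
import re

pattern = re.compile(
    r'<(?P<tag>font|span) (?:color="red"|style="font-weight:bold;")>'
    r'(?P<info>.*?)'
    r'</(?P=tag)>'  # must repeat the tag name captured above
)

def extract(source):
    """Return the text between the tags, or None if nothing matches."""
    match = pattern.search(source)
    return match.group('info') if match else None

a = extract('<font color="red">needed-info-here</font>')
b = extract('<span style="font-weight:bold;">needed-info-here</span>')
c = extract('<font color="red">mismatched</span>')
print(a, b, c)
```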
Upvotes: 1
Reputation: 123782
Regex is not the right tool to use for this problem. Look up BeautifulSoup or lxml.
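As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup or lxml, only so that it runs without installing anything. The NeededInfoParser class is a hypothetical helper, and it assumes the wanted tags are not nested inside each other:

```python
from html.parser import HTMLParser

class NeededInfoParser(HTMLParser):
    """Collect text inside <font color="red"> or bold <span> tags."""

    # tag name -> (attribute name, required value)
    TARGETS = {
        "font": ("color", "red"),
        "span": ("style", "font-weight:bold;"),
    }

    def __init__(self):
        super().__init__()
        self.results = []
        self._open_tag = None   # name of the wanted tag we are inside, if any
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None and tag in self.TARGETS:
            name, value = self.TARGETS[tag]
            if dict(attrs).get(name) == value:
                self._open_tag = tag
                self._chunks = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.results.append("".join(self._chunks))
            self._open_tag = None

html = (
    '<font color="white">but not this info</font> '
    '<font color="red">needed-info-here</font> '
    '<span style="font-weight:normal;">dont want this either</span> '
    '<span style="font-weight:bold;">needed-info-here</span>'
)
parser = NeededInfoParser()
parser.feed(html)
parser.close()
print(parser.results)
```

With BeautifulSoup or lxml the same filtering is usually a one-line query, so treat this only as a dependency-free starting point.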
Upvotes: 7