Reputation: 110083
I would like to parse the HD price from the following snippet of HTML. I only have fragments of the HTML code, so I cannot use an HTML parser for this.
<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>
Basically, the goal is to find the price before the words "HD Version" (case insensitive). Here is what I have so far:
re.match(r'^(\d|.){1,6}...HD\sVersion', string)
How would I extract the value "19.99" from the above string?
Upvotes: 1
Views: 119
Reputation: 54163
The current BeautifulSoup answers only show how to grab all <span class="price"> tags. This is better:
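from bs4 import BeautifulSoup
html = """<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>"""
# Parse the fragment first; a plain string has no tag-search methods
soup = BeautifulSoup(html, 'html.parser')
for HD_Version in (tag for tag in soup('li') if tag.text.lower() == 'hd version'):
    # The <li> sits inside the <ul>; the price <span> is an earlier sibling of that <ul>
    price = HD_Version.parent.find_previous_sibling('span', attrs={'class':'price'}).text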
In general, using regular expressions to parse a non-regular language like HTML is asking for trouble. Stick with an established parser.
Upvotes: 2
Reputation: 70722
You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use BeautifulSoup for this.
>>> from bs4 import BeautifulSoup
>>> html = '''
<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>'''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> val = soup.find('span', {'class':'price'}).text
>>> print(val[1:])
19.99
Upvotes: 4
Reputation: 473753
BeautifulSoup is very lenient about the HTML it parses, so you can use it for chunks of HTML too:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
data = u"""
<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>
"""
soup = BeautifulSoup(data, 'html.parser')
print(soup.find('span', class_='price').text[1:])
Prints:
19.99
Upvotes: 4
Reputation: 180391
You can still parse using BeautifulSoup; you don't need the full HTML:
from bs4 import BeautifulSoup
html="""
<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>
"""
soup = BeautifulSoup(html, 'html.parser')
sp = soup.find(attrs={"class":"price"})
print(sp.text[1:])
19.99
Upvotes: 2
Reputation: 11041
You can use this regex:
\d+(?:\.\d+)?(?=\D+HD Version)
\D+ skips over non-digits inside a lookahead, effectively asserting that our match (19.99) is the last number before HD Version.
Use the i modifier in the regex to make the matching case-insensitive, and change + to * if the number can appear directly before HD Version.
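As a rough sketch of how this pattern might be applied from Python, the snippet below runs it over the fragment from the question with `re.search`. The variable names (`html`, `pattern`, `price`) are only illustrative, and `re.IGNORECASE` stands in for the i modifier mentioned above.

```python
import re

# Fragment from the question; the rest of the page is not available.
html = """<div id="left-stack">
<span>View In iTunes</span></a>
<span class="price">£19.99</span>
<ul class="list">
<li>HD Version</li>"""

# A number that is followed only by non-digits up to "HD Version".
# re.IGNORECASE plays the role of the "i" modifier.
pattern = re.compile(r"\d+(?:\.\d+)?(?=\D+HD Version)", re.IGNORECASE)

# re.search scans the whole string; re.match would only try position 0.
match = pattern.search(html)
price = match.group(0) if match else None
print(price)
```

Note that `re.search` is used rather than `re.match`, because the price does not sit at the start of the string. If any digit appears between the price and "HD Version", the lookahead no longer matches at the price, so this only suits fragments shaped like the one above.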
Upvotes: 0