Reputation: 110492

Regex within html tags

I would like to parse the HD price from the following snippet of HTML. I only have fragments of the HTML code, so I cannot use an HTML parser for this.

<div id="left-stack">        
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>

Basically, the task is to find the price that comes before the words "HD Version" (case insensitive). Here is what I have so far:

re.match(r'^(\d|.){1,6}...HD\sVersion', string)

How would I extract the value "19.99" from the above string?

Upvotes: 1

Answers (5)

Adam Smith

Reputation: 54243

The current BeautifulSoup answers only show how to grab the first <span class="price"> tag, whether or not it belongs to the HD version. This anchors on the "HD Version" item instead:

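from bs4 import BeautifulSoup

html = """<div id="left-stack">
 <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>"""

# Parse the fragment first; a plain string has no tag-search methods.
soup = BeautifulSoup(html, 'html.parser')

# For each <li> reading "HD Version" (any case), take the price <span>
# that precedes its parent <ul>.
for hd_version in (tag for tag in soup('li') if tag.text.strip().lower() == 'hd version'):
    price = hd_version.parent.find_previous_sibling('span', attrs={'class': 'price'}).text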

In general, using regular expressions to parse HTML, which is not a regular language, is asking for trouble. Stick with an established parser.
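
If installing BeautifulSoup is not an option, the same idea (walk the tags, remember the last price seen, stop at "HD Version") can be sketched with the standard library's html.parser. This is only a rough sketch: the class name HDPriceFinder and the two-price fragment are made up for illustration, and it assumes each price sits directly inside a <span class="price">.

```python
from html.parser import HTMLParser


class HDPriceFinder(HTMLParser):
    """Remember the latest price text; keep it once 'HD Version' shows up."""

    def __init__(self):
        super().__init__()
        self.in_price = False   # True while inside <span class="price">
        self.last_price = None  # most recent price text seen so far
        self.hd_price = None    # price that preceded the first "HD Version"

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_price and text:
            self.last_price = text
        elif text.lower() == "hd version" and self.hd_price is None:
            self.hd_price = self.last_price


# Hypothetical fragment with two prices, to show that the HD one is picked.
fragment = """
<div id="left-stack">
 <span class="price">£13.99</span>
 <ul class="list"><li>SD Version</li></ul>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>hd version</li>
"""

finder = HDPriceFinder()
finder.feed(fragment)
finder.close()
hd_price = finder.hd_price.lstrip("£")
print(hd_price)  # 19.99
```

The parser never needs the closing tags, so it copes with fragments like the one in the question.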

Upvotes: 2

hwnd

Reputation: 70742

You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use BeautifulSoup for this.

>>> from bs4 import BeautifulSoup
>>> html = '''
<div id="left-stack">        
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>'''
>>> soup = BeautifulSoup(html, 'html.parser')
>>> val = soup.find('span', {'class': 'price'}).text
>>> print(val[1:])  # drop the leading currency symbol
19.99
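
Slicing with val[1:] assumes the currency symbol is exactly one character. A slightly more defensive sketch, assuming the price text always contains one decimal number, pulls the digits out with a small regex instead. The helper name numeric_part and the sample strings are made up for illustration.

```python
import re


def numeric_part(price_text):
    """Return the first decimal number in price_text, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', price_text)
    return match.group(0) if match else None


print(numeric_part('£19.99'))     # 19.99
print(numeric_part('US$ 19.99'))  # 19.99
```

Here the regex only runs on the text of a single tag that the parser has already isolated, not on the HTML itself.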

Upvotes: 4

alecxe

Reputation: 474171

BeautifulSoup is very lenient about the HTML it parses, so you can use it on chunks of HTML too:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(data, 'html.parser')
print(soup.find('span', class_='price').text[1:])

Prints:

19.99

Upvotes: 4

Padraic Cunningham

Reputation: 180512

You can still parse this with BeautifulSoup; you don't need the full HTML:

from bs4 import BeautifulSoup
html="""
<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>
"""

soup = BeautifulSoup(html, 'html.parser')
sp = soup.find(attrs={"class": "price"})
print(sp.text[1:])  # prints 19.99

Upvotes: 2

Unihedron

Reputation: 11051

You can use this regex:

\d+(?:\.\d+)?(?=\D+HD Version)

\D+ in the lookahead skips over non-digit characters, effectively asserting that our match (19.99) is the last number before HD Version.


Use the i modifier (re.IGNORECASE or re.I in Python) to make the matching case-insensitive, and change + to * if the number can come directly before HD Version.
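
For reference, here is roughly how that pattern could be applied in Python against the fragment from the question. Note that it uses re.search rather than re.match, because re.match only tries to match at the very start of the string. The variable names are just for illustration.

```python
import re

html = '''<div id="left-stack">
  <span>View In iTunes</span></a>
 <span class="price">£19.99</span>
 <ul class="list">
    <li>HD Version</li>'''

# A number with optional decimals, followed only by non-digits up to "HD Version".
pattern = re.compile(r'\d+(?:\.\d+)?(?=\D+HD Version)', re.IGNORECASE)

match = pattern.search(html)
price = match.group(0) if match else None
print(price)  # 19.99
```

\D also matches newlines, so no extra flag is needed for the lookahead to span several lines.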

Upvotes: 0
