Reputation: 1728
I'm looking to use a regex to extract a quantity from a shopping website's HTML. In the following example, I want to get "12.5 kilograms". However, the quantity within the first span is not always listed in kilograms; it could be lbs., oz., etc.
<td class="size-price last first" colspan="4">
<span>12.5 kilograms </span>
<span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
</span>
</td>
The code above is only a small portion of what is actually extracted using BeautifulSoup. Whatever the page is, the quantity is always within a span and is on a new line after
<td class="size-price last first" colspan="4">
I've used regex in the past, but I am far from an expert. I'd like to know how to match content that spans several lines, in this case between
<td class="size-price last first" colspan="4">
and
<span> <span class="strike">
Upvotes: 1
Views: 71
Reputation: 473803
Avoid parsing HTML with regex. Use the right tool for the job, an HTML parser such as BeautifulSoup: it is powerful, easy to use, and it handles your case well:
from bs4 import BeautifulSoup
data = """
<td class="size-price last first" colspan="4">
<span>12.5 kilograms </span>
<span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
</span>
</td>"""
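# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(data, "html.parser")
# Text of the first <span> inside the first <td>.
print(soup.td.span.text)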
prints:
12.5 kilograms
Or, if the td is part of a bigger structure, find it by class and get the first span's text out of it:
print(soup.find('td', {'class': 'size-price'}).span.text)
UPD (handling multiple results):
print([td.span.text for td in soup.find_all('td', {'class': 'size-price'})])
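Once the parser has pulled out the span text, a small regex is a reasonable way to split it into a number and a unit, since at that point you are matching plain text rather than HTML. The following is only a sketch: `parse_quantity` and `QUANTITY_RE` are hypothetical names, not part of BeautifulSoup, and the sketch assumes the text always starts with a plain decimal number followed by the unit.

```python
import re

# Hypothetical helper (not from BeautifulSoup): split text such as
# "12.5 kilograms " into a numeric value and a unit string.
# Assumption: the text starts with a plain decimal number.
QUANTITY_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(.+?)\s*$")


def parse_quantity(text):
    """Return (value, unit), or None if the text does not start with a number."""
    match = QUANTITY_RE.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


# Strings shaped like what td.span.text returns for the sample markup.
for raw in ["12.5 kilograms ", "8 oz.", " 3 lbs. "]:
    print(parse_quantity(raw))
```

To combine this with the list comprehension above, you could call `parse_quantity(td.span.text)` for each matching td and skip the entries that come back as None.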
Hope that helps.
Upvotes: 1