Reputation: 589
I want to scrape some prices out of a bunch of HTML tables. The tables contain all sorts of prices, and of course the table data tags don't carry any useful attributes.
<div id="item-price-data">
<table>
<tbody>
<tr>
<td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td>
</tr>
<tr>
<td class="some-class">Member Price:</td>
<td class="another-class">$90.00</td>
</tr>
<tr>
<td class="some-class">Sale Price:</td>
<td class="another-class">$80.00</td>
</tr>
<tr>
<td class="some-class">You save:</td>
<td class="another-class">$20.00</td>
</tr>
</tbody>
</table>
</div>
The only prices that I care about are those that are paired with an element that has "Normal Price" as its text.
What I'd like to be able to do is scan the table's descendants, find the <td> tag that has that text, then pull the text from its sibling.
The problem I'm having is that in BeautifulSoup the descendants attribute returns a list of NavigableString, not Tag.
So if I do this:
from bs4 import BeautifulSoup
from urllib import request
html = request.urlopen(url)
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', {'id': 'item-price-data'})
table_data = div.find_all('td')
for element in table_data:
    if element.get_text() == 'Normal Price:':
        price = element.next_sibling
        print(price)
I get nothing. Is there an easy way to get the string value back?
Upvotes: 3
Views: 1163
Reputation: 61225
You can use the find_next() method; you may also need a bit of regex:
Demo:
>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """<div id="item-price-data">
... <table>
... <tbody>
... <tr>
... <td class="some-class">Normal Price:</td>
... <td class="another-class">$100.00</td>
... </tr>
... <tr>
... <td class="some-class">Member Price:</td>
... <td class="another-class">$90.00</td>
... </tr>
... <tr>
... <td class="some-class">Sale Price:</td>
... <td class="another-class">$80.00</td>
... </tr>
... <tr>
... <td class="some-class">You save:</td>
... <td class="another-class">$20.00</td>
... </tr>
... </tbody>
... </table>
... </div>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> div = soup.find('div', {'id': 'item-price-data'})
>>> for element in div.find_all('td', text=re.compile('Normal Price')):
...     price = element.find_next('td')
...     print(price)
...
<td class="another-class">$100.00</td>
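The loop above prints the whole tag. Since the question asks for the string value, here is a minimal sketch that calls get_text(strip=True) on the matched cell. It assumes the built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies, uses string= (the name that newer Beautiful Soup releases use for the text= argument), and trims the sample table to two rows; the names label, price_cell and prices are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = """<div id="item-price-data">
<table><tbody>
<tr><td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td></tr>
<tr><td class="some-class">Sale Price:</td>
<td class="another-class">$80.00</td></tr>
</tbody></table>
</div>"""

# Assumption: html.parser is used here only to avoid needing lxml.
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "item-price-data"})

prices = []
for label in div.find_all("td", string=re.compile("Normal Price")):
    # find_next('td') returns the next <td> in document order.
    price_cell = label.find_next("td")
    # get_text(strip=True) returns the cell text without surrounding whitespace.
    prices.append(price_cell.get_text(strip=True))

print(prices)
```

With the sample table this should leave a single entry, the price next to "Normal Price:".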
If you don't want to bring regex into this then the following will work for you.
>>> table_data = div.find_all('td')
>>> for element in table_data:
...     if 'Normal Price' in element.get_text():
...         price = element.find_next('td')
...         print(price)
...
<td class="another-class">$100.00</td>
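As for why the original loop printed what looked like nothing: next_sibling returns the very next node, and in this markup that is most likely the newline between the two <td> tags, not the price cell. A small sketch of the difference, again assuming 'html.parser' and a one-row version of the table, with illustrative variable names:

```python
from bs4 import BeautifulSoup

# One-row version of the table; note the newline between the two cells.
html = """<table><tr>
<td class="some-class">Normal Price:</td>
<td class="another-class">$100.00</td>
</tr></table>"""

# Assumption: html.parser is used here only to avoid needing lxml.
soup = BeautifulSoup(html, "html.parser")
label = soup.find("td", string="Normal Price:")

# next_sibling is the whitespace text node that follows the label cell.
whitespace = label.next_sibling

# find_next_sibling('td') skips text nodes and returns the next <td> tag.
price_cell = label.find_next_sibling("td")
price = price_cell.get_text(strip=True)

print(repr(whitespace), price)
```

find_next_sibling('td') stays within the same row, whereas find_next('td') follows document order and would move on to a later row if the label were the last cell in its own.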
Upvotes: 1