Reputation: 1712
I am using BeautifulSoup to parse HTML files. I have an HTML file similar to this:
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup - I only know how to extract a table with a specific class, or a table nested within some particular tag, but not following a particular tag.
I think a fallback solution could use the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better going for the specific h3 heading.
Upvotes: 4
Views: 2107
Reputation: 26
I am sure there are many ways to do this more efficiently, but here is what I can think of right now:
from bs4 import BeautifulSoup

# Read the HTML file (adjust the path to your own file)
with open("/Users/Downloads/train.html", "r") as f:
    html_data = f.read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
# Print the first <td> that comes after the "Key C" cell, then stop
found_key = False
for td in all_td:
    if found_key:
        print(td.text)
        break
    if td.text == 'Key C':
        found_key = True
Output:
I WANT THIS STRING
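The same "Key C" fallback can also be sketched without the flag loop, by finding the key cell directly and asking for the next td after it. This is only a sketch: it assumes "Key C" is the exact text of one cell, and the inline html string is a trimmed stand-in for the real file.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the HTML in the question (assumption: the real
# file has the same key/value layout, one <td> per row).
html = """
<h3>Unimportant heading</h3>
<table class="foo">
<tr><td>Key B</td></tr>
<tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr><td>Key C</td></tr>
<tr><td>I WANT THIS STRING</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the cell whose text is exactly "Key C".
key_cell = soup.find("td", string="Key C")

# find_next returns the first matching tag after this one in document order,
# which here is the value cell in the following row.
value = key_cell.find_next("td").get_text(strip=True)
print(value)
```

If "Key C" might be missing, check that key_cell is not None before calling find_next on it.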
Upvotes: 0
Reputation: 2600
Following the logic of @Zroq's answer on another question, this code will give you the table following your defined header ("THE GOOD STUFF"). Note that I put all of your HTML in a variable called "html".
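import re
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, "lxml")
# string= is the current name for the older text= argument
for header in soup.find_all('h3', string=re.compile('THE GOOD STUFF')):
    nextNode = header
    while True:
        # Walk forward through the siblings that follow the matching header
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            # Stop at the next heading; print every tag before it
            if nextNode.name == "h3":
                break
            print(nextNode)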
Output:
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
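If only the first table after the heading is needed, a shorter variant is possible with find_next. The following is a minimal sketch, not a replacement for the loop above: it uses a trimmed copy of the HTML and assumes the wanted value is always the last cell of that table.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, held in a string.
html = """
<h3>Unimportant heading</h3>
<table class="foo">
<tr><td>Key B</td></tr>
<tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr><td>Key C</td></tr>
<tr><td>I WANT THIS STRING</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its text, then take the first <table> after it.
header = soup.find("h3", string=re.compile("THE GOOD STUFF"))
table = header.find_next("table")

# Assumption: the wanted value is the last cell of that table.
value = table.find_all("td")[-1].get_text(strip=True)
print(value)
```

Unlike the sibling loop, find_next does not stop at the next h3, so if the heading had no table of its own this would return a later section's table. Where that matters, header.find_next_sibling("table") or the loop above is the safer choice.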
Cheers!
Upvotes: 3