Alexander Engelhardt

Reputation: 1712

How to extract HTML table following a specific heading?

I am using BeautifulSoup to parse HTML files. I have an HTML file similar to this:

<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key A</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>


<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key B</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>


<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr>
    <td>Key C</td>
  </tr>
  <tr>
    <td>I WANT THIS STRING</td>
  </tr>
</table>


<h3>Unimportant heading</h3>
<table class="foo">
  <tr>
    <td>Key A</td>
  </tr>
  <tr>
    <td>A value I don't want</td>
  </tr>
</table>

I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup - I only know how to extract a table with a specific class, or a table nested within some particular tag, but not following a particular tag.

I think a fallback solution could make use of the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better going for the specific h3 heading.

Upvotes: 4

Views: 2107

Answers (3)

Leo_28

Reputation: 26

I am sure there are many ways to do this more efficiently, but here is what I can think of right now:

from bs4 import BeautifulSoup

# Read the file; the with-block closes it afterwards
with open("/Users/Downloads/train.html", "r") as f:
    html_data = f.read()

soup = BeautifulSoup(html_data, "html.parser")

# Print the first <td> that comes after the cell containing 'Key C'
found_key = False
for td in soup.find_all("td"):
    if found_key:
        print(td.text)
        break
    if td.text.strip() == "Key C":
        found_key = True

Output:

I WANT THIS STRING
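
The same "Key C" fallback can probably be written without the flag. This is only a sketch: the helper name value_after_key and the trimmed-down sample HTML are made up for illustration, and it assumes the key sits alone in its cell, as in the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the question's HTML (assumption: same structure).
html_data = """
<h3>Unimportant heading</h3>
<table class="foo">
  <tr><td>Key B</td></tr>
  <tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr><td>Key C</td></tr>
  <tr><td>I WANT THIS STRING</td></tr>
</table>
"""


def value_after_key(html, key):
    """Return the text of the first <td> after the <td> whose text is `key`."""
    soup = BeautifulSoup(html, "html.parser")
    # string= matches a tag whose only child is exactly this string
    key_cell = soup.find("td", string=key)
    if key_cell is None:
        return None
    # find_next walks forward in document order, so it crosses the </tr>
    value_cell = key_cell.find_next("td")
    return value_cell.get_text(strip=True) if value_cell else None


print(value_after_key(html_data, "Key C"))
```

Like the loop above, this relies on "Key C" being unique in the document.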

Upvotes: 0

PythonSherpa

Reputation: 2600

Following the logic of @Zroq's answer on another question, this code will give you the table following your defined header ("THE GOOD STUFF"). Please note I just put all your html in the variable called "html".

import re

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, "lxml")

# string= is the current name for the older text= argument
for header in soup.find_all('h3', string=re.compile('THE GOOD STUFF')):
    nextNode = header
    while True:
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            # Stop at the next heading; print every other tag in between
            if nextNode.name == "h3":
                break
            print(nextNode)

Output:

<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
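
If only the first table after the heading is needed, find_next might be a shorter route than walking the siblings by hand. A minimal sketch, assuming the heading text matches exactly; the function name table_after_heading and the sample markup are placeholders, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: same layout).
html = """
<h3>Unimportant heading</h3>
<table class="foo">
  <tr><td>Key A</td></tr>
  <tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr><td>Key C</td></tr>
  <tr><td>I WANT THIS STRING</td></tr>
</table>
"""


def table_after_heading(html, heading_text):
    """Return the first <table> that follows the <h3> with this exact text."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h3", string=heading_text)
    if heading is None:
        return None
    # First <table> in document order after the heading
    return heading.find_next("table")


table = table_after_heading(html, "THE GOOD STUFF")
cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells[-1])
```

One caveat: find_next does not stop at the next h3, so if the heading had no table of its own, this would return the table of a later section. The loop above avoids that by breaking at the next heading.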

Cheers!

Upvotes: 3

J_H

Reputation: 20450

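The docs explain that if you don't want to use find_all, you can iterate over an element's next_siblings. Their example starts from the first a tag; for this question you would start from the h3 heading instead: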

for sibling in soup.a.next_siblings:
    print(repr(sibling))
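
Applied to the question, that idea could look roughly like the following. It is a sketch rather than anything taken from the docs: first_sibling_table and the sample markup are invented for illustration, and it assumes the heading and its table are siblings, as in the posted HTML.

```python
from bs4 import BeautifulSoup, Tag

# Cut-down version of the question's HTML (assumption: h3 and table are siblings).
sample_html = """
<h3>Unimportant heading</h3>
<table class="foo">
  <tr><td>Key B</td></tr>
  <tr><td>A value I don't want</td></tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
  <tr><td>Key C</td></tr>
  <tr><td>I WANT THIS STRING</td></tr>
</table>
"""


def first_sibling_table(html, heading_text):
    """Return the first sibling <table> after the heading, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h3", string=heading_text)
    if heading is None:
        return None
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip the whitespace strings between tags
        if sibling.name == "h3":
            return None  # reached the next section without seeing a table
        if sibling.name == "table":
            return sibling
    return None


table = first_sibling_table(sample_html, "THE GOOD STUFF")
print(table.find_all("td")[-1].get_text(strip=True))
```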

Upvotes: 0
