th3qu33n

Reputation: 117

Use BeautifulSoup to extract text under specific header

How do I extract all the text below a specific header? In this case, I need to extract the text under Topic 2. EDIT: On other webpages, "Topic 2" sometimes appears as the third heading, or the first. "Topic 2" isn't always in the same place, and it doesn't always have the same id number.

# import library
from bs4 import BeautifulSoup

# dummy webpage text
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<p> This is the second sentence.</p>
<p> This is the third sentence.</p>

<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>

<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
<p> This is the seventh sentence.</p>
<p> This is the eighth sentence.</p>
'''

# convert text to soup 
soup = BeautifulSoup(body, 'lxml')

If I extract only the text under "Topic 2", this is the output I expect:

This is the fourth sentence. This is the fifth sentence.

My attempts to solve this problem:

I tried soup.select('h2 + p'), but this only got me the first sentences under each header.

[<p> This is the first sentence.</p>,
 <p> This is the fourth sentence.</p>,
 <p> This is the sixth sentence.</p>]

I also tried this, but it gave me all the text, when I only need text under Topic 2:

import pandas as pd 

lst = []
for row in soup.find_all('p'):
    text_dict = {}
    text_dict['text'] = row.text
    lst.append(text_dict)

df = pd.DataFrame(lst) 

df

|   | text                          |
|---|-------------------------------|
| 0 | This is the first sentence.   |
| 1 | This is the second sentence.  |
| 2 | This is the third sentence.   |
| 3 | This is the fourth sentence.  |
| 4 | This is the fifth sentence.   |
| 5 | This is the sixth sentence.   |
| 6 | This is the seventh sentence. |
| 7 | This is the eighth sentence.  |

Upvotes: 5

Views: 7835

Answers (3)

Jack Fleeting

Reputation: 24940

Try:

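# Find the header by its text, wherever it appears on the page
target = soup.find('h2', string='Topic 2')
for sib in target.find_next_siblings():
    # Stop at the next header: everything after it belongs to another topic
    if sib.name == 'h2':
        break
    print(sib.text)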

Output (from your HTML above):

 This is the fourth sentence.
 This is the fifth sentence.
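
Since the question says "Topic 2" can appear anywhere on the page and its id changes, the same loop can be wrapped in a small helper that looks the header up by its visible text. This is only a sketch, assuming the header and its paragraphs are siblings as in the sample HTML. The name `text_under_header` is made up for illustration, and `html.parser` is used here just so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup


def text_under_header(soup, header_text, header_tag="h2"):
    """Return the text of the siblings between a header and the next header."""
    # Match on the visible text so position and id do not matter
    header = soup.find(
        lambda tag: tag.name == header_tag
        and tag.get_text(strip=True) == header_text
    )
    if header is None:
        return ""  # header not present on this page

    parts = []
    for sib in header.find_next_siblings():
        if sib.name == header_tag:
            break  # reached the next topic
        parts.append(sib.get_text(strip=True))
    return " ".join(parts)


body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
'''

soup = BeautifulSoup(body, "html.parser")
result = text_under_header(soup, "Topic 2")
print(result)
```

Returning an empty string when the header is missing avoids an `AttributeError` on pages that have no "Topic 2" at all.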

Upvotes: 4

Shahad

Reputation: 25

A different approach, for when the source is a PDF rather than HTML:

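import re

import pdfplumber

pdfToString = ""

# Concatenate the text of every page into one string
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        page_text = page.extract_text() or ""  # guard against pages with no text
        print(page_text)
        pdfToString += page_text

# Each match is a numbered heading (e.g. "1.2 Title") plus the lines up to the next one
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for i in matches:
    # Keep only the section whose heading contains the wanted text
    if "word_to_extract" in i[:50]:
        print(i)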

This solution matches every heading that follows the same numbered format, together with the paragraphs after it, and then prints the required heading and the paragraphs that follow it.

Upvotes: 0

pbuck

Reputation: 4561

The problem is that you think the text is under the header. Technically, the text nodes are siblings of the headers, so the only way to get them is the more sequential process of iterating through siblings:

  1. find a header
  2. find everything not a header & extract text
  3. find another header (or EOF) and stop.

More like:

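h2 = soup.find('h2', id='2')
for sibling in h2.next_siblings:
    # Stop at the first sibling that is neither a string node (name is None) nor a <p>
    if sibling.name not in (None, 'p'):
        break
    # ... do what you like with the node (a <p> or a whitespace string)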

(Note that the BeautifulSoup sibling right after an <h2> is usually a string element, typically a newline, whose name is None, so make sure you handle or ignore it properly.)
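
To make the string-node point concrete, here is a rough sketch that skips those whitespace strings explicitly and keeps only the `<p>` text. It assumes the same flat layout as the question's sample. The header is looked up by its text instead of by id because the question says the id varies, and `html.parser` is used only to avoid requiring lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
'''

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2", string="Topic 2")

paragraphs = []
for sibling in h2.next_siblings:
    # The newlines between tags show up as NavigableString siblings; skip them
    if isinstance(sibling, NavigableString):
        continue
    # Any tag other than <p> (here, the next <h2>) ends the section
    if sibling.name != "p":
        break
    paragraphs.append(sibling.get_text(strip=True))

print(paragraphs)
```

If a section can also contain lists or other tags, the `!= "p"` check would need to be relaxed to stop only at header tags.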

Upvotes: 0
