Reputation: 117
How do I extract all the text below a specific header? In this case, I need to extract the text under Topic 2.
EDIT: On other webpages, "Topic 2" sometimes appears as the third heading, or the first. "Topic 2" isn't always in the same place, and it doesn't always have the same id number.
# import library
from bs4 import BeautifulSoup
# dummy webpage text
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<p> This is the second sentence.</p>
<p> This is the third sentence.</p>
<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
<p> This is the seventh sentence.</p>
<p> This is the eighth sentence.</p>
'''
# convert text to soup
soup = BeautifulSoup(body, 'lxml')
If I extract only the text under Topic 2, this is the output I expect:
This is the fourth sentence. This is the fifth sentence.
My attempts to solve this problem:
I tried soup.select('h2 + p'), but this only got me the first paragraph under each header:
[<p> This is the first sentence.</p>,
<p> This is the fourth sentence.</p>,
<p> This is the sixth sentence.</p>]
I also tried this, but it gave me all the text, when I only need the text under Topic 2:
import pandas as pd
lst = []
for row in soup.find_all('p'):
    text_dict = {}
    text_dict['text'] = row.text
    lst.append(text_dict)
df = pd.DataFrame(lst)
df
| | text |
|---|-------------------------------|
| 0 | This is the first sentence. |
| 1 | This is the second sentence. |
| 2 | This is the third sentence. |
| 3 | This is the fourth sentence. |
| 4 | This is the fifth sentence. |
| 5 | This is the sixth sentence. |
| 6 | This is the seventh sentence. |
| 7 | This is the eighth sentence. |
Upvotes: 5
Views: 7835
Reputation: 24940
Try:
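# find the heading by its text, not by its position or id
target = soup.find('h2', string='Topic 2')
for sib in target.find_next_siblings():
    # stop as soon as the next heading starts
    if sib.name == "h2":
        break
    else:
        print(sib.text)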
Output (from your html above):
This is the fourth sentence.
This is the fifth sentence.
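If you need the text as a value rather than printed output, the same sibling walk can be wrapped in a function. This is only a minimal sketch: text_under_heading is a hypothetical helper name, the sample HTML is shortened, and it uses html.parser instead of lxml so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup


def text_under_heading(soup, heading_text, tag="h2"):
    """Return the text of each sibling element between the heading whose
    text equals heading_text and the next heading of the same tag.
    (Hypothetical helper, not part of BeautifulSoup.)"""
    target = soup.find(tag, string=heading_text)
    if target is None:
        # heading not present on this page
        return []
    texts = []
    for sib in target.find_next_siblings():
        if sib.name == tag:
            break  # reached the next heading
        texts.append(sib.get_text(strip=True))
    return texts


# shortened sample page; the id of "Topic 2" is deliberately different
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<h2 id="7">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
'''
soup = BeautifulSoup(body, "html.parser")
print(" ".join(text_under_heading(soup, "Topic 2")))
```

Because the heading is looked up by its text, this should not depend on where "Topic 2" sits in the page or which id it carries. A heading that is missing gives an empty list instead of an AttributeError.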
Upvotes: 4
Reputation: 25
A different approach, using pdfplumber and a regular expression on text extracted from a PDF:
import pdfplumber
import re
pdfToString = ""
# read the text of every page into one string
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
        pdfToString += page.extract_text()
# each match is a numbered heading plus the lines that follow it
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for i in matches:
    # keep only the section whose heading contains the wanted text
    if "word_to_extract" in i[:50]:
        print(i)
This solution finds every heading that follows the same numbered format (for example "1.2 Title"), then prints the required heading together with the paragraphs that follow it.
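To see what the regular expression does without needing a PDF, here is a small sketch that runs it on a made-up string. The sample text and the search word "Scope" are assumptions for illustration only.

```python
import re

# made-up text standing in for what was extracted from a PDF
text = (
    "1 Introduction\n"
    "Intro line one.\n"
    "Intro line two.\n"
    "1.1 Scope\n"
    "Scope text here.\n"
    "2 Methods\n"
    "Method text."
)

# a numbered heading line, then every following line that does not
# itself start with a heading number followed by a space
pattern = r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*'
sections = re.findall(pattern, text, re.M)

# keep the sections whose first 50 characters mention the wanted word
wanted = [s for s in sections if "Scope" in s[:50]]
for section in wanted:
    print(section)
```

Note that any body line that begins with a number and a space would be treated as a new heading, so this is likely to work only when headings are the only lines in that format.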
Upvotes: 0
Reputation: 4561
The problem is that you think the text is under the header. Technically, the paragraphs are siblings of the headers, so the only way to get them is the more sequential process of iterating through the siblings.
Something like:
h2 = soup.find('h2', id='2')
for sibling in h2.next_siblings:
    if sibling.name is None:
        continue  # skip string nodes such as newlines
    if sibling.name != 'p':
        break  # stop at the first tag that is not a <p>, e.g. the next <h2>
    # ... do what you like with the <p> node
(Note that a BeautifulSoup sibling of an <h2> is often a string element, usually a newline, with name == None, so make sure you handle or ignore it properly.)
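As a runnable illustration of that note, the sketch below walks next_siblings and tells string nodes from tags with isinstance instead of checking name. The shortened HTML and the use of html.parser are assumptions made to keep the example self-contained.

```python
from bs4 import BeautifulSoup, Tag

# shortened sample; the newlines between tags become string siblings
body = (
    '<h2 id="2">Topic 2</h2>\n'
    '<p>Fourth.</p>\n'
    '<p>Fifth.</p>\n'
    '<h2 id="3">Topic 3</h2>\n'
    '<p>Sixth.</p>'
)
soup = BeautifulSoup(body, "html.parser")
h2 = soup.find("h2", id="2")

paragraphs = []
for sibling in h2.next_siblings:
    if not isinstance(sibling, Tag):
        continue  # a string node (here a newline), not an element
    if sibling.name != "p":
        break  # the next heading ends the section
    paragraphs.append(sibling.get_text(strip=True))

print(paragraphs)
```

If the id varies between pages, as the question's edit says, the lookup line could use string='Topic 2' instead of id='2'.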
Upvotes: 0