Reputation: 61
I am trying to get all the <p> elements that come after an <h2>.
I know how to do this when there is only one <p> after the <h2>, but not when there are multiple <p> elements.
Here's an example of the webpage:
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
....
I need to get all paragraphs in relation to their headings, e.g. Paragraphs 1 and 2 that are related to Heading Text1.
I'm trying to do this using BeautifulSoup with Python. I have been trying for days and searching online without success.
How can this be done?
Upvotes: 1
Views: 993
Reputation: 154
This is almost identical to the question posed yesterday. You can solve this in a few different ways. Here is how I would do it:
from bs4 import BeautifulSoup
html = """
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
<h2>Heading Text3</h2>
<p>Paragraph6</p>
<p>Paragraph7</p>
<p>Paragraph8</p>
<p>Paragraph9</p>
"""
soup = BeautifulSoup(html, "html.parser")
data_dict = {}
for item in soup.select("h2"):
    header = item.get_text(strip=True)
    content = []
    # walk the siblings after this heading until the next <h2>
    for i in item.next_siblings:
        if i.name == "h2":
            break
        content.extend([x for x in i.stripped_strings])
    data_dict[header] = content
print(data_dict)
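A slightly stricter variant of the same idea is sketched below. It is not the code above, just one possible rewrite: it asks `find_next_siblings` for only `<h2>` and `<p>` tags, so whitespace strings and other elements between the paragraphs are ignored. The function name `paragraphs_by_heading` is made up for this illustration, and it assumes the headings and paragraphs are siblings under the same parent, as in the question's example.

```python
from bs4 import BeautifulSoup


def paragraphs_by_heading(html):
    """Map each <h2> text to the list of <p> texts that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for h2 in soup.find_all("h2"):
        paragraphs = []
        # Only <h2> and <p> siblings are returned, in document order.
        for sibling in h2.find_next_siblings(["h2", "p"]):
            if sibling.name == "h2":
                break  # the next heading starts a new section
            paragraphs.append(sibling.get_text(strip=True))
        result[h2.get_text(strip=True)] = paragraphs
    return result


if __name__ == "__main__":
    sample = "<h2>A</h2><p>1</p><p>2</p><h2>B</h2><p>3</p>"
    print(paragraphs_by_heading(sample))
```

Unlike the loop above, this version collects text from `<p>` tags only, so text in other sibling elements (lists, divs) would be skipped.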
Upvotes: 3
Reputation: 25073
You could reach your goal by working with a dict and .find_previous(): iterate over all <p> elements, find each one's previous <h2> and use its text as a key in your dict, then simply append the paragraph's text to that key's list:
d = {}
for p in soup.select('p'):
    if p.find_previous('h2'):
        if d.get(p.find_previous('h2').text) is None:
            d[p.find_previous('h2').text] = []
    else:
        # paragraph appears before any <h2>, skip it
        continue
    d[p.find_previous('h2').text].append(p.text)
from bs4 import BeautifulSoup
html = '''
<p>Any Other Paragraph</p>
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''
soup = BeautifulSoup(html, 'html.parser')
d = {}
for p in soup.select('p'):
    if p.find_previous('h2'):
        if d.get(p.find_previous('h2').text) is None:
            d[p.find_previous('h2').text] = []
    else:
        # paragraph appears before any <h2>, skip it
        continue
    d[p.find_previous('h2').text].append(p.text)
d
{'Heading Text1': ['Paragraph1', 'Paragraph2'],
'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
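The same grouping can be written a little more compactly. The sketch below is an assumed rewrite, not the answer's original code: it calls `find_previous` once per paragraph and uses `dict.setdefault` to create the list on first use. The function name `group_by_previous_h2` is hypothetical.

```python
from bs4 import BeautifulSoup


def group_by_previous_h2(html):
    """Group <p> texts under the text of the nearest preceding <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for p in soup.find_all("p"):
        h2 = p.find_previous("h2")
        if h2 is None:
            continue  # paragraph appears before any heading
        key = h2.get_text(strip=True)
        grouped.setdefault(key, []).append(p.get_text(strip=True))
    return grouped


if __name__ == "__main__":
    sample = "<p>Intro</p><h2>A</h2><p>1</p><h2>B</h2><p>2</p><p>3</p>"
    print(group_by_previous_h2(sample))
```

Because `find_previous` walks backwards through the whole document rather than only through siblings, this should also work when the paragraphs are nested inside other elements.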
Upvotes: 3
Reputation: 1146
This is how I would do it: get all the h2 and p tags and iterate through them, remembering the text of the last h2 tag seen and attaching the paragraphs that follow to it.
from bs4 import BeautifulSoup
html = '''
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''
soup = BeautifulSoup(html, 'html.parser')
dict_to_save = {}
header = None  # no heading seen yet
# find all the 'h2' and 'p' tags
for tag in soup(['h2', 'p']):
    # if 'h2' tag save it into a variable named header
    if tag.name == 'h2':
        header = tag.text.strip()
    # otherwise add this paragraph to the last header (skip paragraphs before the first 'h2')
    elif header is not None:
        dict_to_save[header] = dict_to_save.get(header, []) + [tag.text.strip()]
print(dict_to_save)
{'Heading Text1': ['Paragraph1', 'Paragraph2'],
'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
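One caveat with any dict-based result is that two headings with identical text end up under the same key. If that could happen on your page, a list of (heading, paragraphs) pairs keeps every section separate and in document order. The following is only a sketch under that assumption; `sections_in_order` is a made-up name.

```python
from bs4 import BeautifulSoup


def sections_in_order(html):
    """Return [(heading_text, [paragraph_text, ...]), ...] in document order."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for tag in soup.find_all(["h2", "p"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            sections.append((text, []))  # start a new section
        elif sections:
            sections[-1][1].append(text)  # attach to the latest heading
        # paragraphs before the first heading are ignored
    return sections


if __name__ == "__main__":
    sample = "<h2>A</h2><p>1</p><h2>A</h2><p>2</p>"
    print(sections_in_order(sample))
```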
Upvotes: 1