Reputation: 6063
I have an HTML document in which some tagged text follows each of several titles. Something like this:
<h1>Title 1</h1>
<p>Some text</p>
<p>Some other <b>text</b></p>
<h1>Title 2</h1>
<p>Some <b>text</b></p>
<p>Some text2</p>
<h1>Title 3</h1>
<p>Some text</p>
<p>Some other <i>text</i></p>
(The only fixed thing is the number of titles, the rest can change)
How can I use BeautifulSoup to extract all the HTML that follows each title, up to (but not including) the next title?
Upvotes: 1
Views: 461
Reputation: 473763
You can pass a regular expression, Title \d+, as the text argument to find all the titles, then use find_next_siblings() to get the next two p tags after each one:
import re
from bs4 import BeautifulSoup
data = """
<div>
<h1>Title 1</h1>
<p>Some text</p>
<p>Some other <b>text</b></p>
<h1>Title 2</h1>
<p>Some <b>text</b></p>
<p>Some text2</p>
<h1>Title 3</h1>
<p>Some text</p>
<p>Some other <i>text</i></p>
</div>
"""
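# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')
# Find every h1 whose text matches the pattern, then take its next two p siblings
for h1 in soup.find_all('h1', text=re.compile(r'Title \d+')):
    for p in h1.find_next_siblings('p', limit=2):
        print(p.text.strip())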
prints:
Some text
Some other text
Some text
Some text2
Some text
Some other text
Or, using a list comprehension:
print([p.text.strip()
       for h1 in soup.find_all('h1', text=re.compile(r'Title \d+'))
       for p in h1.find_next_siblings('p', limit=2)])
prints:
['Some text', 'Some other text', 'Some text', 'Some text2', 'Some text', 'Some other text']
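Note that limit=2 assumes exactly two p tags under every title and returns only their text. If the number or kind of elements under each title varies and you want the HTML itself, one possible approach is to walk next_siblings from each h1 and stop at the next h1. This is only a sketch: the helper name sections_by_title is made up for illustration, and the sample HTML is altered from the question (one p under Title 2, an extra ul under Title 3) to show the variable case.

```python
from bs4 import BeautifulSoup, Tag

# Sample data altered from the question for illustration (assumption, not the asker's real input)
html = """
<div>
<h1>Title 1</h1>
<p>Some text</p>
<p>Some other <b>text</b></p>
<h1>Title 2</h1>
<p>Some <b>text</b></p>
<h1>Title 3</h1>
<p>Some text</p>
<ul><li>An item</li></ul>
<p>Some other <i>text</i></p>
</div>
"""


def sections_by_title(soup):
    """Map each h1's text to the HTML of the tags between it and the next h1."""
    sections = {}
    for h1 in soup.find_all("h1"):
        chunks = []
        for sibling in h1.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # bare text nodes (only whitespace in this sample) are skipped
            if sibling.name == "h1":
                break  # reached the next title
            chunks.append(str(sibling))  # str() keeps the inner markup such as <b>
        sections[h1.get_text(strip=True)] = chunks
    return sections


soup = BeautifulSoup(html, "html.parser")
sections = sections_by_title(soup)
for title, chunks in sections.items():
    print(title)
    for chunk in chunks:
        print("  " + chunk)
```

This assumes the titles and their content are siblings under the same parent, as in the question. If titles can repeat, a list of (title, chunks) pairs would be safer than a dict.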
Upvotes: 1