Parse HTML using BeautifulSoup depending on previous tag

Question

I have an HTML document in which some tagged text follows each title. Something like this:

<h1>Title 1</h1>
<p>Some text</p>
<p>Some other text</p>

<h1>Title 2</h1>
<p>Some text</p>
<p>Some text2</p>

<h1>Title 3</h1>
<p>Some text</p>
<p>Some other text</p>

(The only fixed thing is the number of titles, the rest can change)

How can I extract with BeautifulSoup all the HTML following each title but before the next one?

alecxe · Accepted Answer

You can pass a regular expression Title \d+ as a text argument and find all titles, then use find_next_siblings() to get the next two p tags:

import re
from bs4 import BeautifulSoup

data = """
    <h1>Title 1</h1>
    <p>Some text</p>
    <p>Some other text</p>

    <h1>Title 2</h1>
    <p>Some text</p>
    <p>Some text2</p>

    <h1>Title 3</h1>
    <p>Some text</p>
    <p>Some other text</p>
"""

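soup = BeautifulSoup(data, 'html.parser')

# Find every <h1> whose text matches "Title <number>"
for h1 in soup.find_all('h1', text=re.compile(r'Title \d+')):
    # Take the next two <p> siblings that follow each title
    for p in h1.find_next_siblings('p', limit=2):
        print(p.text.strip())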

prints:

Some text
Some other text
Some text
Some text2
Some text
Some other text

Or, using a list comprehension:

print([p.text.strip()
       for h1 in soup.find_all('h1', text=re.compile(r'Title \d+'))
       for p in h1.find_next_siblings('p', limit=2)])

prints (on Python 3):

['Some text', 'Some other text', 'Some text', 'Some text2', 'Some text', 'Some other text']
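
Since the question says only the number of titles is fixed, limit=2 may collect too few or too many tags when a title is followed by a different number of elements. One possible variation is to walk the siblings of each title and stop at the next h1. This is a minimal sketch, not part of the original answer: it assumes the titles are h1 tags and that the content sits at the same nesting level as its title. The function name sections_by_title and the sample markup are hypothetical, and string= is the name Beautiful Soup 4.4.0 and later use for the text argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: mirrors the question, but with a varying
# number of <p> tags after each <h1>.
html = """
<h1>Title 1</h1>
<p>Some text</p>
<p>Some other text</p>
<h1>Title 2</h1>
<p>Some text</p>
<h1>Title 3</h1>
<p>Some text</p>
<p>Some text2</p>
<p>Some other text</p>
"""


def sections_by_title(markup):
    """Map each title's text to the HTML of the tags that follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    sections = {}
    for h1 in soup.find_all("h1", string=re.compile(r"Title \d+")):
        parts = []
        # True matches every tag, so this iterates over all following siblings
        for sibling in h1.find_next_siblings(True):
            if sibling.name == "h1":
                break  # reached the next title, so this section is complete
            parts.append(str(sibling))  # keep the HTML, not just the text
        sections[h1.get_text(strip=True)] = parts
    return sections


sections = sections_by_title(html)
for title, parts in sections.items():
    print(title, parts)
```

Each dictionary value is a list of HTML strings for one title, so the association between a title and its content is kept instead of being flattened into a single list. If the titles and their content are nested inside different parent tags, sibling traversal will not reach them and a different approach would be needed.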
