get text after h1 using beautiful soup in Python

Question

I need to get the raw text of an HTML page, but only the text that comes after the h1 title.

I can get the h1 of the main body like this:

soup = BeautifulSoup(content.decode('utf-8','ignore'), 'html.parser')
extracted_h1 = soup.body.h1

My idea was to get all elements and compare each one to the h1 I extracted above, then append every element that comes after the h1 to a separate list, and finally call getText() on the saved elements.

# find all html elements
found = soup.findAll() # text=True
fill_element = list()
for element in found:
    # something like this, but it doesn't work
    if element == extracted_h1:
        # after this start appending the elements to fill_element list
        pass

But this doesn't work. Any ideas how this could be achieved?

Oliver W. · Accepted Answer

Why not call find_all_next(text=True) on the h1 tag and join the strings it returns?

Example:

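>>> import bs4
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2"><!-- START-->Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.<!-- END --></p>
...
... <p class="story">...</p>
... """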
>>> soup = bs4.BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.text)
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

>>> print(''.join(soup.find_all('p')[1].find_all_next(text=True)))

Once upon a time there were three little sisters; and their names were
Elsie,
 STARTLacie and
Tillie;
and they lived at the bottom of a well. END 
...
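
Applied to the h1 from the question, the same idea might look like the sketch below. The sample page and the helper name text_after_h1 are made up for illustration. It assumes Beautiful Soup 4.4 or later, where string=True is the preferred spelling of text=True. Because find_all_next also yields HTML comments and the strings inside the h1 itself, the sketch filters both out.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page, not taken from the question.
SAMPLE_HTML = """
<html><body>
<p>Navigation text before the title</p>
<h1>Main title</h1>
<p>First paragraph with a <a href="#">link</a>.</p>
<div><p>Second paragraph.</p></div>
</body></html>
"""


def text_after_h1(html):
    """Return the text following the first <h1> in <body>, or '' if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.body.h1 if soup.body else None
    if h1 is None:
        return ""
    pieces = []
    # find_all_next walks the document after the tag's opening,
    # so the first strings it yields are the h1's own contents.
    for s in h1.find_all_next(string=True):
        if isinstance(s, Comment):
            continue  # skip <!-- comments -->
        if any(parent is h1 for parent in s.parents):
            continue  # skip the title text itself
        pieces.append(s)
    # Collapse runs of whitespace left over from the markup layout.
    return " ".join("".join(pieces).split())


print(text_after_h1(SAMPLE_HTML))
```

The parents check uses identity (is) rather than ==, since two tags with the same name, attributes and contents compare equal in Beautiful Soup. Text inside script or style tags after the h1 would still be included; filter those out separately if the page has them.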
