Ivan Bilan

Reputation: 2439

Get text after h1 using Beautiful Soup in Python

I need to get the raw text of an HTML page, but only the text that comes after the h1 title.

I can get the h1 of the main body like this:

soup = BeautifulSoup(content.decode('utf-8','ignore'), 'html.parser')
extracted_h1 = soup.body.h1

My idea was to get all elements and compare each one to the h1 I extracted above. Once the h1 is found, I would append every following element to a separate list and then call getText() on each saved element.

# find all html elements
found = soup.findAll() # text=True
fill_element = list()
for element in found:
    # something like this, but it doesn't work
    if element == extracted_h1:
       # after this start appending the elements to fill_element list

But this doesn't work. Any ideas how this could be achieved?

Upvotes: 0

Views: 3890

Answers (2)

Oliver W.

Reputation: 13459

Why don't you try find_all_next on the h1 tag and collect the text nodes? The sample document below has no h1, so the example starts from the second p tag instead.

Example:

>>> import bs4
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p><!-- END -->
... <p class="story">...</p>
... """
...
>>> soup = bs4.BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.text)
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

>>> print(''.join(soup.find_all('p')[1].find_all_next(text=True)))

Once upon a time there were three little sisters; and their names were
Elsie,
 STARTLacie and
Tillie;
and they lived at the bottom of a well. END 
...
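Note that the output above includes the text of the HTML comments (START and END), because comments are also string nodes. If that is unwanted, one possible approach is to filter out Comment objects. This is only a sketch: the shortened html_doc and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Shortened, hypothetical version of the sample document above.
html_doc = """<p class="title"><b>Title</b></p>
<p class="story">Once upon a time <!-- START --><a href="http://example.com/lacie">Lacie</a> lived in a well.</p><!-- END -->
<p class="story">...</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
start = soup.find_all("p")[1]

# Keep every string node after the starting tag, except HTML comments.
visible = [s for s in start.find_all_next(string=True)
           if not isinstance(s, Comment)]
text = "".join(visible)
print(text)
```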

Upvotes: 1

Mattia Rossi

Reputation: 175

Supposing you are using BeautifulSoup 4.4 or later, you have this method:

soup.body.h1.find_all_next(string=True)

This returns all the strings that come after the opening tag of the first h1. The first item in the result is the text of the h1 itself.
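If the h1 text itself should be left out, one option is to skip every string that has the h1 among its parents. The following is a minimal sketch, assuming a made-up sample page; text_after_h1 is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
html = """
<html><body>
<p>Intro before the heading.</p>
<h1>Main title</h1>
<p>First paragraph.</p>
<div><p>Second <b>paragraph</b>.</p></div>
</body></html>
"""


def text_after_h1(markup):
    """Return the text that follows the first h1, excluding the h1's own text."""
    soup = BeautifulSoup(markup, "html.parser")
    h1 = soup.body.h1
    # find_all_next walks the document in order, starting inside the h1,
    # so strings whose ancestors include the h1 are dropped explicitly.
    after = [s for s in h1.find_all_next(string=True)
             if not any(parent is h1 for parent in s.parents)]
    return "".join(after).strip()


text = text_after_h1(html)
print(text)
```

The identity check (is) is used rather than == so that another h1 with identical content elsewhere in the page is not mistaken for the starting tag.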

Upvotes: 1
