Reputation: 2439
I need to get the raw text of an HTML page, but only the text that comes after the h1 title.
I can get the h1 of the main body like this:
soup = BeautifulSoup(content.decode('utf-8','ignore'), 'html.parser')
extracted_h1 = soup.body.h1
My idea was to get all elements and compare each one to the h1 I extracted above, then append every element that comes after the h1 to a separate list, and finally call getText() on the saved elements.
# find all html elements
found = soup.findAll()  # text=True
fill_element = list()
for element in found:
    # something like this, but it doesn't work
    if element == extracted_h1:
        # after this start appending the elements to fill_element list
        pass
But this doesn't work. Any ideas how this could be achieved?
Upvotes: 0
Views: 3890
Reputation: 13459
Why don't you try find_all_next on the h1 tag and get the text attributes?
Example:
>>> import bs4
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p><!-- END -->
... <p class="story">...</p>
... """
...
>>> soup = bs4.BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.text)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
>>> print(''.join(soup.find_all('p')[1].find_all_next(text=True)))
Once upon a time there were three little sisters; and their names were
Elsie,
STARTLacie and
Tillie;
and they lived at the bottom of a well. END
...
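Note that the START and END comments show up in the output above, because comments are also matched by find_all_next when filtering on text. If that is not wanted, one option is to filter them out. The following is only a sketch: the sample document and the helper name visible_text_after are made up for illustration, and it assumes the page actually contains an h1.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample document, made up for this sketch.
html_doc = """
<html><body>
<h1>Title</h1>
<p>First <!-- hidden note -->paragraph.</p>
<p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def visible_text_after(tag):
    """Join every text node from `tag` onwards, leaving comments out."""
    strings = tag.find_all_next(string=True)
    # Comment is a subclass of NavigableString, so it has to be excluded explicitly.
    return "".join(s for s in strings if not isinstance(s, Comment))


text_after_h1 = visible_text_after(soup.body.h1)
print(text_after_h1)
```

The result still starts with the text of the h1 itself, since the search begins with the tag's own children.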
Upvotes: 1
Reputation: 175
Supposing you are using BeautifulSoup 4.4, you have this method:
soup.body.h1.find_all_next(string=True)
This returns all text nodes after the first h1; the first one is the text of the h1 itself.
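If the text of the h1 should be left out, one possible approach is to count the text nodes inside the h1 and skip that many from the front of the result. This is a minimal sketch under assumptions: the sample document and the variable names (following, inside, after_h1) are invented for illustration, and stripping whitespace-only nodes is a choice that may not suit every page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this sketch.
html_doc = """
<html><body>
<p>Intro before the title.</p>
<h1>Main <b>title</b></h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
h1 = soup.body.h1

# Text nodes from the start of the h1 to the end of the document.
following = h1.find_all_next(string=True)
# Text nodes inside the h1 itself; they come first in `following`.
inside = h1.find_all(string=True)

# Drop the h1's own text, then strip and discard whitespace-only nodes.
after_h1 = [s.strip() for s in following[len(inside):] if s.strip()]
print(after_h1)  # ['First paragraph.', 'Second paragraph.']
```

Slicing by the number of nodes inside the h1 also handles a title with nested tags, such as the b element in this sample.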
Upvotes: 1