Reputation: 2825
I am trying to extract text from the following HTML structure:
<div class= "story-body story-content">
<p>
<br>
"the text I want to get"
<a href= "http://...>
<br>
"the text I want to get"
<a href="http:// ... >
.
.
I've already extracted the hyperlinks, but I don't know how to extract the text as well. So far I have tried:
names = []
for div in soup3.find_all("div", attrs={"class" : "story-body story-content"}):
    for t in div.find_all('br'):
        t = t.get_text()
        names.append(t)
But I only get:
[<br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'']
Upvotes: 2
Views: 6069
Reputation: 575
html = """
<div class= "story-body story-content">
<p>
<br>
"the text I want to get"
<a href= "http://...>
<br>
"the text I want to get"
<a href="http:// ... >
"""
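from bs4 import BeautifulSoup

s = BeautifulSoup(html, 'html.parser')
# next_sibling is the current spelling of the older nextSibling alias
s.br.next_sibling
This will return: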
'\n "the text I want to get"\n '
or, to trim the surrounding whitespace:
s.br.next_sibling.strip()
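The snippet above only reads the text after the first <br>. A minimal sketch of extending it to every <br> follows; the markup is a hypothetical, well-formed stand-in for the structure in the question (the example.com URLs and the link text are assumptions), and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, well-formed stand-in for the markup in the question.
html = """
<div class="story-body story-content">
<p>
<br>
first text
<a href="http://example.com/1">link 1</a>
<br>
second text
<a href="http://example.com/2">link 2</a>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for br in soup.find_all("br"):
    sibling = br.next_sibling
    # Keep only text nodes; skip tags and whitespace-only strings.
    if isinstance(sibling, NavigableString) and sibling.strip():
        names.append(sibling.strip())

print(names)  # ['first text', 'second text']
```

The isinstance check matters because the node after a <br> can be another tag (or nothing at all), in which case there is no text to strip.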
Upvotes: 0
Reputation: 12178
for div in soup3.find_all("div", attrs={"class" : "story-body story-content"}):
    text_list = [text for text in div.stripped_strings]
Use stripped_strings to get all the non-empty strings under a tag.
The <br> tag inserts a single line break; it does not contain any text, which is why calling get_text() on it returns an empty string.
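A small sketch of both points, again using a hypothetical well-formed version of the question's markup (URLs and link text are assumptions). It also shows one caveat: stripped_strings walks every descendant, so link text is included too. Filtering on the parent tag, as done here with find_all(string=True), is one possible way to drop it.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the markup in the question.
html = """
<div class="story-body story-content">
<p>
<br>
first text
<a href="http://example.com/1">link 1</a>
<br>
second text
<a href="http://example.com/2">link 2</a>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="story-body")

# <br> is an empty element, so its text is the empty string.
br_text = soup.br.get_text()

# stripped_strings yields every non-empty string, link text included.
all_strings = list(div.stripped_strings)

# To leave out link text, look at each text node's parent tag.
text_only = [
    s.strip()
    for s in div.find_all(string=True)
    if s.strip() and s.parent.name != "a"
]

print(repr(br_text))  # ''
print(all_strings)    # ['first text', 'link 1', 'second text', 'link 2']
print(text_only)      # ['first text', 'second text']
```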
Upvotes: 4