Asteroid098

Reputation: 2825

How to extract text between tags using BeautifulSoup in Python

I am trying to extract text from the following html structure:

<div class= "story-body story-content">
 <p>
  <br>
  "the text I want to get"
  <a href= "http://...>
  <br>
  "the text I want to get"
  <a href="http:// ... >
  .
  .

I've already extracted the hyperlinks, but I don't know how to extract the text as well. So far I have tried:

names = []
for div in soup3.find_all("div", attrs={"class" : "story-body story-content"}):
    for t in div.find_all('br'):
        t = t.get_text()
        names.append(t)

But I only get:

[<br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'']

Upvotes: 2

Views: 6069

Answers (2)

Mark

Reputation: 575

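from bs4 import BeautifulSoup

html = """
<div class= "story-body story-content">
<p>
<br>
"the text I want to get"
<a href= "http://...>
<br>
"the text I want to get"
<a href="http:// ... >
"""
s = BeautifulSoup(html, 'html.parser')
# next_sibling is the text node that follows the first <br>
# (nextSibling is the older spelling of the same attribute)
s.br.next_sibling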

This returns the text node, including its surrounding whitespace:

'\n  "the text I want to get"\n  '

Or, to drop the surrounding whitespace:

s.br.next_sibling.strip()
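
The lines above only reach the first <br>. A minimal sketch of extending the same idea to every <br> follows; the markup here is a hypothetical, well-formed stand-in for the question's HTML (the example.com URLs and the placeholder texts are assumptions, not the real page).

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, well-formed stand-in for the markup in the question.
html = """
<div class="story-body story-content">
 <p>
  <br>
  first text
  <a href="http://example.com/a">link a</a>
  <br>
  second text
  <a href="http://example.com/b">link b</a>
 </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for br in soup.find_all("br"):
    sibling = br.next_sibling
    # Keep only bare text nodes; skip tags and whitespace-only strings.
    if isinstance(sibling, NavigableString) and sibling.strip():
        names.append(sibling.strip())

print(names)
```

The isinstance check matters when a <br> is followed directly by another tag, in which case next_sibling is a Tag rather than a string.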

Upvotes: 0

宏杰李

Reputation: 12178

text_list = []
for div in soup3.find_all("div", attrs={"class" : "story-body story-content"}):
    # stripped_strings yields each non-empty string with whitespace trimmed
    text_list.extend(div.stripped_strings)

Use stripped_strings to get all the non-empty strings under a tag.

The <br> tag only inserts a single line break; it does not contain any text, which is why calling get_text() on it returns an empty string.
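
Note that stripped_strings also yields the text inside the <a> tags. As a rough sketch, assuming the goal is only the text outside the links, the strings can be filtered by their parent tag. The markup below is a hypothetical, well-formed stand-in for the question's HTML, and the names all_text and outside_links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the markup in the question.
html = """
<div class="story-body story-content">
 <p>
  <br>
  first text
  <a href="http://example.com/a">link a</a>
  <br>
  second text
  <a href="http://example.com/b">link b</a>
 </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

all_text = []       # every non-empty string, link text included
outside_links = []  # only strings whose direct parent is not an <a>

for div in soup.find_all("div", attrs={"class": "story-body story-content"}):
    all_text.extend(div.stripped_strings)
    outside_links.extend(
        s.strip() for s in div.strings
        if s.strip() and s.parent.name != "a"
    )

print(all_text)
print(outside_links)
```

The filter uses div.strings rather than div.stripped_strings because the former yields NavigableString objects, which still carry a .parent reference.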

Upvotes: 4
