psmith

Reputation: 1813

Remove tags from div

I have a simple Python script:

from bs4 import BeautifulSoup
import urllib2

webpage = urllib2.urlopen('http://fakepage.html')
soup = BeautifulSoup(webpage)

for anchor in soup.find_all("div", id="description"):
    print anchor

This almost gives me what I want, but the output between <div id=description> and </div> still contains lots of tags:

<div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>

I would like to get only the text (not the tags) between <div id=description> and </div> so that I can count the words. Is there a function in BeautifulSoup that can help me?

Upvotes: 0

Views: 105

Answers (1)

Martijn Pieters

Reputation: 1122022

Use the element.get_text() method to get just the text:

for anchor in soup.find_all("div", id="description"):
    print anchor.get_text()

Pass strip=True to strip the whitespace around each piece of text; the first argument is the separator used to join the stripped strings:

for anchor in soup.find_all("div", id="description"):
    print anchor.get_text(' ', strip=True)

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>
... '''
>>> soup = BeautifulSoup(sample)
>>> for anchor in soup.find_all("div", id="description"):
...     print anchor.get_text()
... 
some text to show  lots of  useless tags 
>>> for anchor in soup.find_all("div", id="description"):
...     print anchor.get_text(' ', strip=True)
... 
some text to show lots of useless tags
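
Since the goal in the question is to count the words, here is a minimal sketch of that last step. It is written in Python 3 syntax (the code above is Python 2), it names the "html.parser" parser explicitly, and the helper name count_words is hypothetical, not part of BeautifulSoup. It assumes that "words" simply means whitespace-separated tokens:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real page would be fetched first.
sample = (
    '<div id="description"><div class="t"><p>some text to show <br><br> '
    'lots of <b> useless</b> tags </br></br></p></div></div>'
)


def count_words(html, div_id="description"):
    """Count whitespace-separated words inside the div with the given id.

    count_words is a hypothetical helper for illustration only.
    Returns 0 when no matching div is found.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None:
        return 0
    # Join the stripped strings with a single space, then split on whitespace.
    text = div.get_text(" ", strip=True)
    return len(text.split())


word_count = count_words(sample)
print(word_count)
```

Splitting on whitespace treats punctuation attached to a word as part of that word, which is usually fine for a rough count; a stricter definition of a word would need a different tokenizer.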

Upvotes: 2
