Reputation: 1813
I have a simple piece of Python code:
from bs4 import BeautifulSoup
import urllib2
webpage = urllib2.urlopen('http://fakepage.html')
soup = BeautifulSoup(webpage)
for anchor in soup.find_all("div", id="description"):
    print anchor
I almost get what I want, but between <div id=description> and </div> I get lots of tags:
<div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>
I would like to get only the text (not the tags) between <div id=description> and </div> so that I can count the words.
Is there any function in BeautifulSoup that can help me?
Upvotes: 0
Views: 105
Reputation: 1122022
Use the element.get_text() method to get just the text:
for anchor in soup.find_all("div", id="description"):
    print anchor.get_text()
You can pass in strip=True to strip leading and trailing whitespace from each string; the first argument is the separator used to join the stripped strings:
for anchor in soup.find_all("div", id="description"):
    print anchor.get_text(' ', strip=True)
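To see why the separator matters, here is a minimal sketch in Python 3 (the snippets above are Python 2). The markup is a made-up sample, not the asker's page, and the "html.parser" backend is an assumption chosen so that nothing beyond bs4 needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: "one" and "two" touch, with only a tag between them.
html = '<div id="description"><p>one<b>two</b> three</p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="description")

# No arguments: the text nodes are concatenated exactly as they appear.
plain = div.get_text()

# Separator plus strip=True: each text node is stripped, then joined with " ".
joined = div.get_text(" ", strip=True)

print(repr(plain))
print(repr(joined))
```

Without a separator, words split only by a tag run together, which would skew a later word count.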
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>
... '''
>>> soup = BeautifulSoup(sample)
>>> for anchor in soup.find_all("div", id="description"):
...     print anchor.get_text()
...
some text to show lots of useless tags
>>> for anchor in soup.find_all("div", id="description"):
...     print anchor.get_text(' ', strip=True)
...
some text to show lots of useless tags
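Since the goal in the question is a word count, one possible way to finish the job is sketched below in Python 3. The helper name count_words and the "html.parser" choice are assumptions for illustration, and "word" here simply means a whitespace-separated token.

```python
from bs4 import BeautifulSoup


def count_words(html, element_id="description"):
    """Return one word count per <div> whose id matches element_id.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    counts = []
    for div in soup.find_all("div", id=element_id):
        # Join the stripped text nodes with a space so adjacent words stay apart.
        text = div.get_text(" ", strip=True)
        # str.split() with no arguments splits on any run of whitespace.
        counts.append(len(text.split()))
    return counts


sample = (
    '<div id="description"><div class="t"><p>some text to show <br><br>'
    ' lots of <b> useless</b> tags </br></br></p></div></div>'
)
counts = count_words(sample)
print(counts)
```

If punctuation or hyphenated words need special handling, the split() call is the place to swap in a regular expression.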
Upvotes: 2