Reputation: 79
I need to extract all the visible text inside paragraph elements in an HTML file, ignoring any nested tags, using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.
P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?
Upvotes: 3
Views: 31407
Reputation: 2932
Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the removed tags.
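from bs4 import BeautifulSoup

VALID_TAGS = ['div', 'p']

# value holds the HTML source as a string
soup = BeautifulSoup(value, 'html.parser')
# Visit every tag, not only <p>, so that nested tags such as <a> are checked
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        tag.unwrap()  # drop the tag itself but keep its contents
print(soup.decode())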
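If you only want the text of each paragraph rather than cleaned-up markup, something along these lines might be simpler. This is a minimal sketch assuming the bs4 package (Beautiful Soup 4) on Python 3; `paragraph_texts` and `sample` are names made up for illustration, and the Hindi paragraph is just test data to show that Unicode passes through unchanged.

```python
from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the visible text of every <p> element, with nested tags stripped."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates all text nodes under the tag, dropping the markup
    return [p.get_text() for p in soup.find_all("p")]


# Hypothetical sample input: the paragraph from the question plus a Hindi one
sample = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम एक <b>फल</b> है।</p>'
)

for text in paragraph_texts(sample):
    print(text)
```

For files on disk, one option is to open them in binary mode and pass the bytes to BeautifulSoup so it can try to detect the encoding itself. If you already know the encoding, passing from_encoding="utf-8" should avoid the guesswork.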
Upvotes: 6