rarora7777

Reputation: 79

How to find all text inside <p> elements in an HTML page using BeautifulSoup

I need to find all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.

P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?

Upvotes: 3

Views: 31407

Answers (2)

silent1mezzo

Reputation: 2932

Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the removed tags.

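from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "BeautifulSoup" module is version 3

VALID_TAGS = ['div', 'p']

# value is the HTML string to clean
soup = BeautifulSoup(value, 'html.parser')

# Visit every tag (not only <p>), otherwise nothing would ever be filtered out
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        tag.unwrap()  # drop the tag itself but keep its contents

print(str(soup))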

Reference
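
To make the effect concrete, here is a minimal sketch applied to the sample paragraph from the question. It assumes the bs4 package (BeautifulSoup 4) and Python's built-in html.parser; the helper name strip_tags is hypothetical and only there for illustration.

```python
from bs4 import BeautifulSoup

VALID_TAGS = ['div', 'p']


def strip_tags(html, valid_tags=VALID_TAGS):
    """Hypothetical helper: unwrap every tag not listed in valid_tags."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all(True) matches every tag in the document
    for tag in soup.find_all(True):
        if tag.name not in valid_tags:
            tag.unwrap()  # remove the tag, keep its children in place
    return str(soup)


sample = ('<p>Many hundreds of named mango '
          '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>')
print(strip_tags(sample))
```

Note that the surrounding <p> tag stays in the output because it is in VALID_TAGS; only the <a> wrapper is removed.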

Upvotes: 6

0x90

Reputation: 40982

soup.findAll('p')

here is a reference
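
For the plain text the question asks for, one possible follow-up is to call get_text() on each matched paragraph. This is a sketch assuming the bs4 package (BeautifulSoup 4) on Python 3, where strings are Unicode, so Hindi text needs no special handling; the function name paragraph_texts and the Hindi sample sentence are made up for illustration.

```python
from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the text of every <p> element, with nested tags flattened."""
    soup = BeautifulSoup(html, 'html.parser')
    # get_text() joins all text nodes inside the tag, dropping the markup
    return [p.get_text() for p in soup.find_all('p')]


sample = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम एक <b>फल</b> है।</p>'  # illustrative Hindi paragraph
)

for text in paragraph_texts(sample):
    print(text)
```

When reading from a file, opening it with an explicit encoding (for example open(path, encoding='utf-8'), assuming the file is UTF-8) should keep the Hindi characters intact.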

Upvotes: 14
