interstar

Reputation: 27186

Is it possible to use Python / BeautifulSoup to strip all tags from a chunk of HTML EXCEPT anchor / links?

I have a chunk of HTML and I'd like to strip all tags to leave it as plaintext, EXCEPT for keeping the <a href="url">some text</a> links.

Is this possible / simple in BeautifulSoup?

Upvotes: 3

Views: 1230

Answers (1)

Spencer

Reputation: 1522

Try this.

import BeautifulSoup  # BeautifulSoup 3 API

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)

# Walk every node in document order.
for tag in soup.recursiveChildGenerator():
    if not isinstance(tag, BeautifulSoup.Tag):
        continue  # skip bare text nodes
    if tag.name == 'a':
        print(tag)  # keep links as markup
    elif tag.string is not None:
        # tag.string is None when a tag has more than one child
        print(tag.string)
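
If you want a single string back rather than printed fragments, here is a minimal sketch that assumes the newer bs4 package (BeautifulSoup 4) instead of the BeautifulSoup 3 module used above. The function name strip_tags_except_links is made up for illustration. It relies on Tag.unwrap(), which replaces a tag with its own children, so the text stays in place and only the markup goes away.

```python
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package)


def strip_tags_except_links(html):
    """Return html with every tag removed except <a>; text is kept in place.

    Hypothetical helper name, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) matches every tag and returns a list, so it is safe
    # to modify the tree while looping over the result.
    for tag in soup.find_all(True):
        if tag.name != "a":
            tag.unwrap()  # replace the tag with its children
    return str(soup)


if __name__ == "__main__":
    sample = '<p>This is <i>paragraph</i> <a href="url">one</a>.</p>'
    print(strip_tags_except_links(sample))
    # prints: This is paragraph <a href="url">one</a>.
```

Note that this keeps the text of every element, including <title>, <script> and <style>. If that text should not survive, call decompose() on those tags before the loop.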

Upvotes: 3
