Mehdi Rahimi

Reputation: 177

Remove all HTML tags except one with BeautifulSoup

I need to extract all the text from a page while keeping the <a> tags intact, but I don't know how to do it. Here is what I have so far:

from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]): # remove all javascript and stylesheet code
        script.decompose()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

testhtml = '<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text with this <a href="http://example.com/">link</a> captured.</body>'
cleaned = cleanMe(testhtml)
print(cleaned)

Output:

THIS IS AN EXAMPLE I need this text with this link captured.

My desired output:

THIS IS AN EXAMPLE I need this text with this <a href="http://example.com/">link</a> captured.

Upvotes: 3

Views: 2514

Answers (2)

Aminah Nuraini

Reputation: 19206

Consider using another library besides BeautifulSoup. I use this:

from bleach import clean

# strip=True drops disallowed tags but keeps their inner text
def strip_html(src, allowed=('a',)):
    return clean(src, tags=allowed, strip=True, strip_comments=True)

Upvotes: 8

kaza

Reputation: 2327

Consider the following:

import re
from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, 'html.parser')  # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
        script.decompose()
    # get text
    text = soup.get_text()
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            repl = link.get_text()
            href = link.attrs['href']
            # rebuild the anchor with only its href attribute and its plain text
            link.clear()
            link.attrs = {'href': href}
            link.append(repl)
            # replace the first occurrence of the link text that is not already
            # followed by </a>; re.escape guards against regex metacharacters in
            # the link text, and the function replacement keeps any backslashes
            # in the anchor literal
            pattern = re.escape(repl) + r'(?! *</a>)'
            text = re.sub(pattern, lambda m, anchor=str(link): anchor, text, count=1)

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

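The new part is this loop:

    for link in soup.find_all('a'):
        text = re.sub(re.escape(link.get_text()) + r'(?! *</a>)', lambda m: str(link), text, count=1)

For each anchor, the first occurrence of its text in the extracted text is replaced with the whole anchor. count=1 limits this to a single replacement per anchor.

The pattern re.escape(link.get_text()) + r'(?! *</a>)' makes sure the link text is replaced only where it has not been replaced already. re.escape keeps regex metacharacters in the link text from being interpreted as pattern syntax.

(?! *</a>) is a negative lookahead: it rejects any occurrence of the link text that is directly followed by </a>, that is, one that already sits inside a restored anchor. Take care not to write it as (?!= *?</a>): the = is then matched literally and the lookahead never rejects anything.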
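The effect of the lookahead can be checked in isolation with the standard re module. The sketch below uses made-up sample text (the strings and variable names are assumptions, not part of the answer) in which the first "link" is already wrapped in an anchor, and compares a lookahead that has a stray = after ?! with one that does not.

```python
import re

# Made-up sample: the first "link" already sits inside an anchor.
text = 'see <a href="/a">link</a> and another link here'
anchor = '<a href="/b">link</a>'

# Lookahead with a stray "=": it only rejects a literal "=" before </a>,
# so it does not skip the already wrapped occurrence.
with_equals = re.escape("link") + r"(?!= *?</a>)"
# Lookahead without "=": skips an occurrence directly followed by </a>.
without_equals = re.escape("link") + r"(?! *</a>)"

# A function replacement inserts the anchor text literally.
nested = re.sub(with_equals, lambda m: anchor, text, count=1)
fixed = re.sub(without_equals, lambda m: anchor, text, count=1)

print(nested)
print(fixed)
```

With the stray =, the second anchor ends up nested inside the first one; without it, the second anchor lands on the still unwrapped "link".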

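But this is not the most foolproof way, because it matches link text in the flattened text, so identical text earlier in the page can get wrapped instead. A more robust approach is to go through each tag and pull the text out.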
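One possible reading of "go through each tag" is to edit the tree instead of the flattened text: drop script and style, then unwrap every tag except <a>. This is only a sketch of that idea, not the answerer's code; the name keep_links is hypothetical, and it assumes that keeping just the href attribute and collapsing all whitespace is acceptable.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype


def keep_links(html):
    """Return the page text with only <a href="..."> tags left in place."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Remove comments and the doctype, which would otherwise be serialized.
    for node in list(soup.descendants):
        if isinstance(node, (Comment, Doctype)):
            node.extract()

    # Visit every remaining tag: trim anchors, unwrap everything else.
    for tag in soup.find_all(True):
        if tag.name == "a":
            href = tag.get("href")
            tag.attrs = {"href": href} if href else {}
        else:
            tag.unwrap()  # keep the children, drop the tag itself

    # Serialize and collapse runs of whitespace (assumption: layout is not
    # needed; this would also collapse whitespace inside attribute values).
    return " ".join(str(soup).split())


sample = ('<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title>'
          '<style>.call {font-family:Arial;}</style><script>getit</script>'
          '<body>I need this text with this '
          '<a href="http://example.com/">link</a> captured.</body>')
print(keep_links(sample))
```

Since the anchors never leave the tree, there is no need to match link text with a regex, so repeated link text is not a problem. Note that the serialized output keeps HTML escaping (for example & becomes &amp;).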


Upvotes: 0
