Jacquelyn.Marquardt

Reputation: 630

BeautifulSoup: remove all HTML tags except those in a whitelist (such as "img" and "a") with Python

Given some HTML, how can I remove all the tags except img and a, while keeping the text? For example, I have

<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>

I want to keep

Hello all <a href ="xx"></a> <img rscr="xx"></img>

Is there something implemented in BeautifulSoup or Python?

Thanks

Upvotes: 5

Views: 3417

Answers (2)

Josh Crozier

Reputation: 241078

You could select all of the descendant nodes by accessing the .descendants property.

From there, you could iterate over the descendants and filter them by their name attribute. If a node has no name (its name is None), it is likely a text node, which you want to keep. If the name is "a" or "img", you keep it as well.

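from bs4 import BeautifulSoup

html = '<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# This should be the wrapper that you are targeting
container = soup.find('div')
keep = []

for node in container.descendants:
    # Text nodes have no name; keep them along with <a> and <img> tags
    if not node.name or node.name in ('a', 'img'):
        keep.append(node)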

Here is an alternative that builds the list directly with a list comprehension:

# This should be the wrapper that you are targeting
container = soup.find('div')

keep = [node for node in container.descendants
        if not node.name or node.name in ('a', 'img')]

Also, if you don't want empty or whitespace-only strings in the result, you can strip each text node and keep only those that are non-empty:

keep = [node for node in container.descendants
        if (not node.name and node.strip()) or
           node.name in ('a', 'img')]

With the whitespace check applied to the HTML that you provided, the list would contain:

> ['Hello all ', <a href="xx"></a>, <img rscr="xx"/>]
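
The list above holds individual nodes rather than markup. If the goal is a single HTML string with only the whitelisted tags left in place, one possible approach (a sketch, not part of the original answer) is to modify the tree in place: remove unwanted tags together with their contents using extract(), and replace every other non-whitelisted tag with its children using unwrap(). The function name strip_tags and the WHITELIST and DROP sets are hypothetical names chosen for this example; treating script and style as "drop with contents" is an assumption.

```python
from bs4 import BeautifulSoup

html = '<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'

WHITELIST = {"a", "img"}      # tags to keep as-is (taken from the question)
DROP = {"script", "style"}    # assumption: remove these together with their contents


def strip_tags(markup, whitelist=WHITELIST, drop=DROP):
    """Return markup with every tag outside the whitelist removed, text kept."""
    soup = BeautifulSoup(markup, "html.parser")

    # First pass: detach unwanted tags and everything inside them.
    for tag in soup.find_all(list(drop)):
        tag.extract()

    # Second pass: replace each remaining non-whitelisted tag with its children.
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.unwrap()

    return str(soup)


result = strip_tags(html)
print(result)
```

The two passes are kept separate so that the second find_all(True) only sees tags that are still in the tree. Attributes on the kept tags are left untouched here; filtering attributes would need an extra step.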

Upvotes: 3

宏杰李

Reputation: 12168

import bs4

html = '''<div><script bla bla></script><p>Hello all <a href ="xx"></a> <img rscr="xx"></img></p></div>'''

soup = bs4.BeautifulSoup(html, 'lxml')
soup.div.text, soup.div.find_next('a'), soup.div.find_next('img')

out:

('Hello all  ', <a href="xx"></a>, <img rscr="xx"/>)

When the element you want is a descendant of the tag, there is a shortcut:

soup.div.text, soup.div.a, soup.div.img

out:

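('Hello all  ', <a href="xx"></a>, <img rscr="xx"/>)

  1. BeautifulSoup treats img as a void element, so it is written out as a self-closing tag (<img rscr="xx"/>) and the stray </img> is dropped.
  2. You can always use find_next to get the next matching element in document order; unlike the shortcut, it is not limited to the tag's descendants.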
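
To illustrate how the attribute shortcut and find_next differ, here is a minimal sketch. The HTML snippet is made up for this example and is not from the question: the link sits after the div rather than inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> tag is a sibling of <div>, not a descendant.
html = '<div><p>no link here</p></div><a href="outside"></a>'
soup = BeautifulSoup(html, "html.parser")

inside = soup.div.a               # looks only at descendants of <div>
after = soup.div.find_next("a")   # looks at everything after <div> in document order

print(inside)   # None
print(after)    # <a href="outside"></a>
```

So find_next can return an element outside the tag you started from, which matters if you only want matches inside a specific container.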

Upvotes: 0
