BeautifulSoup Remove all html tags except for those in whitelist such as "img" and "a" tags with python

Question

Given some HTML, how can I remove all the tags except img and a, while keeping the text? For example, I have

Hello all

I want to keep

Hello to

Is there something implemented in BeautifulSoup or Python?

Thanks

Josh Crozier · Accepted Answer

You could select all of the descendant nodes by accessing the .descendants property.

From there, you could iterate over all of the descendants and filter them based on the name attribute. If a node's name is None, it is a string node (most likely text), which you want to keep. If the name is a or img, you keep it as well.

# This should be the wrapper that you are targeting
container = soup.find('div')
keep = []

for node in container.descendants:
    if not node.name or node.name in ('a', 'img'):
        keep.append(node)

Alternatively, you can build the list directly with a list comprehension:

# This should be the wrapper that you are targeting
container = soup.find('div')

keep = [node for node in container.descendants
        if not node.name or node.name in ('a', 'img')]

Also, if you don't want whitespace-only strings to be returned, you can strip each string and keep it only if something remains:

keep = [node for node in container.descendants
        if (not node.name and node.strip()) or
           node.name in ('a', 'img')]

Based on the HTML that you provided, the following would be returned:

> ['Hello all ', , ]
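
For a version you can run end to end, here is a minimal sketch. The sample markup and the ALLOWED name are made up for illustration (the question's original HTML is not reproduced here). It first applies the same descendants filter as above, then shows a different approach that is not part of the filter above: calling unwrap() on every tag outside the whitelist, which may be closer to what the question asks for if you want a cleaned HTML string rather than a list of nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<div><p>Hello <b>all</b> <a href="/x">link</a><img src="pic.png"/></p></div>'

ALLOWED = ("a", "img")  # whitelist of tag names to keep (assumed name)

soup = BeautifulSoup(html, "html.parser")

# 1) The filter from the answer: keep non-empty strings plus whitelisted tags.
container = soup.find("div")
keep = [node for node in container.descendants
        if (not node.name and node.strip()) or node.name in ALLOWED]

# 2) Alternative: unwrap() removes a tag but leaves its children in place,
#    so only the whitelisted tags and the text remain in the tree.
for tag in soup.find_all(True):  # find_all(True) matches every tag
    if tag.name not in ALLOWED:
        tag.unwrap()

cleaned = str(soup)
print(keep)
print(cleaned)
```

Note that with the descendants filter, text inside a kept tag shows up twice: once as part of the a tag and once as its own string node ('link' in this sample). The unwrap() variant does not have that duplication, since it edits the tree in place.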
