Reputation: 13642
The proper way to strip (not remove) specified tags from an HTML string using Python
def strip_tags(html, tags=()):
    ...
    pass  # return the HTML string with the tags from the list stripped
The question title explains it all.
I need to write a Python function that takes an HTML string and a list of tags to strip, mimicking the Django template filter removetags, which is deprecated.
What's the simplest way of doing this?
The following approaches didn't work for me, for the reasons listed:
Regular expressions, for the well-known reason that they cannot reliably parse HTML.
The clean() method of the Bleach library. Surprisingly, such a robust library does not fit this requirement: it follows a whitelist-first approach, while this problem is blacklist-first. Bleach is useful for keeping certain tags, but not for removing certain tags (unless you are ready to maintain a huge list of every other possible tag in ALLOWED_TAGS).
lxml.html.Cleaner() combined with remove_tags or kill_tags. This comes closer to what I was looking for, but it removes more than it is supposed to, and there is no fine-grained control over its behaviour, such as asking Cleaner() to keep the evil <script> tag.
BeautifulSoup. It has a method called clear() for removing the specified tags, but it also removes the content of those tags, while I need to strip only the tags and keep the content.
Upvotes: 2
Views: 833
Reputation: 23134
You can extend Python's HTMLParser and create your own parser that skips the specified tags.
Here is the example from the given link, modified to strip <h1></h1> tags but keep their data:
from html.parser import HTMLParser

NOT_ALLOWED_TAGS = ['h1']

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Report the start tag only if it is not in the blacklist
        if tag not in NOT_ALLOWED_TAGS:
            print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        # Report the end tag only if it is not in the blacklist
        if tag not in NOT_ALLOWED_TAGS:
            print("Encountered an end tag :", tag)

    def handle_data(self, data):
        # Data is always reported, even inside a blacklisted tag
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
That will print:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
# h1 start tag here
Encountered some data : Parse me!
# h1 close tag here
Encountered an end tag : body
Encountered an end tag : html
You can now maintain a NOT_ALLOWED_TAGS list of the tags you want stripped.
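The parser above only prints what it encounters. To get the strip_tags(html, tags) function from the question, the same idea can be turned into a parser that collects the pieces it keeps and joins them back into a string. This is a minimal sketch, not a definitive implementation: the class name TagStripper and its parts attribute are hypothetical names chosen for illustration, and it assumes reasonably well-formed input (processing instructions and CDATA sections are not handled).

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Rebuild the input, leaving out the listed tags but keeping their content.

    TagStripper and .parts are illustrative names, not part of any library.
    """

    def __init__(self, tags):
        # convert_charrefs=False makes entities such as &amp; reach the
        # entity handlers below, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.tags = {t.lower() for t in tags}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            # get_starttag_text() returns the start tag as it appeared in the input
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>
        if tag not in self.tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept, even inside a stripped tag
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_tags(html, tags=()):
    """Return html with the given tags removed and their content kept."""
    parser = TagStripper(tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags('<html><head><title>Test</title></head>'
                 '<body><h1>Parse me!</h1></body></html>', ['h1']))
```

Passing the tag list to the constructor instead of using a module-level constant lets one function serve different blacklists.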
Upvotes: 1
Reputation: 29977
Beautiful Soup has unwrap():
It replaces a tag with whatever’s inside that tag.
You will have to manually iterate over all tags you want to replace.
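The manual iteration could look like the sketch below. It assumes the bs4 package is installed and uses the built-in "html.parser" backend. The function name strip_tags follows the signature in the question, and the sample input is made up for illustration.

```python
from bs4 import BeautifulSoup


def strip_tags(html, tags=()):
    """Return html with the given tags unwrapped: tags removed, content kept."""
    soup = BeautifulSoup(html, "html.parser")
    if tags:
        # find_all() accepts a list of tag names and returns every match
        for tag in soup.find_all(list(tags)):
            # unwrap() replaces the tag with whatever is inside it
            tag.unwrap()
    return str(soup)


print(strip_tags('<p>Hello <b>bold</b> <i>world</i></p>', ['b']))
```

Because the document is parsed and serialized again, the returned markup may be normalized slightly compared with the input (for example, attribute quoting).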
Upvotes: 2