Reputation: 3538
I'm doing some HTML cleaning with BeautifulSoup, and I'm new to both Python and BeautifulSoup. I'm already removing tags correctly as follows, based on an answer I found elsewhere on Stack Overflow:
[s.extract() for s in soup('script')]
But how do I remove inline styles? For instance, the following:
<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">
Should become:
<p>Text</p>
<img href="somewhere.com">
How can I delete the inline class, id, name and style attributes of all elements?
Answers to other similar questions I could find all mentioned using a CSS parser to handle this, rather than BeautifulSoup, but as the task is simply to remove rather than manipulate the attributes, and is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.
Upvotes: 19
Views: 27529
Reputation: 11
I achieved this with a regular expression, using Python's re module.
import re
def removeStyle(html):
    # Raw string avoids invalid-escape warnings; the lazy .*? stops at the closing quote.
    style = re.compile(r' style=.*?".*?"')
    return style.sub('', html)
html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
print(removeStyle(html))
Output: <p class="author" id="author_id" name="author_name">Text</p>
You can use this to strip any inline attribute by replacing "style" in the regex with the attribute's name.
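The same idea can be stretched to cover several attributes in one pass. The following is only a sketch: `strip_attributes` is a hypothetical helper name, and it assumes every attribute value is quoted. Regular expressions do not understand HTML, so text content that happens to look like an attribute would be stripped as well.

```python
import re

def strip_attributes(html, names):
    """Remove every quoted occurrence of the given attributes (hypothetical helper)."""
    # Build an alternation such as class|id|name|style, escaping each name defensively.
    alternation = "|".join(re.escape(name) for name in names)
    # \s+ consumes the whitespace before the attribute; the value may be double- or single-quoted.
    pattern = re.compile(r'\s+(?:%s)\s*=\s*(?:"[^"]*"|\'[^\']*\')' % alternation, re.IGNORECASE)
    return pattern.sub("", html)

html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
print(strip_attributes(html, ["class", "id", "name", "style"]))  # <p>Text</p>
```

Requiring whitespace before the attribute name keeps the pattern from matching inside values such as `author_id`.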
Upvotes: 0
Reputation: 4151
What about lxml's Cleaner?
from lxml.html.clean import Cleaner
content_without_styles = Cleaner(style=True).clean_html(content)
Upvotes: 2
Reputation: 950
Not perfect but short:
' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)])
Upvotes: 0
Reputation: 9689
Here's my solution for Python3 and BeautifulSoup4:
def remove_attrs(soup, whitelist=tuple()):
    for tag in soup.findAll(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup
It supports a whitelist of attributes which should be kept. :) If no whitelist is supplied all the attributes get removed.
Upvotes: 4
Reputation: 69
Based on jmk's function, I use this function to remove attributes based on a whitelist.
It works in Python 2 with BeautifulSoup 3:
def clean(tag, whitelist=[]):
    tag.attrs = None
    for e in tag.findAll(True):
        # Iterate over a copy: deleting from e.attrs while looping over it skips entries.
        for attribute in list(e.attrs):
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        #e.attrs = None  # delete all attributes
    return tag
# Example: keep only title and href
clean(soup, ["title", "href"])
Upvotes: 1
Reputation: 15680
I wouldn't do this in BeautifulSoup: you'll spend a lot of time trying, testing, and working around edge cases.
Bleach does exactly this for you. http://pypi.python.org/pypi/bleach
If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.
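As a rough illustration of that whitelist approach in BeautifulSoup 4, here is a minimal sketch. The `ALLOWED` policy and the `sanitize` name are made up for this example and are not taken from Bleach. It is not a security-grade sanitizer: for instance, unwrapping a script tag would leave its source behind as text.

```python
from bs4 import BeautifulSoup

# Hypothetical policy: tag name -> attributes that tag may keep.
# Tags missing from the mapping are removed, but their children are kept.
ALLOWED = {
    "p": set(),
    "a": {"href", "title"},
    "img": {"src", "alt"},
}

def sanitize(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns every tag as a list, so the tree can be edited while looping.
    for tag in soup.find_all(True):
        allowed = ALLOWED.get(tag.name)
        if allowed is None:
            # Tag is not whitelisted: replace it with its own contents.
            tag.unwrap()
        else:
            # Rebuild the attribute dict with only the permitted names.
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}
    return str(soup)

print(sanitize('<div><p class="x" style="color:red">Hi <a href="/a" onclick="evil()">link</a></p></div>'))
```

Rebuilding `tag.attrs` instead of deleting keys one by one avoids modifying the dict while iterating over it.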
Upvotes: 11
Reputation: 1988
You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():
[tag.decompose() for tag in soup("script")]
Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
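Putting both pieces together on the markup from the question might look like the sketch below. It assumes BeautifulSoup 4 with the built-in "html.parser" backend. The script element in the sample input is added here only to demonstrate decompose(). The sketch uses tag.attrs.pop() as a variation on del tag[attribute], relying on tag.attrs being a plain dict in BeautifulSoup 4.

```python
from bs4 import BeautifulSoup

html = """<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">
<script>alert("x")</script>"""

soup = BeautifulSoup(html, "html.parser")

# Remove <script> elements together with their contents.
for script in soup("script"):
    script.decompose()

# Drop the four unwanted attributes; pop() with a default ignores tags that lack them.
for tag in soup.find_all(True):
    for attribute in ("class", "id", "name", "style"):
        tag.attrs.pop(attribute, None)

cleaned = str(soup)
print(cleaned)
```

The href attribute on the image survives because it is not in the removal list.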
Upvotes: 36