thumbtackthief

Reputation: 6211

Remove height and width from inline styles

I'm using BeautifulSoup to remove inline heights and widths from my elements. Solving it for images was simple:

def remove_dimension_tags(tag):
    for attribute in ["width", "height"]:
        del tag[attribute]
    return tag

But I'm not sure how to go about processing something like this:

<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">

when I would want to leave the background-color (for example) or any other style attributes other than height or width.

The only way I can think of doing it is with a regex but last time I suggested something like that the spirit of StackOverflow came out of my computer and murdered my first-born.

Upvotes: 1

Views: 1296

Answers (3)

宏杰李

Reputation: 12158

import re

import bs4

html = '''<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">'''

soup = bs4.BeautifulSoup(html, 'lxml')

A tag's attributes are stored in a dict (attrs), so you can read and modify them like any other dict:

get item:

soup.div.attrs

{'class': ['wp-caption', 'aligncenter'],
 'id': 'attachment_9565',
 'style': 'width: 2010px;background-color:red'}

set item (this keeps only the last declaration, so it works for this particular style string):

soup.div.attrs['style'] = soup.div.attrs['style'].split(';')[-1]

{'class': ['wp-caption', 'aligncenter'],
 'id': 'attachment_9565',
 'style': 'background-color:red'}

Or use a regex to pick out the declaration you want to keep:

soup.div.attrs['style'] = re.search(r'background-color:\w+', soup.div.attrs['style']).group()
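Both snippets above are tied to this one style string. As a rough sketch of how the split idea could generalize, the helper below drops declarations by property name. The name strip_declarations and its unwanted parameter are made up for illustration, and splitting on ";" is a simplification that would break on values containing a semicolon (for example a data URI inside url(...)).

```python
def strip_declarations(style, unwanted=("width", "height")):
    """Return `style` without declarations whose property name is in `unwanted`."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        # Skip empty fragments (e.g. after a trailing ';') and unwanted properties.
        if not sep or name.strip().lower() in unwanted:
            continue
        kept.append(f"{name.strip()}:{value.strip()}")
    return ";".join(kept)


print(strip_declarations("width: 2010px;background-color:red"))
# prints: background-color:red
```

With the soup from above, this would be applied as soup.div.attrs['style'] = strip_declarations(soup.div.attrs['style']).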

Upvotes: -1

Jan

Reputation: 43169

A full walk-through would be:

from bs4 import BeautifulSoup
import re

string = """
    <div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">
        <p>Some line here</p>
        <hr/>
        <p>Some other beautiful text over here</p>
    </div>
    """

# match a width or height declaration: the property name, a colon,
# everything up to the next ';', and that ';' if present
rx = re.compile(r'(?:width|height):[^;]+;?')

soup = BeautifulSoup(string, "html5lib")

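# only visit divs that actually carry a style attribute
for div in soup.find_all('div', style=True):
    # substitute on this div's own style value, not on the whole document
    div['style'] = rx.sub("", div['style'])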

As stated by others, using regular expressions on the attribute's value (rather than on the HTML markup itself) is not a problem.
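One caveat with the pattern above: it also matches inside longer property names such as max-width or line-height, leaving a stray "max-" or "line-" behind. A possible variant, offered only as a sketch, adds a negative lookbehind so the match cannot start in the middle of a longer name. The names DIMENSION_RX and strip_dimensions are made up for this example.

```python
import re

# (?<![\w-]) keeps the match from starting inside a longer property name
# such as max-width or line-height.
DIMENSION_RX = re.compile(r'(?<![\w-])(?:width|height)\s*:[^;]*;?\s*', re.IGNORECASE)


def strip_dimensions(style):
    """Remove width/height declarations from an inline style string."""
    return DIMENSION_RX.sub("", style).strip()


print(strip_dimensions("width: 2010px;background-color:red"))
# prints: background-color:red
print(strip_dimensions("max-width: 10px; height:5px; line-height: 1.2"))
# prints: max-width: 10px; line-height: 1.2
```

Inside the loop over the divs, the assignment would then read div['style'] = strip_dimensions(div['style']).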

Upvotes: 2

Zroq

Reputation: 8382

You could use regex if you want, but there is a simpler way.

Use cssutils, which parses the CSS for you:

A simple example:

from bs4 import BeautifulSoup
import cssutils

s = '<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">'

soup = BeautifulSoup(s, "html.parser")
div = soup.find("div")
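div_style = cssutils.parseStyle(div["style"])
# removeProperty does nothing for a property that is not set
for prop in ("width", "height"):
    div_style.removeProperty(prop)
div["style"] = div_style.cssText
print(div)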

Outputs:

>>><div class="wp-caption aligncenter" id="attachment_9565" style="background-color: red"></div>

Upvotes: 2
