David DeMar

Reputation: 2660

Using Beautiful Soup to convert CSS attributes to individual HTML attributes?

I'm trying to write a program that will take an HTML file and make it more email friendly. Right now all the conversion is done manually because none of the online converters do exactly what we need.

This sounded like a great opportunity to push the limits of my programming knowledge and actually code something useful, so I offered to write a program in my spare time to help automate the process.

I don't know much about HTML or CSS so I'm mostly relying on my brother (who does know HTML and CSS) to describe what changes this program needs to make, so please bear with me if I ask a stupid question. This is totally new territory for me.

Most of the changes are pretty basic -- if you see tag/attribute X then convert it to tag/attribute Y. But I've run into trouble when dealing with an HTML tag containing a style attribute. For example:

<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />

Whenever possible, I want to convert the CSS properties in the style attribute into HTML attributes (or convert the style attribute into something more email-friendly). After the conversion, it should look like this:

<img src="http://example.com/file.jpg" width="150" height="50" align="right"/>

Now I realize that not all CSS style attributes have an HTML equivalent, so right now I only want to focus on the ones that do. I whipped up a Python script that would do this conversion:

from bs4 import BeautifulSoup
import re

class Styler(object):

    img_attributes = {'float' : 'align'}

    def __init__(self, soup):
        self.soup = soup

    def format_factory(self):
        self.handle_image()

    def handle_image(self):
        tag = self.soup.find_all("img", style = re.compile('.'))
        print tag
        for i in xrange(len(tag)):
            old_attributes = tag[i]['style']
            tokens = [s for s in re.split(r'[:;]+|px', str(old_attributes)) if s]
            del tag[i]['style']
            print tokens

            for j in xrange(0, len(tokens), 2):
                if tokens[j] in Styler.img_attributes:
                    tokens[j] = Styler.img_attributes[tokens[j]]

                tag[i][tokens[j]] = tokens[j+1]

if __name__ == '__main__':
    html = """
    <body>hello</body>
    <img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
    <blockquote>my blockquote text</blockquote>
    <div style="padding-left:25px; padding-right:25px;">text here</div>
    <body>goodbye</body>
    """
    soup = BeautifulSoup(html)
    s = Styler(soup)
    s.format_factory()

This script handles my particular example just fine, but it's not very robust, and I realize it will easily break on real-world input. My question is: how can I make it more robust? As far as I can tell, Beautiful Soup doesn't have a way to change or extract individual pieces of a style attribute, which is what I'm looking to do.

Upvotes: 2

Views: 4575

Answers (2)

nueces

Reputation: 21

Instead of reinventing the wheel, use the StoneageHTML package: http://pypi.python.org/pypi/StoneageHTML

Upvotes: 2

chigby

Reputation: 341

For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.

For example:

>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'

Using this, you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful with the newline characters that show up in the cssText attribute, though. I think cssutils is designed more for formatting standalone CSS files, but it's flexible enough to mostly work for what you're doing here.

Upvotes: 11
