Reputation: 2660
I'm trying to write a program that will take an HTML file and make it more email friendly. Right now all the conversion is done manually because none of the online converters do exactly what we need.
This sounded like a great opportunity to push the limits of my programming knowledge and actually code something useful so I offered to try to write a program in my spare time to help make the process more automated.
I don't know much about HTML or CSS so I'm mostly relying on my brother (who does know HTML and CSS) to describe what changes this program needs to make, so please bear with me if I ask a stupid question. This is totally new territory for me.
Most of the changes are pretty basic -- if you see tag/attribute X then convert it to tag/attribute Y. But I've run into trouble when dealing with an HTML tag containing a style attribute. For example:
<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
Whenever possible I want to convert the style attributes into HTML attributes (or convert the style attribute to something more email friendly). So after the conversion it should look like this:
<img src="http://example.com/file.jpg" width="150" height="50" align="right"/>
Now I realize that not all CSS style attributes have an HTML equivalent, so right now I only want to focus on the ones that do. I whipped up a Python script that would do this conversion:
from bs4 import BeautifulSoup
import re

class Styler(object):
    img_attributes = {'float' : 'align'}

    def __init__(self, soup):
        self.soup = soup

    def format_factory(self):
        self.handle_image()

    def handle_image(self):
        tag = self.soup.find_all("img", style = re.compile('.'))
        print tag
        for i in xrange(len(tag)):
            old_attributes = tag[i]['style']
            tokens = [s for s in re.split(r'[:;]+|px', str(old_attributes)) if s]
            del tag[i]['style']
            print tokens
            for j in xrange(0, len(tokens), 2):
                if tokens[j] in Styler.img_attributes:
                    tokens[j] = Styler.img_attributes[tokens[j]]
                tag[i][tokens[j]] = tokens[j+1]

if __name__ == '__main__':
    html = """
<body>hello</body>
<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
<blockquote>my blockquote text</blockquote>
<div style="padding-left:25px; padding-right:25px;">text here</div>
<body>goodbye</body>
"""
    soup = BeautifulSoup(html)
    s = Styler(soup)
    s.format_factory()
Now this script handles my particular example just fine, but it's not very robust, and I realize it will easily break when put up against real-world examples. My question is, how can I make this more robust? As far as I can tell, Beautiful Soup doesn't have a way to change or extract individual pieces of a style attribute. I guess that's what I'm looking to do.
Upvotes: 2
Views: 4575
Reputation: 21
Instead of reinventing the wheel, use the StoneageHTML package: http://pypi.python.org/pypi/StoneageHTML
Upvotes: 2
Reputation: 341
For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.
For example:
>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'
So, using this you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is designed more for formatting standalone CSS files, but it's flexible enough to mostly work for what you're doing here.
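As a rough sketch of the "plug them into the HTML" step, something like the following might work once you have the property names and values out of the parsed style. It is deliberately independent of any parser: it only takes (name, value) pairs. The names `CSS_TO_HTML` and `split_style` are made up for illustration, and the mapping table is an assumption based on the img example in the question, not a complete list of what email clients accept.

```python
import re

# Hypothetical mapping from CSS property to HTML attribute name.
# Only properties listed here are converted; everything else stays as CSS.
CSS_TO_HTML = {'width': 'width', 'height': 'height', 'float': 'align'}

# Matches plain pixel lengths such as "150px" and captures the number.
PX_VALUE = re.compile(r'^(\d+)px$')


def split_style(properties):
    """Split (name, value) pairs into HTML attributes and leftover CSS.

    Returns a dict of HTML attributes and a one-line style string holding
    every declaration that could not be converted safely.
    """
    attrs = {}
    leftover = []
    for name, value in properties:
        name = name.strip().lower()
        value = value.strip()
        target = CSS_TO_HTML.get(name)
        if target is None:
            # No HTML equivalent in the table: keep it in the style string.
            leftover.append((name, value))
            continue
        if name in ('width', 'height'):
            match = PX_VALUE.match(value)
            if match is None:
                # e.g. "10em" or "auto": not a plain pixel value, keep as CSS.
                leftover.append((name, value))
                continue
            value = match.group(1)
        elif name == 'float' and value not in ('left', 'right'):
            # "float: none" has no sensible align value, keep as CSS.
            leftover.append((name, value))
            continue
        attrs[target] = value
    # Join with '; ' instead of newlines so the result fits in an attribute.
    style = '; '.join('%s: %s' % pair for pair in leftover)
    return attrs, style


pairs = [('width', '150px'), ('height', '50px'),
         ('float', 'right'), ('padding-left', '25px')]
attrs, style = split_style(pairs)
# attrs is {'width': '150', 'height': '50', 'align': 'right'}
# style is 'padding-left: 25px'
```

You would then set each entry of `attrs` on the tag and either write `style` back to the tag's style attribute or delete the attribute if the string is empty. Building the leftover string yourself also sidesteps the newlines in cssText.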
Upvotes: 11