Python: HTML generation performance improvement

Question

I'm currently supporting a legacy Python application that generates all of its HTML by creating individual tag objects.

We have a parent TAG class

class TAG(object):
    def __init__(self, tag="TAG", contents=None, **attributes):
        self.tag = tag
        self.contents = contents
        self.attributes = attributes

So every other tag inherits from TAG

class H1(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'H1', contents, **attributes)
class H2(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'H2', contents, **attributes)

The main TAG class has a to_string method that's something along the lines of

def to_string(self):
    yield '<{}'.format(self.tag)
    for (a, v) in self.attr_g():
        yield ' {}="{}"'.format(a, v)
    if self.NO_CONTENTS:
        yield '/>'
    else:
        yield '>'
        for c in self.contents:
            if isinstance(c, TAG):
                for i in c.str_g():
                    yield i
            else:
                yield c
        yield '</{}>'.format(self.tag)

We basically write out the result of the to_string method.

The issue arises on pages where so many TAG objects are generated that rendering takes a noticeable performance hit.

Are there any quick wins that would make it perform better?

mattbasta · Accepted Answer

Preface: This is a terrible way to generate HTML, but if you're going to do it, you might as well do it the best way possible.

One thing that Python is exceptionally good at is string formatting. If you're concatenating lots of tiny strings, you're killing your performance from the get-go. Your to_string() method should look more like this:

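def to_string(self):
    # Build the whole element in a single formatting operation.
    return """<{tag}{attributes}>{content}</{tag}>""".format(
        tag=self.tag,
        # Each attribute carries its own leading space.
        attributes=''.join(' %s="%s"' % (attr, val) for
                           attr, val in self.attributes.items()),
        content=''.join(
            # On Python 2, test against basestring instead of str.
            (n if isinstance(n, str) else n.to_string()) for
            n in self.contents))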

Take note of a few things that I did there:

This is Python, not Java. Stack frames are expensive, so minimize function and method calls.
If you don't need a function to abstract a property, don't write one. For example, you don't need attr_g (except maybe to do escaping, but you can do that when you're putting the data in instead).
Do all of your string formatting on the same string! Running a separate formatting operation for each tiny string and then yielding it to be concatenated is a huge waste.
Don't use a generator for this. Every time you yield, the generator's frame is suspended and later resumed, which inherently slows things down when each fragment is this small.
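To check these points against your own pages, a rough comparison can be sketched as follows. This is a minimal Python 3 sketch, not the application's real code: GenTag, FmtTag and build are hypothetical stand-ins for the generator-based version, the single-format version and a page builder. Timings vary by machine and Python version, so treat the numbers as a hint and measure on real pages before committing to a rewrite.

```python
import timeit


class GenTag(object):
    """Hypothetical stand-in for the generator-based TAG from the question."""

    def __init__(self, tag, contents=(), **attributes):
        self.tag = tag
        self.contents = contents
        self.attributes = attributes

    def to_string(self):
        # Yields many tiny fragments that the caller has to join.
        yield '<{}'.format(self.tag)
        for a, v in self.attributes.items():
            yield ' {}="{}"'.format(a, v)
        yield '>'
        for c in self.contents:
            if isinstance(c, GenTag):
                for piece in c.to_string():
                    yield piece
            else:
                yield c
        yield '</{}>'.format(self.tag)


class FmtTag(object):
    """Hypothetical stand-in for the single-format version."""

    def __init__(self, tag, contents=(), **attributes):
        self.tag = tag
        self.contents = contents
        self.attributes = attributes

    def to_string(self):
        # One formatting operation per element, no generator.
        return '<{tag}{attributes}>{content}</{tag}>'.format(
            tag=self.tag,
            attributes=''.join(
                ' %s="%s"' % (a, v) for a, v in self.attributes.items()),
            content=''.join(
                c if isinstance(c, str) else c.to_string()
                for c in self.contents))


def build(cls, rows=200):
    """Build a small table-like tree with the given tag class."""
    return cls('TABLE', [
        cls('TR', [cls('TD', [str(i)], id='cell%d' % i)])
        for i in range(rows)
    ])


gen_page = build(GenTag)
fmt_page = build(FmtTag)

# Both approaches must render identical markup before timing means anything.
gen_html = ''.join(gen_page.to_string())
fmt_html = fmt_page.to_string()
assert gen_html == fmt_html

if __name__ == '__main__':
    print('generator:',
          timeit.timeit(lambda: ''.join(gen_page.to_string()), number=200))
    print('format:   ', timeit.timeit(fmt_page.to_string, number=200))
```

The equality check matters more than the timing: it confirms the faster version is a drop-in replacement before you swap it in.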

Other pointers:

You're inheriting from object, so use the super() function.
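For reference, the question's H1 constructor rewritten with super() might look like the sketch below. TAG is trimmed down to its constructor here, and the Python 2 spelling is given in a comment because the legacy application may still need it.

```python
class TAG(object):
    def __init__(self, tag="TAG", contents=None, **attributes):
        self.tag = tag
        self.contents = contents
        self.attributes = attributes


class H1(TAG):
    def __init__(self, contents=None, **attributes):
        # Python 3 form. On Python 2, write:
        #     super(H1, self).__init__('H1', contents, **attributes)
        super().__init__('H1', contents, **attributes)


heading = H1(['Title'], id='top')
```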

Don't waste code by writing constructors to declare the tag type:

class TAG(object):
    def __init__(self, contents=None, **attributes):
        self.contents = contents
        self.attributes = attributes

class H1(TAG):
    tag = 'H1'

class H2(TAG):
    tag = 'H2'
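Putting the class-attribute layout together with the earlier suggestion to escape data on the way in, a complete version could be sketched like this. It assumes Python 3 (for html.escape). The escaping policy and the DIV class are illustrative assumptions, not part of the original application.

```python
from html import escape  # Python 3; the escaping policy here is an assumption


class TAG(object):
    tag = 'TAG'  # subclasses override this class attribute

    def __init__(self, contents=None, **attributes):
        # Escape once, on the way in, so to_string() stays a plain join.
        self.contents = [
            c if isinstance(c, TAG) else escape(str(c), quote=False)
            for c in (contents or [])
        ]
        self.attributes = {k: escape(str(v)) for k, v in attributes.items()}

    def to_string(self):
        return '<{tag}{attributes}>{content}</{tag}>'.format(
            tag=self.tag,
            attributes=''.join(
                ' %s="%s"' % item for item in self.attributes.items()),
            content=''.join(
                c.to_string() if isinstance(c, TAG) else c
                for c in self.contents))


class H1(TAG):
    tag = 'H1'


class DIV(TAG):  # hypothetical extra tag for the example
    tag = 'DIV'


page = DIV([H1(['Fish & chips'], title='a "quoted" title')], id='main')
html_out = page.to_string()
```

Each subclass is now a two-line declaration, and the only per-instance work left in the constructor is the escaping.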

You might have some success with StringIO objects if you're doing a lot of this. They'll let you build your tags and .write() them into a single buffer. You can think of them as the equivalent of a StringBuilder in .NET or Java.
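As a rough illustration of the StringIO approach, here is a minimal Python 3 sketch. render_rows is a hypothetical helper, and the cell values are assumed to be already escaped. Whether this beats a single ''.join depends on the workload, so it is worth timing both.

```python
from io import StringIO


def render_rows(rows):
    """Write each fragment straight into one in-memory buffer."""
    buf = StringIO()
    buf.write('<TABLE>')
    for row in rows:
        buf.write('<TR>')
        for cell in row:
            buf.write('<TD>%s</TD>' % cell)
        buf.write('</TR>')
    buf.write('</TABLE>')
    # getvalue() returns everything written so far as one string.
    return buf.getvalue()


table_html = render_rows([['a', 'b'], ['c', 'd']])
```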
