John Jiang
John Jiang

Reputation: 11489

Python: HTML generation performance improvement

I'm currently supporting a legacy python application that generates all the html through creating individual tag objects.

We have a parent TAG class

class TAG(object):
    def __init__(self, tag="TAG", contents=None, **attributes):
        self.tag = tag
        self.contents = contents
        self.attributes = attributes

So every other tag inherits from TAG

class H1(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'H1', contents, **attributes)
class H2(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'H2', contents, **attributes)

The main TAG class has a to_string method that's something along the lines of

def to_string(self):
    yield '<{}'.format(self.tag)
    for (a, v) in self.attr_g():
        yield ' {}="{}"'.format(a, v)
    if self.NO_CONTENTS:
        yield '/>'
    else :
        yield '>'
        for c in self.contents:
            if isinstance(c, TAG):
                for i in c.str_g():
                    yield i
            else:
                yield c
        yield '</{}>'.format(self.tag)

We basically write out the result of the to_string method.

The issue comes to pages where there's a lot of TAGS being generated and is big enough to create a performance hit.

Are there any quick wins that I can do to make it perform better?

Upvotes: 0

Views: 132

Answers (2)

Francis Avila
Francis Avila

Reputation: 31621

@mattbasta has the right idea here. However, I want to propose something a bit different: implement to_string using cElementTree.TreeBuilder. I don't know if the super-fast serialization of ElementTree will win out against overhead of creating an ElementTree.

Here is a wonky TAG class with a to_string_b() method that makes use of some micro-optimizations and uses a TreeBuilder to build the tree. (A possibly-important difference between your to_string() and TreeBuilder is that TreeBuilder will always escape output for XML, whereas yours will not.)

import xml.etree.cElementTree as ET

class TAG(object):
    def __init__(self, tag="TAG", contents=None, **attributes):
        self.tag = tag
        # this is to insure that `contents` always has a uniform
        # type.
        if contents is None:
            self.contents = []
        else:
            if isinstance(contents, basestring):
                # I suspect the calling code passes in a string as contents
                # in the common case, so this means that each character of
                # the string will be yielded one-by-one. let's avoid that by
                # wrapping in a list.
                self.contents = [contents]
            else:
                self.contents = contents
        self.attributes = attributes

    def to_string(self):
        yield '<{}'.format(self.tag)
        for (a, v) in self.attributes.items():
            yield ' {}="{}"'.format(a, v)
        if self.contents is None:
            yield '/>'
        else :
            yield '>'
            for c in self.contents:
                if isinstance(c, TAG):
                    for i in c.to_string():
                        yield i
                else:
                    yield c
            yield '</{}>'.format(self.tag)

    def to_string_b(self, builder=None):
        global isinstance, basestring
        def isbasestring(c, isinstance=isinstance, basestring=basestring):
            # some inlining
            return isinstance(c, basestring)
        if builder is None:
            iamroot = True
            builder = ET.TreeBuilder()
        else:
            iamroot = False #don't close+flush the builder
        builder.start(self.tag, self.attributes)
        if self.contents is not None:
            for c in self.contents:
                if (isbasestring(c)):
                    builder.data(c)
                else:
                    for _ in c.to_string_b(builder):
                        pass
        builder.end(self.tag)
        # this is a yield *ONLY* to preserve the interface
        # of to_string()! if you can change the calling
        # code easily, use return instead!
        if iamroot:
            yield ET.tostring(builder.close())


class H1(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'H1', contents, **attributes)
class H2(TAG):
    def __init__(self, contents=None, **attributes):
        TAG.__init__(self, 'H2', contents, **attributes)    

tree = H1(["This is some ", H2("test input", id="abcd", cls="efgh"), " and trailing text"])

print ''.join(tree.to_string())
print ''.join(tree.to_string_b())

Upvotes: 1

mattbasta
mattbasta

Reputation: 13709

Preface: This is a terrible way to generate HTML, but if you're going to do it, you'd might as well do it the best way possible.

One thing that python is exceptionally good at is string formatting. If you're concatting lots of tiny strings, you're killing your performance from the get-go. Your to_string() method should look more like this:

def to_string(self):
    return """<{tag}{attributes}>{content}</{tag}>""".format(
        tag=self.tag,
        attributes=' '.join('%s="%s"' % (attr, val) for
                            attr, val in self.attributes),
        content=''.join(
            (n if isinstance(n, basestring) else n.to_string()) for
            n in self.contents))

Take note of a few things that I did there:

  1. This is Python, not Java. Stack frames are expensive, so minimize function and method calls.
  2. If you don't need a function to abstract a property, don't do it. I.e.: you don't need attr_g (except maybe to do escaping, but you can do that when you're putting the data in instead).
  3. Do all of your string formatting on the same string! Having a single string formatting operation for a tiny string and then yielding it to be concatted is a huge waste.
  4. Don't use a generator for this. Every time you yield, you're mussing around with the instruction pointer, which is going to inherently slow things down.

Other pointers:

  • You're inheriting from object, so use the super() function.
  • Don't waste code by writing constructors to declare the tag type:

    class TAG(object):
        def __init__(self, contents=None, **attributes):
            self.contents = contents
            self.attributes = attributes
    
    class H1(TAG):
        tag = 'H1'
    
    class H2(TAG):
        tag = 'H2'
    
  • You might have some success with StringIO objects if you're doing a lot of this. They'll let you build your tags and .write() them in. You can think of them as .Net StringBuffers or Java's StringBuilders.

Upvotes: 3

Related Questions