Reputation: 11489
I'm currently supporting a legacy python application that generates all the html through creating individual tag objects.
We have a parent TAG class
class TAG(object):
def __init__(self, tag="TAG", contents=None, **attributes):
self.tag = tag
self.contents = contents
self.attributes = attributes
So every other tag inherits from TAG
class H1(TAG):
def __init__(self, contents=None, **attributes):
TAG.__init__(self, 'H1', contents, **attributes)
class H2(TAG):
def __init__(self, contents=None, **attributes):
TAG.__init__(self, 'H2', contents, **attributes)
The main TAG class has a to_string method that's something along the lines of
def to_string(self):
yield '<{}'.format(self.tag)
for (a, v) in self.attr_g():
yield ' {}="{}"'.format(a, v)
if self.NO_CONTENTS:
yield '/>'
else :
yield '>'
for c in self.contents:
if isinstance(c, TAG):
for i in c.str_g():
yield i
else:
yield c
yield '</{}>'.format(self.tag)
We basically write out the result of the to_string method.
The issue comes to pages where there's a lot of TAGS being generated and is big enough to create a performance hit.
Are there any quick wins that I can do to make it perform better?
Upvotes: 0
Views: 132
Reputation: 31621
@mattbasta has the right idea here. However, I want to propose something a bit different: implement to_string
using cElementTree.TreeBuilder
. I don't know if the super-fast serialization of ElementTree will win out against overhead of creating an ElementTree.
Here is a wonky TAG
class with a to_string_b()
method that makes use of some micro-optimizations and uses a TreeBuilder to build the tree. (A possibly-important difference between your to_string()
and TreeBuilder is that TreeBuilder will always escape output for XML, whereas yours will not.)
import xml.etree.cElementTree as ET
class TAG(object):
def __init__(self, tag="TAG", contents=None, **attributes):
self.tag = tag
# this is to insure that `contents` always has a uniform
# type.
if contents is None:
self.contents = []
else:
if isinstance(contents, basestring):
# I suspect the calling code passes in a string as contents
# in the common case, so this means that each character of
# the string will be yielded one-by-one. let's avoid that by
# wrapping in a list.
self.contents = [contents]
else:
self.contents = contents
self.attributes = attributes
def to_string(self):
yield '<{}'.format(self.tag)
for (a, v) in self.attributes.items():
yield ' {}="{}"'.format(a, v)
if self.contents is None:
yield '/>'
else :
yield '>'
for c in self.contents:
if isinstance(c, TAG):
for i in c.to_string():
yield i
else:
yield c
yield '</{}>'.format(self.tag)
def to_string_b(self, builder=None):
global isinstance, basestring
def isbasestring(c, isinstance=isinstance, basestring=basestring):
# some inlining
return isinstance(c, basestring)
if builder is None:
iamroot = True
builder = ET.TreeBuilder()
else:
iamroot = False #don't close+flush the builder
builder.start(self.tag, self.attributes)
if self.contents is not None:
for c in self.contents:
if (isbasestring(c)):
builder.data(c)
else:
for _ in c.to_string_b(builder):
pass
builder.end(self.tag)
# this is a yield *ONLY* to preserve the interface
# of to_string()! if you can change the calling
# code easily, use return instead!
if iamroot:
yield ET.tostring(builder.close())
class H1(TAG):
def __init__(self, contents=None, **attributes):
TAG.__init__(self, 'H1', contents, **attributes)
class H2(TAG):
def __init__(self, contents=None, **attributes):
TAG.__init__(self, 'H2', contents, **attributes)
tree = H1(["This is some ", H2("test input", id="abcd", cls="efgh"), " and trailing text"])
print ''.join(tree.to_string())
print ''.join(tree.to_string_b())
Upvotes: 1
Reputation: 13709
Preface: This is a terrible way to generate HTML, but if you're going to do it, you'd might as well do it the best way possible.
One thing that python is exceptionally good at is string formatting. If you're concatting lots of tiny strings, you're killing your performance from the get-go. Your to_string()
method should look more like this:
def to_string(self):
return """<{tag}{attributes}>{content}</{tag}>""".format(
tag=self.tag,
attributes=' '.join('%s="%s"' % (attr, val) for
attr, val in self.attributes),
content=''.join(
(n if isinstance(n, basestring) else n.to_string()) for
n in self.contents))
Take note of a few things that I did there:
attr_g
(except maybe to do escaping, but you can do that when you're putting the data in instead).Other pointers:
object
, so use the super()
function.Don't waste code by writing constructors to declare the tag type:
class TAG(object):
def __init__(self, contents=None, **attributes):
self.contents = contents
self.attributes = attributes
class H1(TAG):
tag = 'H1'
class H2(TAG):
tag = 'H2'
You might have some success with StringIO
objects if you're doing a lot of this. They'll let you build your tags and .write()
them in. You can think of them as .Net StringBuffer
s or Java's StringBuilder
s.
Upvotes: 3