Reputation: 210445
When I run
from xml.etree import ElementTree
tree = ElementTree.fromstring('<foo bar=""baz=""></foo>')
I get
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 11
This is due to the lack of a space between "" and baz.
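For anyone who wants to confirm where the parser gives up, here is a minimal sketch that catches the error and reads its position attribute. The helper name locate_error is hypothetical and only for illustration; ParseError.position is the (line, column) pair that ElementTree reports.

```python
from xml.etree import ElementTree


def locate_error(xml_text):
    """Return the (line, column) where parsing fails, or None if it parses."""
    try:
        ElementTree.fromstring(xml_text)
    except ElementTree.ParseError as e:
        return e.position  # (line, column) as reported by the parser
    return None


bad = '<foo bar=""baz=""></foo>'
position = locate_error(bad)
if position is not None:
    line, column = position
    print(bad)
    print(' ' * column + '^')  # caret under the offending character
```

The column is counted from zero, so the caret lands on the b of baz, right after the closing quote of the previous attribute.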
I'm encountering this problem in XML files provided to me by a third party.
Is there any way to make ElementTree be a little less pedantic about the spacing and parse it as if there were a space?
Upvotes: 4
Views: 348
Reputation: 210445
Since it sounds like a solution may not be within sight...
Until a better solution comes along, here's a hacky workaround for the next poor soul...
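import re

def xml_fixup(s):  # give it the XML as a string
    flags = re.DOTALL
    pat_quotes = '\"[^\"]*\"|\'[^\']*\''  # a double- or single-quoted attribute value
    # a quoted value directly followed by anything other than '>' or whitespace
    re_quotes = re.compile('(%s)([^>\\s])' % pat_quotes, flags)  # TODO: cache
    # either a run of text, or a tag split into '<', its body, and '>'
    re_pieces = re.compile('([^<]+)|(<)((?:[^\"\'>]+|%s)*)(>)' % pat_quotes, flags)  # TODO: cache
    pieces = re_pieces.findall(s)
    # insert the missing space inside tag bodies only; text runs pass through unchanged
    return s[:0].join(map(lambda m: m[0] or m[1] + re_quotes.sub('\\1 \\2', m[2]) + m[3], pieces))

print(xml_fixup('<foo bar=""baz=""></foo>'))  # <foo bar="" baz=""></foo>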
Brownie points if you spot bugs in this!
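One bug, as far as I can tell: the regex knows nothing about comments or CDATA, so a comment containing a lone apostrophe (for example <!-- don't -->) is not recognized as a tag and its leading < gets dropped. A rough way to limit the damage is to try the strict parse first and only run the fix-up when that fails. The sketch below assumes that approach; parse_lenient is a hypothetical helper name, and xml_fixup is the same regex logic as above with the patterns compiled once at module level.

```python
import re
from xml.etree import ElementTree

_QUOTED = r'"[^"]*"|\'[^\']*\''  # a double- or single-quoted attribute value
# A quoted value directly followed by anything other than '>' or whitespace.
_RE_QUOTES = re.compile(r'(%s)([^>\s])' % _QUOTED)
# Either a run of text, or a tag split into '<', its body, and '>'.
_RE_PIECES = re.compile(r'([^<]+)|(<)((?:[^"\'>]+|%s)*)(>)' % _QUOTED)


def xml_fixup(s):
    """Insert a space after any quoted value that runs into the next token."""
    return ''.join(
        text or lt + _RE_QUOTES.sub(r'\1 \2', body) + gt
        for text, lt, body, gt in _RE_PIECES.findall(s)
    )


def parse_lenient(s):
    """Hypothetical helper: strict parse first, regex fix-up only as a fallback."""
    try:
        return ElementTree.fromstring(s)
    except ElementTree.ParseError:
        # Still raises ParseError if the fix-up does not produce valid XML.
        return ElementTree.fromstring(xml_fixup(s))


root = parse_lenient('<foo bar=""baz=""></foo>')
print(root.tag, root.attrib)  # foo {'bar': '', 'baz': ''}
```

Well-formed input never touches the regex this way, so the fix-up can only affect documents that would have failed to parse anyway.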
Upvotes: 2