Paddy3118

Reputation: 4772

Python ElementTree does not like colon in name of processing instruction

The following code:

import xml.etree.ElementTree as ET

xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''

root = ET.fromstring(xml)

xml2 = xml.replace('LazyComment ', 'LazyComment:')
print(xml2)
try:
    root2 = ET.fromstring(xml2)
except ET.ParseError:
    print("\nERROR in xml2!!!\n")

xml3 = xml2.replace('testCaseConfig', 'testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/"', 1)
print(xml3)
try:
    root3 = ET.fromstring(xml3)
except ET.ParseError:
    print("\nERROR in xml3!!!\n")
    raise

Gives this output:

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment:Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>

ERROR in xml2!!!

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
    <?LazyComment:Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>

ERROR in xml3!!!

Traceback (most recent call last):
  File "C:\Users\Paddy3118\Google Drive\Code\elementtree_error.py", line 30, in <module>
    root3 = ET.fromstring(xml3)
  File "C:\Anaconda3\envs\Py3.5\lib\xml\etree\ElementTree.py", line 1333, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 17

I searched and found a related question that pointed to other resources, which I read.

It seems that the '?' makes it a processing instruction, whose target name can include colons. Without the '?', a colon in a name indicates a namespace, and one of the answers stated that declaring the namespace should make things work.

Combining '?' and ':', though, causes issues with ElementTree.

I am given XML files of this type, which other tools parse without complaint, and I want to process them myself using Python. Any ideas?

Thanks.

Upvotes: 2

Views: 2918

Answers (2)

Parfait

Reputation: 107747

According to the W3C Extensible Markup Language 1.0 Specification, under Common Syntactic Constructs:

The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.

And further, in the W3C XPath 1.0 Recommendation, on processing instruction nodes:

A processing instruction has an expanded-name: the local part is the processing instruction's target; the namespace URI is null.

Altogether, <?LazyComment:Blah de blah/?> is an invalid processing instruction, because a colon is used to reference a namespace URI, and for processing instructions that part is null or empty. Python's XML processor therefore complains that a document containing such an instruction is not well-formed.
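To see that the failure is tied to namespace processing, here is a minimal sketch that feeds the same document to the underlying expat parser twice: once with no namespace separator and once with one. It assumes, as far as I can tell from the ElementTree source, that ElementTree creates its expat parser with "}" as the namespace separator. The helper name parses and the shortened DOC string are only for illustration.

```python
import xml.parsers.expat as expat

# Shortened version of the document from the question (illustrative).
DOC = '''<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment:Blah de blah/?>
    <testCase runLimit="420" name="d1/n1"/>
</testCaseConfig>'''


def parses(text, namespace_separator=None):
    """Return True if expat accepts text, False if it raises ExpatError."""
    parser = expat.ParserCreate(namespace_separator=namespace_separator)
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except expat.ExpatError:
        return False
    return True


plain_ok = parses(DOC)    # namespace processing switched off
ns_ok = parses(DOC, "}")  # namespace processing on (assumed ElementTree setting)
print("plain:", plain_ok, "namespace-aware:", ns_ok)
```

If the two results differ, the colon is being accepted as an ordinary name character by the plain XML 1.0 parser and rejected only once namespace rules are applied, which matches the two passages quoted above.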

Also, reconsider the tools that generate such processing instructions, since they are not producing valid XML documents. Possibly they treat XML files as plain text (similar to the way you were able to edit the string representation of the XML, whereas you would not have been able to append such an instruction using etree).
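If the files cannot be fixed at the source, one possible workaround is to do the same kind of text-level edit before parsing: rewrite the colon in each processing instruction target, then hand the result to ElementTree. This is only a sketch. The function name parse_tolerant and the choice of underscore as the replacement are my own assumptions, and the regex is naive: it would also rewrite text that merely looks like a processing instruction inside a comment or CDATA section.

```python
import re
import xml.etree.ElementTree as ET

# Matches "<?" followed by a PI target that may contain colons.
# The XML declaration ("<?xml ...") also matches, but it has no colon
# in its target, so it comes out unchanged.
PI_TARGET = re.compile(r'<\?([A-Za-z_:][\w.:-]*)')


def parse_tolerant(text):
    """Replace colons in PI targets with underscores, then parse."""
    # Hypothetical convention: "LazyComment:Blah" becomes "LazyComment_Blah".
    cleaned = PI_TARGET.sub(
        lambda m: '<?' + m.group(1).replace(':', '_'), text)
    return ET.fromstring(cleaned)


sample = '''<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment:Blah de blah/?>
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''

root = parse_tolerant(sample)
names = [case.get("name") for case in root.iter("testCase")]
print(names)
```

Since the default ElementTree parser does not keep processing instructions in the tree anyway, renaming the target should not lose anything you could otherwise have read back, but that is worth checking against your own files.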

Upvotes: 2

mowcow

Reputation: 81

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
    <?LazyComment:Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">

This is invalid XML: you can't have attributes in a closing tag. The last line should be just </testCaseConfig>.

Also, comments are written like this:

<!-- this is a comment -->

Upvotes: 0
