Reputation: 125
I'm trying to check validity of XML files (against DTDs, entities, Processing instructions, namespaces) in Python 3.4.
Looking at Python docs the default underlying parser for three Python XML modules pyexpat, ELementTree and SAX is expat. On Pyexpat page (https://docs.python.org/3.4/library/pyexpat.html?highlight=pyexpat#module-xml.parsers.expat) says that non-validating version of expat parser is used: "The xml.parsers.expat module is a Python interface to the Expat non-validating XML parser." Yet at the same time when you look into SAX documentation in Python you see all these handler functions for enabling DTD validation etc. How the heck do you make them work?
However, according to this post Parsing XML Entity with python xml.sax SAX can validate. Obviously with expat as a parser.
I have reused the code from this post and can't get it to work I get error saying expat does not support validation: "File "/usr/lib/python3.4/xml/sax/expatreader.py", line 149, in setFeature "expat does not support validation") xml.sax._exceptions.SAXNotSupportedException: expat does not support validation". In the post Python 2.5 was used, so maybe SAX has changed since then...
This is the code:
import xml.sax
from xml.sax import handler, make_parser, parse
import os
import collections
class SaxParser():
# initializer with directory part as argument
def __init__(self, dir_path):
self.dir_path = dir_path
def test_each_file(self, file_path):
# ensure full file name is shown
rev = file_path[::-1] # reverse string file_path to access position of "/"
file = file_path[-rev.index("/"):]
try:
f = open(file_path, 'r', encoding="ISO-8859-1") # same as "latin-1" encoding
# see this for enabling validation:
# https://stackoverflow.com/questions/6349513/parsing-xml-entity-with-python-xml-sax
parser = make_parser() # default parser is expat
parser.setContentHandler(handler.ContentHandler())
parser.setFeature(handler.feature_namespaces,True)
parser.setFeature(handler.feature_validation,True)
parser.setFeature(handler.feature_external_ges, True)
parser.parse(f)
f.close()
return (file, "OK")
except xml.sax.SAXParseException as PE:
column = PE.getColumnNumber()
line = PE.getLineNumber()
msg = PE.getMessage()
value = msg + " " + str(line) + " " + str(column)
return (file, value)
except ValueError:
return (file, "ValueError. DTD uri not found.") # that can happen
def test_directory_sax(self, dir_path):
tuples = []
for ind, file in enumerate(os.listdir(dir_path), 1):
if file.endswith('.xml'):
tuples.append(self.test_each_file(dir_path + file))
# convert into dict and sort it by key (file number)
dict_of_errors = dict(tuples)
dict_of_errors = collections.OrderedDict(sorted(dict_of_errors.items()))
return dict_of_errors
# ========================================================================
# INVOKE TESTS FOR SINGLE SPECIFIED DIRECTORY THAT CONTAINS TEST FILES
# ========================================================================
path = # path to directory where xml file is. - not the filepath!
single_sax = SaxParser(path)
print('============================================================')
print('TEST FOR SAX parser FOR DIRECTORY ' + path)
print('============================================================\n')
print(single_sax.test_directory_sax(path))
and test xml file (should produce validation error):
<!DOCTYPE root [
<!ATTLIST root
id2 ID "x23"
>
]>
<!-- an ID attribute must have a declared default
of #IMPLIED or #REQUIRED
-->
<root/>
How do I check validity? For either one of three XML modules? A simple example would do.
Thanks.
Upvotes: 1
Views: 1538
Reputation: 16790
If you look into the source file, you'll see that the xml.sax.handler.feature_validation
is not really doing anything but raising this exception:
def setFeature(self, name, state):
# ...
elif name == feature_validation:
if state:
raise SAXNotSupportedException(
"expat does not support validation")
#...
I would suggest using lxml
to do this. An example would be like this:
from lxml import etree
from cStringIO import StringIO
# from io import StringIO (py3)
f = StringIO('<!ATTLIST root id2 ID "x23">')
dtd = etree.DTD(f)
root = etree.XML('<root/>')
print(dtd.validate(root))
print(dtd.error_log.filter_from_errors()[0])
Upvotes: 1