Reputation: 583
I'm working on an XML parser that checks well-formedness. According to the XML spec, such an "XML processor" is required to process the DTD in order to collect entity declarations and attribute-list declarations (i.e., to build a symbol table for resolving references, normalizing attribute values, and supplying default attribute values). Does this imply passing the entire DTD on to the application if I know the application is itself going to be outputting XML?
If not, what is the standard best practice for preserving the DTD in a fully processed XML document? My instinct is either to pass no DTD along with an XML declaration that says standalone="no", or to pass on a stripped-down DTD that keeps nothing but its name and the declarations of the external entities actually referenced in the document.
Upvotes: 1
Views: 312
Reputation: 8058
No, processing an external DTD does not necessarily require incorporating the full contents of that DTD into your output. Among other things, the output isn't always the same kind of document as the input...
However, this does mean that you have to make a decision about how to handle entity references and default attribute values. One approach (a) is simply to expand them and pass their contents to the output document. The other would be to ensure that the output document either (b) includes at least the declarations for those pieces of information in its internal DTD or (c) references an external DTD which provides those definitions (possibly the same one the source document did, if the output document is of a type compatible with that DTD).
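As a rough illustration of option (b), here is a minimal Python sketch that rebuilds an internal subset containing only the entity declarations the output still needs. Everything in it is an assumption made for the example: `declared_entities` stands in for the symbol table your parser built, `minimal_doctype` is a hypothetical helper, and the regex scan is a simplification that ignores CDATA sections, comments, and attribute-list declarations.

```python
import re

# Hypothetical symbol table: entity name -> literal value exactly as it was
# written between the quotes in the source DTD.
declared_entities = {
    "co": "Acme Corp",
    "legal": "&co; reserves all rights.",
    "unused": "never referenced",
}

# The five entities every XML parser predefines; they never need a declaration.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Approximation of a general entity reference (&name;). Numeric character
# references such as &#65; do not match because '#' cannot start a name.
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.\-:]*);")


def minimal_doctype(root_name, body, entities):
    """Return a DOCTYPE declaring only the entities reachable from body."""
    needed = []
    pending = ENTITY_REF.findall(body)
    while pending:
        name = pending.pop(0)
        if name in PREDEFINED or name not in entities or name in needed:
            continue
        needed.append(name)
        # An entity value may itself reference other entities.
        pending.extend(ENTITY_REF.findall(entities[name]))
    if not needed:
        return f"<!DOCTYPE {root_name}>"
    lines = []
    for name in needed:
        literal = entities[name]
        # A literal cannot contain the quote character it was delimited with,
        # so pick whichever delimiter does not occur in it.
        quote = "'" if '"' in literal else '"'
        lines.append(f"  <!ENTITY {name} {quote}{literal}{quote}>")
    return f"<!DOCTYPE {root_name} [\n" + "\n".join(lines) + "\n]>"


body = "<doc>&legal; &amp; &#169;</doc>"
doctype = minimal_doctype("doc", body, declared_entities)
print(doctype + "\n" + body)
```

Here `legal` is kept because the body references it, `co` is kept because `legal` depends on it, and `unused` is dropped.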
Option (a), expanding everything so you're no longer dependent upon the DTD for defaults and macros, is actually the most common solution for general-purpose XML handling. If your tool is working with a specific set of DTDs, option (c) would be an appropriate answer.
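To make option (a) concrete, here is a small Python sketch using the standard library's `xml.etree.ElementTree`, which is backed by expat. It assumes a document whose entity and default attribute are declared in the internal subset (ElementTree does not fetch external DTD subsets); the sample document and the helper name `expand_and_serialize` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample input: one internal entity and one defaulted attribute, both
# declared in the internal DTD subset.
SOURCE = """<!DOCTYPE doc [
  <!ENTITY co "Acme Corp">
  <!ATTLIST doc lang CDATA "en">
]>
<doc>Made by &co;</doc>"""


def expand_and_serialize(xml_text):
    """Parse, letting the parser expand entities and apply defaults,
    then serialize the tree without any DOCTYPE."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


flattened = expand_and_serialize(SOURCE)
print(flattened)
```

The serialized result carries the expanded text and the defaulted attribute as ordinary content, so whoever consumes it no longer needs the DTD at all.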
Note that similar answers apply for XML Schemas. Also note that DTDs, because they are not really compatible with XML Namespaces, are on the edge of being defunct; namespaces are just too darned useful for serious XML processing. All modern XML parsers should support Schemas; I would recommend DTDs these days only if you absolutely require backward compatibility with the earliest generations of XML code. (The one thing DTDs do that schemas don't is Parsed Entities... but realistically, those are used EXTREMELY rarely in anything but hand-constructed documents.)
Numeric character references (such as &#65;) and the five predefined entity references (&amp; and &lt; most notably) are built into the XML language and parsers, so you don't need DTD handling to support those.
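A quick way to see this for yourself, sketched with Python's standard `xml.etree.ElementTree` (the sample string is just an illustration): the document below has no DTD at all, yet every reference resolves.

```python
import xml.etree.ElementTree as ET

# No DOCTYPE anywhere: the five predefined entities plus a decimal and a
# hexadecimal character reference are resolved by the parser itself.
root = ET.fromstring("<p>&amp; &lt; &gt; &apos; &quot; &#65; &#x42;</p>")
text = root.text
print(text)
```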
.....
By the way: Why the heck are you rewriting an XML parser from scratch? Unless you're specifically doing research in parser optimization or something of that sort, or are doing this as a class assignment, there's no reason not to use one of the many off-the-shelf parsers; at this point I think they exist in just about every widely available programming language, and they're likely to have put a lot more work into optimization and handling the subtleties of XML than you have or will.
If you really do need to reinvent this particular wheel, I HIGHLY recommend spending some time with The Annotated XML Specification. Tim Bray did a WONDERFUL job of going through the XML 1.0 REC and explaining exactly what it all means and why some of the less-obvious decisions were made the way they were. Unfortunately, that required enough effort -- and enough inside knowledge of the discussions in the working group -- that nobody has been willing to redo it for XML 1.1 or for any of the other W3C specs.
Upvotes: 1