Marshmellow1328

Reputation: 1265

Horrible Performance Parsing XHTML File with Doctype as XML Document

When I parse this XHTML file as XML, parsing takes approximately 2 minutes, even though the file is very simple. If I remove the DOCTYPE declaration, it parses almost instantly. What is causing this file to take so long to parse?

Java Example

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );

XHTML Example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ex="http://www.example.com/schema/v1_0_0">
    <head><title>Test</title></head>
    <body>
        <h1>Test</h1>
        <p>Hello, World!</p>
        <p><ex:test>Text</ex:test></p>
    </body>
</html>

Thanks

Edit: Solution

Based on the explanations in the answers of why this was happening, I fixed the problem with these basic steps:

  1. Downloaded the DTD-related files to a src/main/resources folder
  2. Created a custom EntityResolver to read these files from the classpath
  3. Told my DocumentBuilder to use my new EntityResolver

I referenced this SO answer in doing so: how to validate XML using java?

New EntityResolver

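import java.io.IOException;
import java.io.InputStream;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class LocalXhtmlDtdEntityResolver implements EntityResolver {

    /* (non-Javadoc)
     * @see org.xml.sax.EntityResolver#resolveEntity(java.lang.String, java.lang.String)
     */
    @Override
    public InputSource resolveEntity( String publicId, String systemId )
            throws SAXException, IOException {
        // Use only the last path segment of the system ID, e.g. "xhtml1-transitional.dtd"
        String fileName = systemId.substring( systemId.lastIndexOf( "/" ) + 1 );
        InputStream stream = getClass().getClassLoader().getResourceAsStream( fileName );
        if ( stream == null ) {
            // No local copy on the classpath: returning null tells the parser
            // to fall back to its default handling of the system ID
            return null;
        }
        return new InputSource( stream );
    }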

}

How to use new EntityResolver:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
bob.setEntityResolver( new LocalXhtmlDtdEntityResolver() );
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );
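
A possible alternative, sketched here as an assumption rather than something taken from the answers: if the document uses none of the DTD's named entities (such as &nbsp;), the parser could be told not to load the external DTD at all. This relies on the Xerces feature http://apache.org/xml/features/nonvalidating/load-external-dtd, which the JDK's bundled parser recognises; other parser implementations may reject it. The class name, the demo() helper and the unreachable invalid.invalid URL are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NoExternalDtdExample {

    // Xerces-specific feature name; an assumption about the parser in use
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static Document parseWithoutDtd( String xml ) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware( true );
        // Do not fetch the external DTD named in the DOCTYPE
        dbf.setFeature( LOAD_EXTERNAL_DTD, false );
        DocumentBuilder bob = dbf.newDocumentBuilder();
        return bob.parse( new InputSource( new StringReader( xml ) ) );
    }

    // Parses a small XHTML document whose DTD URL cannot be reached
    // and returns the local name of its root element
    static String demo() throws Exception {
        String xml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://invalid.invalid/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>Hello, World!</p></body></html>";
        return parseWithoutDtd( xml ).getDocumentElement().getLocalName();
    }

    public static void main( String[] args ) throws Exception {
        System.out.println( demo() );
    }
}
```

The trade-off is that entity references declared only in the DTD are no longer defined, so the local EntityResolver above remains the safer choice for general XHTML input.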

Upvotes: 3

Views: 2603

Answers (2)

Michael Kay

Reputation: 163322

Actually, you're lucky you got the documents at all. W3C is deliberately unresponsive to requests for these documents because they can't handle the volume of requests. You need to provide the parser with a local copy.

The usual way to do this in the Java world is to use the Apache/OASIS catalog resolvers.
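
As a rough sketch of the catalog approach, assuming Java 11 or later and the JDK's built-in javax.xml.catalog API rather than the Apache resolver library: a catalog file maps a public identifier to a local copy, and the resulting CatalogResolver is handed to the parser as its EntityResolver. The temporary files, the demo.dtd name, the -//EXAMPLE//DTD Demo//EN public ID and the demo() helper are all made up for illustration; a real setup would map the XHTML public IDs to downloaded copies of the W3C files.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;

import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CatalogResolverExample {

    static Document parseWithCatalog( String xml, Path catalogFile ) throws Exception {
        // Build a resolver from the catalog file, using default catalog settings
        CatalogResolver resolver = CatalogManager.catalogResolver(
                CatalogFeatures.defaults(), catalogFile.toUri() );
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware( true );
        DocumentBuilder bob = dbf.newDocumentBuilder();
        // CatalogResolver implements org.xml.sax.EntityResolver
        bob.setEntityResolver( resolver );
        return bob.parse( new InputSource( new StringReader( xml ) ) );
    }

    // Writes a tiny DTD and catalog to a temp directory, then parses a document
    // whose system ID is unreachable, so the DTD can only come from the catalog
    static String demo() throws Exception {
        Path dir = Files.createTempDirectory( "catalog-demo" );
        Files.writeString( dir.resolve( "demo.dtd" ),
                "<!ENTITY greeting \"Hello, World!\">" );
        Path catalog = dir.resolve( "catalog.xml" );
        Files.writeString( catalog,
                "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">"
                + "<public publicId=\"-//EXAMPLE//DTD Demo//EN\" uri=\"demo.dtd\"/>"
                + "</catalog>" );
        String xml =
                "<!DOCTYPE p PUBLIC \"-//EXAMPLE//DTD Demo//EN\" "
                + "\"http://invalid.invalid/demo.dtd\">"
                + "<p>&greeting;</p>";
        return parseWithCatalog( xml, catalog ).getDocumentElement().getTextContent();
    }

    public static void main( String[] args ) throws Exception {
        System.out.println( demo() );
    }
}
```

With the default catalog settings, an identifier that has no catalog entry is treated as an error rather than fetched from the network, which makes missing local copies show up immediately.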

The latest version of Saxon has built-in knowledge of these commonly-used DTDs and entity files, and if you allow Saxon to supply your XML parser it will automatically be configured to use the local copies. You can probably take advantage of this even if you're not using XSLT or XQuery to process the data: just create a Saxon Configuration object and call its getSourceParser() method to get your XMLReader.

(Perhaps this would be a good time to wean yourself off DOM as well. Of all the many choices for processing XML in Java, DOM is probably the worst. If you must use a low-level navigational API, choose a decent one like JDOM or XOM.)

Upvotes: 3

Charlie

Reputation: 7349

Java is downloading the specified DTD and its included files in order to validate that your XHTML file obeys the specified DTD. Using Charles proxy, I recorded the following requests, each taking the specified amount of time to load:

Upvotes: 3
