Reputation: 1265
When I parse this XHTML file as XML, it takes approximately 2 minutes, even though the file is very simple. I have found that if I remove the doctype declaration, it parses almost instantly. What is causing this file to take so long to parse?
Java Example
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );
XHTML Example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:ex="http://www.example.com/schema/v1_0_0">
<head><title>Test</title></head>
<body>
<h1>Test</h1>
<p>Hello, World!</p>
<p><ex:test>Text</ex:test></p>
</body>
</html>
Thanks
Edit: Solution
To fix the problem, based on the explanations of why it was happening in the first place, I put local copies of the DTD and the files it references on the classpath and wrote an EntityResolver that serves those copies to the parser.
I referenced this SO answer in doing so: how to validate XML using java?
New EntityResolver
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class LocalXhtmlDtdEntityResolver implements EntityResolver {
    /* (non-Javadoc)
     * @see org.xml.sax.EntityResolver#resolveEntity(java.lang.String, java.lang.String)
     */
    @Override
    public InputSource resolveEntity( String publicId, String systemId )
            throws SAXException, IOException {
        // Look up the DTD or entity file on the classpath by its bare file name.
        String fileName = systemId.substring( systemId.lastIndexOf( "/" ) + 1 );
        InputStream local = getClass().getClassLoader().getResourceAsStream( fileName );
        if ( local == null ) {
            // No local copy found: returning null lets the parser use its default lookup.
            return null;
        }
        return new InputSource( local );
    }
}
How to use new EntityResolver:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
bob.setEntityResolver( new LocalXhtmlDtdEntityResolver() );
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );
Upvotes: 3
Views: 2603
Reputation: 163322
Actually, you're lucky you got the documents at all. W3C is deliberately unresponsive to requests for these documents because they can't handle the volume of requests. You need to provide the parser with a local copy.
The usual way to do this in the Java world is with the Apache/OASIS catalog resolvers.
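As a rough sketch of the catalog approach: the example below uses the javax.xml.catalog API that ships with the JDK since Java 9, not the Apache resolver library itself, so treat it as a stand-in for the same idea. The class name CatalogParseSketch, the method parseWithCatalog, and the catalog and DTD file names are made up for illustration. It assumes an OASIS catalog file that maps the public ID to a local copy of the DTD sitting next to it.

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class CatalogParseSketch {

    /**
     * Parses xmlFile, resolving DTDs and entity files through an OASIS catalog.
     *
     * Assumed catalog content (hypothetical file names, uri is relative to the catalog):
     *   <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
     *     <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
     *             uri="xhtml1-transitional.dtd"/>
     *   </catalog>
     */
    public static Document parseWithCatalog(File xmlFile, URI catalogUri)
            throws ParserConfigurationException, SAXException, IOException {
        // CatalogResolver implements org.xml.sax.EntityResolver, so it plugs in directly.
        CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogUri);

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(xmlFile);
    }
}
```

With the default catalog features, an identifier that has no catalog entry is reported as an error instead of being fetched, so the entity files that the XHTML DTD pulls in (xhtml-lat1.ent and so on) would each need their own entry and local copy.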
The latest version of Saxon has built-in knowledge of these commonly-used DTDs and entity files, and if you allow Saxon to supply your XML parser it will automatically be configured to use the local copies. You can probably take advantage of this even if you're not using XSLT or XQuery to process the data: just create a Saxon Configuration object and call its getSourceParser() method to get your XMLReader.
(Perhaps this would be a good time to wean yourself off DOM as well. Of all the many choices for processing XML in Java, DOM is probably the worst. If you must use a low-level navigational API, choose a decent one like JDOM or XOM.)
Upvotes: 3
Reputation: 7349
Java is downloading the specified DTD and its included files in order to validate that your XHTML file obeys the specified DTD. Using Charles proxy, I recorded the following requests, taking the specified amounts of time to load:
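The same lookups can also be observed from inside the JVM, without a proxy. The sketch below is only an illustration: DtdRequestLogger and requestedEntities are hypothetical names. It installs an EntityResolver that records every system ID the parser asks for and hands back an empty stand-in, so nothing is actually fetched over the network.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DtdRequestLogger {

    /** Parses the given XML text and returns the system IDs the parser tried to load. */
    public static List<String> requestedEntities(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        List<String> requested = new ArrayList<>();

        // Same factory setup as in the question: namespace aware, validation left off.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();

        builder.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            // Empty stand-in for the external resource, so no download happens.
            return new InputSource(new StringReader(""));
        });

        builder.parse(new InputSource(new StringReader(xml)));
        return requested;
    }
}
```

Run against the XHTML from the question, the returned list contains the xhtml1-transitional.dtd URL. Because the stand-in DTD is empty, the entity files that the real DTD includes are never requested in this sketch; with the real DTD each of them would trigger a further lookup.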
Upvotes: 3