Reputation: 93
Followup question to Big XML File:
First, thanks a lot for your answers. After… what am I doing wrong? This is my class, which uses SAX:
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXParserXML extends DefaultHandler {
    public static void ParcourXML() {
        DefaultHandler handler = new SAXParserXML();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            String URI = "dblp.xml";
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse(URI, handler);
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }

    // Both callbacks are empty: the handler does nothing with the parsed events.
    public void startElement(String namespaceURI, String simpleName, String qualifiedName, Attributes attrs) throws SAXException {
    }

    public void endElement(String namespaceURI, String simpleName, String qualifiedName) throws SAXException {
    }
}
As you can see, I do nothing with the XML file, yet it gives this error:
java.lang.OutOfMemoryError: Java heap space
at com.sun.org.apache.xerces.internal.util.XMLStringBuffer.append(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.refresh(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.invokeListeners(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(Unknown Source)
at SAXParserXML.ParcourXML(SAXParserXML.java:30)
at Main.main(Main.java:28)
I also tried StAX… the same error… what can I do? I also increased the Java heap size to 1260M:
java -Xmx1260M SAXParserXML
The XML file has this form:
<dblp>
<incollection>
<author>... </author>
....
<author>... </author>
# other tags; I'm only interested in <author> #
...
</incollection>
<incollection>
# the same thing#
</incollection>
....
</dblp>
You can find the original file: http://dblp.uni-trier.de/xml/
Thanks
Upvotes: 5
Views: 1782
Reputation: 116502
It sounds like one of the text segments (or a CDATA section, processing instruction, or comment) in the XML file is very long, and the parser does not split it into multiple segments. Or it could be that the parser fails to parse the DOCTYPE declaration properly: if so, it might try to read all of the XML content as if it were part of the DTD subset.
But that's just speculation. You mentioned that you have tried StAX: which implementation? JDK 1.6 comes with Sun's SJSXP. But you could also try Woodstox (http://woodstox.codehaus.org), which often handles things in a somewhat more robust way. So if you are not using Woodstox, you could see what happens with it. It does split text segments into smaller chunks unless you force text coalescing (which is not the default).
Oh, and just in case you were testing with the StAX reference implementation (http://stax.codehaus.org): it is unfortunately known to be very buggy, so that could cause problems. Both SJSXP and Woodstox are much better choices for StAX.
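If you want to test the "one very long text segment" theory, a rough sketch like the following might help. It uses only the standard javax.xml.stream API; the class name ChunkedTextProbe and the method longestTextChunk are made up for illustration. Coalescing is left off, so the implementation is free to deliver long text in several pieces, and the method reports the largest single piece it saw.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of any library.
final class ChunkedTextProbe {

    // Returns the length of the longest single text event in the document.
    static int longestTextChunk(Reader xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Coalescing off: the parser may report long text in several events.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        int longest = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    longest = Math.max(longest, reader.getTextLength());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return longest;
    }
}
```

If this still runs out of memory on the real file, the problem is probably not a single oversized text node.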
Upvotes: 0
Reputation: 118631
Well, given:
import java.io.File;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Main {
    /**
     * @param argv the command line arguments
     */
    public static void main(String argv[]) {
        Writer out;
        // Use an instance of Echo as the SAX event handler
        Echo handler = new Echo();
        // Use the default (non-validating) parser
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // Set up output stream
            out = new OutputStreamWriter(System.out, "UTF8");
            // Parse the input
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse(new File("/tmp/dblp.xml"), handler);
        } catch (Throwable t) {
            t.printStackTrace();
        }
        System.out.println("Incollections = " + handler.cnt);
        System.exit(0);
    }

    static class Echo extends DefaultHandler {
        // Number of <incollection> start tags seen so far
        public int cnt = 0;

        @Override
        public void startElement(String namespaceURI,
                String sName, // simple name
                String qName, // qualified name
                Attributes attrs)
                throws SAXException {
            if (qName.equals("incollection")) {
                cnt = cnt + 1;
            }
        }
    }
}
This works for me under Java 5, but I do get the OOM under Java 6.
I run it like this:
java -DentityExpansionLimit=512000 -jar xmltest.jar
And it prints:
Incollections = 8353
Which is convenient:
grep "<incollection" /tmp/dblp.xml | wc -l
8353
So, FYI, data point, etc.
Upvotes: 2
Reputation: 35341
I don't know the correct terminology for this, but how "deep" does your XML go? For example, the "author" tag in your example is 2 elements deep. If you have tags that are really really deep, maybe that's why you're having memory issues?
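You can measure the depth with SAX itself. Here is a small sketch; the handler name DepthProbe is made up, and it counts the root element as depth 1, so "author" in your sample would come out as 3.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: tracks the deepest element nesting seen.
final class DepthProbe extends DefaultHandler {
    private int depth = 0;
    int maxDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    // Parses the source and returns the maximum nesting depth (root element = 1).
    static int maxDepthOf(InputSource source) {
        DepthProbe probe = new DepthProbe();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, probe);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return probe.maxDepth;
    }
}
```

The handler keeps only two integers, so it should not add any memory pressure of its own.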
Upvotes: 0
Reputation: 346290
There seems to be a problem with HTML entities in your file, namely "Jos&eacute;" in the first block. At least my browser tells me there's a problem with it when I open the file, and XMLEntityScanner shows up in the stack trace. I'm not an XML expert, but could it be that HTML entities are not in fact defined for XML in general?
Edit: Yup, that's it. According to Wikipedia, entities like &eacute; are defined in the HTML DTD; XML has only a very small number of predefined entities.
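To see this in isolation, here is a small sketch using the JDK's SAX parser. The class name EntityDemo and the sample strings are invented for illustration. XML itself predefines only &lt;, &gt;, &amp;, &apos; and &quot;, so &eacute; has to be declared in a DTD (here an internal subset) before it can be referenced.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not taken from the question.
final class EntityDemo {

    // Returns the concatenated text of the document,
    // or null if the parser rejects it (e.g. an undeclared entity).
    static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return text.toString();
        } catch (SAXException e) {
            return null; // not well-formed for this parser
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // No DTD: &eacute; is unknown to XML, so parsing fails.
        System.out.println(textOf("<author>Jos&eacute;</author>"));
        // Entity declared in an internal DTD subset: parsing succeeds.
        System.out.println(textOf(
                "<!DOCTYPE author [<!ENTITY eacute \"&#233;\">]>"
                + "<author>Jos&eacute;</author>"));
    }
}
```

In a file that relies on such entities, the DOCTYPE has to point at a DTD that declares them, and the parser has to be able to find that DTD.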
Upvotes: 0
Reputation: 86502
There's a known bug in Java 1.6 that produces the exact same stack trace, and it's unfixed as of now. Newer Xerces versions seem to be fine.
For documents this large, which still contain a fair amount of structure, you could think about using pull-parsing, i.e. parsing of partial structures, for instance with StAX.
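As a rough idea of what pull-parsing could look like for the <author> elements, here is a minimal sketch with the standard StAX cursor API. The class name AuthorPuller and the callback-based forEachAuthor method are my own invention, and the sketch assumes that <author> contains only text, since getElementText() fails on nested elements.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of any library.
final class AuthorPuller {

    // Streams through the document and hands each <author> text to the sink,
    // so nothing is accumulated in memory by the parser loop itself.
    static void forEachAuthor(Reader xml, Consumer<String> sink) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "author".equals(reader.getLocalName())) {
                    // Reads up to the matching end tag; assumes text-only content.
                    sink.accept(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A call such as AuthorPuller.forEachAuthor(new java.io.FileReader("dblp.xml"), System.out::println) would print one author per line. For the real file, the DTD referenced by the document would also need to be resolvable so that its entities can be expanded.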
Upvotes: 6