Reputation: 93

RE: Big XML file

Followup question to Big XML File:

First, thanks a lot for your answers. After… what am I doing wrong? This is my class, which uses SAX:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXParserXML extends DefaultHandler {
    public static void ParcourXML() {
        DefaultHandler handler = new SAXParserXML();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            String URI = "dblp.xml";
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse(URI, handler);
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }

    // Both callbacks are intentionally empty: the handler ignores every event.
    @Override
    public void startElement(String namespaceURI, String simpleName, String qualifiedName, Attributes attrs) throws SAXException {
    }

    @Override
    public void endElement(String namespaceURI, String simpleName, String qualifiedName) throws SAXException {
    }
}

As you can see, I do nothing with the XML file, but it still gives this error:

java.lang.OutOfMemoryError: Java heap space
    at com.sun.org.apache.xerces.internal.util.XMLStringBuffer.append(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.refresh(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.invokeListeners(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at SAXParserXML.ParcourXML(SAXParserXML.java:30)
    at Main.main(Main.java:28)

I also tried StAX… the same error… what can I do? I also increased the Java heap size to 1260M:

java -Xmx1260M SAXParserXML

The XML file has this form:

<dblp> 
   <incollection> 
      <author>... </author> 
      .... 
      <author>... </author> 
      # other tags; I'm only interested in <author> #
      ... 
   </incollection> 
   <incollection> 
   # the same thing# 
   </incollection> 
   .... 
</dblp>

You can find the original file at http://dblp.uni-trier.de/xml/

Thanks

Upvotes: 5

Answers (5)

StaxMan

Reputation: 116502

It sounds like one of the text segments (or a CDATA section, processing instruction, or comment) in the XML file is very long, and the parser does not split it into multiple segments. Or it could be that the parser fails to parse the DOCTYPE declaration properly: if so, it might try reading all the XML content as if it were part of the DTD subset.

But that's just speculation. You mentioned that you have tried StAX: which implementation? JDK 1.6 comes with Sun Sjsxp. But you could also try Woodstox (http://woodstox.codehaus.org), which often handles things in a somewhat more robust way. So if you are not using Woodstox, you could see what happens with it. It does split text segments into smaller chunks unless you force text coalescing (which is not the default).

Oh, and just in case you were testing with the StAX reference implementation (http://stax.codehaus.org): it is unfortunately known to be very buggy, so that could cause problems. Both Sjsxp and Woodstox are much better choices for StAX.
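
To make the point about split text segments concrete, here is a minimal sketch using the JDK's javax.xml.stream API with coalescing left off. The class name AuthorTextReader and the method readAuthors are made up for illustration, and the code assumes that <author> elements contain only text, as in the sample from the question. Since the parser may hand over one text node as several CHARACTERS events, each chunk is appended to a buffer instead of being treated as the complete text.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class AuthorTextReader {
    static List<String> readAuthors(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Coalescing off: the parser is allowed to report text in several chunks.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> authors = new ArrayList<>();
        StringBuilder text = null; // non-null only while inside <author>
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if ("author".equals(reader.getLocalName())) {
                            text = new StringBuilder();
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (text != null) {
                            text.append(reader.getText()); // one chunk of possibly many
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (text != null && "author".equals(reader.getLocalName())) {
                            authors.add(text.toString().trim());
                            text = null;
                        }
                        break;
                    default:
                        break; // comments, processing instructions, etc. are ignored
                }
            }
        } finally {
            reader.close();
        }
        return authors;
    }
}
```

For a file as large as the one in the question you would probably handle each name as it is completed rather than collect them all in a list; the list only keeps the sketch short.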

Upvotes: 0

Will Hartung

Reputation: 118631

Well, given:

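import java.io.File;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Main {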

    /**
     * @param args the command line arguments
     */
    public static void main(String argv[]) {
        Writer out;

        // Use an instance of ourselves as the SAX event handler
        Echo handler = new Echo();
        // Use the default (non-validating) parser
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // Set up output stream
            out = new OutputStreamWriter(System.out, "UTF8");
            // Parse the input 
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse(new File("/tmp/dblp.xml"), handler);
        } catch (Throwable t) {
            t.printStackTrace();
        }
        System.out.println("Incollections = " + handler.cnt);
        System.exit(0);
    }

    static class Echo extends DefaultHandler {
        public int cnt = 0;
        @Override
        public void startElement(String namespaceURI,
                String sName, // simple name
                String qName, // qualified name
                Attributes attrs)
                throws SAXException {
            if (qName.equals("incollection")) {
                cnt = cnt + 1;
            }
        }
    }
}

This works for me under Java 5, but I do get the OOM under Java 6.

I run it like this:

java -DentityExpansionLimit=512000 -jar xmltest.jar

And it prints:

Incollections = 8353

Which is convenient:

grep "<incollection" /tmp/dblp.xml | wc -l
8353

So, FYI, data point, etc.

Upvotes: 2

Michael

Reputation: 35341

I don't know the correct terminology for this, but how "deep" does your XML go? For example, the "author" tag in your example is 2 elements deep. If you have tags that are really really deep, maybe that's why you're having memory issues?
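
If you want to check the nesting depth rather than guess, a small SAX handler can measure it in one pass. This is only a sketch: the class name DepthProbe and the method maxDepth are invented for illustration, and depth is counted so that the root element is 0, which matches "author is 2 elements deep" above.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class DepthProbe extends DefaultHandler {
    private int depth = 0;     // depth the next opened element will have; root is 0
    private int maxDepth = -1; // stays -1 until the first element is seen

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        maxDepth = Math.max(maxDepth, depth);
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    /** Parses the stream once and returns the depth of the most deeply nested element. */
    static int maxDepth(InputStream in) throws Exception {
        DepthProbe probe = new DepthProbe();
        SAXParserFactory.newInstance().newSAXParser().parse(in, probe);
        return probe.maxDepth;
    }
}
```

The handler keeps only two integers, so it does not add any memory pressure of its own.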

Upvotes: 0

Michael Borgwardt

Reputation: 346290

There seems to be a problem with HTML entities in your code, namely "Jos&eacute;" in the first block. At least my browser tells me there's a problem with it when I open the file, and XMLEntityScanner shows up in the stack trace. I'm not an XML expert, but could it be that HTML entities are not in fact defined for XML in general?

Edit: Yup, that's it. According to Wikipedia, entities like &eacute; are defined in the HTML DTD; XML has only a very small number of predefined entities.
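
A quick way to see the difference between predefined and declared entities is to feed the JDK's SAX parser three tiny documents. This is a sketch, not a diagnosis of the file in the question: the class name EntityDemo and the method textOf are made up, and the internal DTD subset below is just one way an entity such as &eacute; can be declared.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class EntityDemo {
    /** Parses the XML and returns all character data; throws if it is not well-formed. */
    static String textOf(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called several times per text node
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // &amp; is one of the five entities predefined by XML itself.
        System.out.println(textOf("<a>R&amp;D</a>"));

        // &eacute; is usable once a DTD (here an internal subset) declares it.
        String declared = "<!DOCTYPE a [<!ENTITY eacute \"&#233;\">]><a>Jos&eacute;</a>";
        System.out.println(textOf(declared));

        // With no declaration at all, the parser reports a fatal error.
        try {
            textOf("<a>Jos&eacute;</a>");
        } catch (SAXException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```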

Upvotes: 0

Torsten Marek

Reputation: 86502

There's a bug for Java 1.6 which shows the exact same stack trace, and it's unfixed as of now. Newer Xerces versions seem to be fine.

For documents this large, which still contain a fair amount of structure, you could think about using pull-parsing, i.e. parsing of partial structures, for instance with StAX.
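
As a rough illustration of pull-parsing partial structures, the sketch below uses the JDK's javax.xml.stream event API to walk the file one <incollection> at a time and hand each record's authors to a callback. The names IncollectionPullParser, forEachRecord and readRecord are invented for this example, and it assumes <author> contains only text, as in the sample from the question.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
class IncollectionPullParser {
    /** Calls onRecord once per <incollection> with that record's author names. */
    static void forEachRecord(InputStream in, Consumer<List<String>> onRecord)
            throws XMLStreamException {
        XMLEventReader events = XMLInputFactory.newInstance().createXMLEventReader(in);
        try {
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (event.isStartElement()
                        && "incollection".equals(event.asStartElement().getName().getLocalPart())) {
                    onRecord.accept(readRecord(events));
                }
            }
        } finally {
            events.close();
        }
    }

    // Pulls events until the closing </incollection>, keeping only author text.
    private static List<String> readRecord(XMLEventReader events) throws XMLStreamException {
        List<String> authors = new ArrayList<>();
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent();
            if (event.isStartElement()
                    && "author".equals(event.asStartElement().getName().getLocalPart())) {
                authors.add(events.getElementText().trim());
            } else if (event.isEndElement()
                    && "incollection".equals(event.asEndElement().getName().getLocalPart())) {
                break;
            }
        }
        return authors;
    }
}
```

Only one record's worth of data is held at a time, so memory use should stay flat as long as the callback does not keep every list it receives.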

Upvotes: 6
