user1919035

Reputation: 227

Huge XML file to text files

I have a huge XML file (15 GB). I want to convert each 'text' tag in the XML file into a single page (one output file per tag).

Sample XML file:

<root>
    <page>
        <id> 1 </id>
        <text>
        .... 1000 to 50000 lines of text
        </text>
    </page>
    ... and likewise, 2 million `page` tags
</root>

I initially used a DOM parser, but it throws a Java out-of-memory error (as expected). Now I've written Java code using StAX. It works well, but performance is really slow.

This is the code I've written:

    import java.io.FileInputStream;

    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.events.Characters;
    import javax.xml.stream.events.EndElement;
    import javax.xml.stream.events.StartElement;
    import javax.xml.stream.events.XMLEvent;

    XMLEventReader xmlEventReader = XMLInputFactory.newInstance()
            .createXMLEventReader(new FileInputStream(filePath));
    boolean isText = false;
    String pageContent = "";

    while (xmlEventReader.hasNext()) {
        XMLEvent xmlEvent = xmlEventReader.nextEvent();

        switch (xmlEvent.getEventType()) {
        case XMLStreamConstants.START_ELEMENT:
            // Compare element names with equals(), not ==
            String elementStart = ((StartElement) xmlEvent).getName().getLocalPart();
            if ("text".equals(elementStart))
                isText = true;
            break;
        case XMLStreamConstants.CHARACTERS:
            Characters chars = (Characters) xmlEvent;
            if (isText && !(chars.isWhiteSpace() || chars.isIgnorableWhiteSpace()))
                pageContent += chars.getData() + '\n';
            break;
        case XMLStreamConstants.END_ELEMENT:
            String elementEnd = ((EndElement) xmlEvent).getName().getLocalPart();
            if ("text".equals(elementEnd)) {
                createFile(id, pageContent); // id as read from the <id> element (elided here)
                pageContent = "";
                isText = false;
            }
            break;
        }
    }

This code works well (please ignore any minor errors). According to my understanding, XMLStreamConstants.CHARACTERS fires for each and every line of the text tag: if a text tag has 10,000 lines in it, XMLStreamConstants.CHARACTERS fires 10,000 times. Is there any better way to improve the performance?

Upvotes: 9

Views: 948

Answers (6)

Richard Miskin

Reputation: 1260

I can see a few things that might help you out:

  1. Use a BufferedInputStream rather than a plain FileInputStream to reduce the number of disk operations.
  2. Consider using a StringBuilder to build your pageContent rather than String concatenation.
  3. Increase your Java heap (the -Xmx option) in case you're memory-bound with your 15 GB file.

It can be quite interesting in cases like this to hook up a code profiler (e.g. Java VisualVM) as you are then able to see exactly what method calls are being slow within your code. You can then focus optimisations appropriately.
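For illustration, here is a minimal sketch of points 1 and 2 applied to the question's snippet (the 64 KB buffer and 1 MB initial capacity are guesses to tune; filePath, chars, id and createFile are the question's own names):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLInputFactory;

    // Buffered reads: the parser pulls 64 KB chunks from the OS
    // instead of issuing many small read() calls.
    XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(
            new BufferedInputStream(new FileInputStream(filePath), 64 * 1024));

    // Amortised O(1) appends instead of O(n) string copies.
    StringBuilder pageContent = new StringBuilder(1024 * 1024);

    // In the CHARACTERS case:
    pageContent.append(chars.getData()).append('\n');

    // In the END_ELEMENT case for </text>:
    createFile(id, pageContent.toString());
    pageContent.setLength(0); // reset and reuse the builder rather than reallocating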

Upvotes: 4

Jason C

Reputation: 40335

What is pageContent? It appears to be a String. One easy optimization to make right away would be to use a StringBuilder instead; it can append strings without making a completely new copy each time, the way String += does (you can also construct it with an initial reserved capacity to reduce memory reallocations and copies if you have an idea of the length to begin with).

Concatenating Strings is a slow operation because strings are immutable in Java: each time you call a += b, it must allocate a new string, copy a into it, then copy b onto the end of it, making each concatenation O(n) in the total length of the two strings. The same goes for appending single characters. A StringBuilder, on the other hand, has the same performance characteristics as an ArrayList when appending. So where you have:

pageContent += chars.getData() + '\n';

Instead change pageContent to a StringBuilder and do:

pageContent.append(chars.getData()).append('\n');

Also if you have a guess on the upper bound of the length of one of these strings, you can pass it to the StringBuilder constructor to allocate an initial amount of capacity and reduce the chance of a memory reallocation and full copy having to be done.

Another option, by the way, is to skip the StringBuilder altogether and write your data directly to your output file (presuming you're not processing the data somehow first). If you do this, and performance is I/O-bound, choosing an output file on a different physical disk can help.

Upvotes: 1

xlm

Reputation: 7594

If parsing the XML file is the main issue, consider using VTD-XML, namely the extended version, as it supports files up to 256 GB.

As it is based on non-extractive document parsing, it is quite memory-efficient, and using it to query/extract text via XPath is also very fast. You can read more details about this approach and VTD-XML here.
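For a flavour of the API, a rough sketch using standard VTD-XML (the extended edition provides parallel classes such as VTDGenHuge for files beyond the 2 GB limit; check its documentation for the exact equivalents):

    import com.ximpleware.AutoPilot;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;

    VTDGen vg = new VTDGen();
    if (vg.parseFile("pages.xml", false)) {    // false = namespace-unaware
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("/root/page/text");
        while (ap.evalXPath() != -1) {         // -1 means no more matches
            int t = vn.getText();              // token index of the text content
            if (t != -1) {
                String text = vn.toString(t);  // text is only extracted on demand
                // write text out to its own file here
            }
        }
    }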

Upvotes: 2

user207421

Reputation: 310893

  1. Use a BufferedInputStream around the FileInputStream.
  2. Don't concatenate the data. It's a complete waste of time and space, potentially a lot of space. Write it out as soon as you get it, using a BufferedWriter around a FileWriter (see the sketch below).
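A minimal sketch of that idea inside the question's event loop (the file-naming scheme is invented for illustration; id and chars are the question's own variables):

    import java.io.BufferedWriter;
    import java.io.FileWriter;

    // When <text> starts, open one writer for this page:
    BufferedWriter out = new BufferedWriter(new FileWriter("page-" + id + ".txt"));

    // In the CHARACTERS case, write each chunk straight out
    // instead of accumulating it in memory:
    out.write(chars.getData());
    out.newLine();

    // In the END_ELEMENT case for </text>, flush and release the file:
    out.close();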

Upvotes: 0

Shriram

Reputation: 4411

Try parsing with a SAX parser, because DOM tries to parse the entire content and place it in memory, which is why you are getting the memory exception. A SAX parser does not parse the entire content in one stretch.
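For reference, a rough sketch of the same extraction with SAX (org.xml.sax ships with the JDK; filePath and createFile are the question's own names):

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    DefaultHandler handler = new DefaultHandler() {
        private final StringBuilder page = new StringBuilder();
        private boolean inText = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("text".equals(qName)) inText = true;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) page.append(ch, start, length); // may fire many times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("text".equals(qName)) {
                // createFile(id, page.toString()); // hand off to the question's helper
                page.setLength(0);
                inText = false;
            }
        }
    };
    SAXParserFactory.newInstance().newSAXParser().parse(new File(filePath), handler);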

Upvotes: 1

Hirak

Reputation: 3649

Your code looks standard. However, could you try wrapping your FileInputStream in a BufferedInputStream and let us know if that helps? A BufferedInputStream saves you a few native calls to the OS, so there is a chance of better performance. You will have to play around with the buffer size to get optimum performance; set a size depending on your JVM memory allocation.

Upvotes: 0
