Weber

Reputation: 89

How to parse a large, complex XML file

I need to parse a large, complex XML file and write the result to a flat file. Can you give me some advice?

File size: 500 MB. Record count: 100K. XML structure:

<Msg>

    <MsgHeader>
        <!--Some of the fields in the MsgHeader need to be mapped to a Java object-->
    </MsgHeader>

    <GroupA> 
        <GroupAHeader/>
        <!--Some of the fields in the GroupAHeader need to be mapped to a Java object-->
        <GroupAMsg/>
        <!--50K records--> 
        <GroupAMsg/> 
        <GroupAMsg/> 
        <GroupAMsg/> 
    </GroupA>

    <GroupB> 
        <GroupBHeader/> 
        <GroupBMsg/>
        <!--50K records--> 
        <GroupBMsg/> 
        <GroupBMsg/> 
        <GroupBMsg/> 
    </GroupB>

</Msg>

Upvotes: 5

Views: 6017

Answers (6)

mfe

Reputation: 1208

You can use the Declarative Stream Mapping (DSM) stream parsing library. It can process both JSON and XML, and it doesn't load the whole XML file into memory. DSM only processes the data that you define in a YAML or JSON config.

You can have it call a method while it reads the XML, which lets you process the document in parts. You can deserialize this partially read XML data into Java objects.

You can even use it to read in multiple threads.

You can find good examples in these answers:

Unmarshalling XML to three lists of different objects using STAX Parser

JAVA - Best approach to parse huge (extra large) JSON file (same for XML)

Upvotes: 0

roemer

Reputation: 792

If you accept a solution outside JAXB/Spring Batch, you may want to have a look at the SAX parser.

This is a more event-oriented way of parsing XML files and may be a good approach when you want to write directly into the target file while parsing. The SAX parser does not read the whole XML content into memory but calls handler methods when it encounters elements in the input stream. In my experience, this is a very memory-efficient way of processing.

In comparison to your StAX solution, SAX 'pushes' the data into your application. This means that you have to maintain the state yourself (such as which tag you are currently in), so you have to keep track of your current location. I'm not sure if that is something you really require.

The following example reads an XML file with your structure and prints the text within GroupBMsg tags:

import java.io.FileReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class SaxExample implements ContentHandler
{
    // Collects the text of the GroupBMsg element that is currently being read
    private final StringBuilder currentValue = new StringBuilder();
    private boolean insideGroupBMsg;

    public static void main(final String[] args) throws Exception
    {
        final XMLReader xmlReader = XMLReaderFactory.createXMLReader();

        final FileReader reader = new FileReader("datasource.xml");
        final InputSource inputSource = new InputSource(reader);

        xmlReader.setContentHandler(new SaxExample());
        xmlReader.parse(inputSource);
    }

    @Override
    public void characters(final char[] ch, final int start, final int length) throws SAXException
    {
        // The parser may deliver the text of one element in several chunks,
        // so append each chunk instead of overwriting the previous one
        if (insideGroupBMsg)
        {
            currentValue.append(ch, start, length);
        }
    }

    @Override
    public void startElement(final String uri, final String localName, final String qName, final Attributes atts) throws SAXException
    {
        // react on the beginning of tag "GroupBMsg" <GroupBMsg>
        if (localName.equals("GroupBMsg"))
        {
            insideGroupBMsg = true;
            currentValue.setLength(0);
        }
    }

    @Override
    public void endElement(final String uri, final String localName, final String qName) throws SAXException
    {
        // react on the ending of tag "GroupBMsg" </GroupBMsg>
        if (localName.equals("GroupBMsg"))
        {
            insideGroupBMsg = false;
            // TODO: write into file
            System.out.println(currentValue);
        }
    }


    // the rest is boilerplate code for sax

    @Override
    public void endDocument() throws SAXException {}
    @Override
    public void endPrefixMapping(final String prefix) throws SAXException {}
    @Override
    public void ignorableWhitespace(final char[] ch, final int start, final int length)
        throws SAXException {}
    @Override
    public void processingInstruction(final String target, final String data)
        throws SAXException {}
    @Override
    public void setDocumentLocator(final Locator locator) {  }
    @Override
    public void skippedEntity(final String name) throws SAXException {}
    @Override
    public void startDocument() throws SAXException {}
    @Override
    public void startPrefixMapping(final String prefix, final String uri)
      throws SAXException {}
}
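
The state tracking mentioned above (knowing which tag you are currently in) can be sketched with a stack of element names. This is only an illustration, not part of the original example: the class name SaxPathExample and the path "Msg/GroupB/GroupBMsg" are assumptions based on the structure in the question, and it assumes each GroupBMsg holds plain text. It extends DefaultHandler to avoid the empty boilerplate methods. For a real 500 MB file you would write each record out instead of collecting it in a list.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxPathExample extends DefaultHandler {
    // Names of the elements that are currently open, outermost first
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    // Collected only to keep the sketch small; write to the flat file instead
    private final List<String> records = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only react to GroupBMsg elements that sit directly under Msg/GroupB
        if (String.join("/", path).equals("Msg/GroupB/GroupBMsg")) {
            records.add(text.toString().trim());
        }
        path.removeLast();
        text.setLength(0);
    }

    public static List<String> parse(InputSource source) throws Exception {
        SaxPathExample handler = new SaxPathExample();
        // The default factory is not namespace-aware, so qName carries the tag name
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Msg><GroupA><GroupAMsg>a1</GroupAMsg></GroupA>"
                + "<GroupB><GroupBHeader>h</GroupBHeader>"
                + "<GroupBMsg>b1</GroupBMsg><GroupBMsg>b2</GroupBMsg></GroupB></Msg>";
        System.out.println(parse(new InputSource(new StringReader(xml)))); // prints [b1, b2]
    }
}
```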

Upvotes: 0

Jason Griebeler

Reputation: 61

Within Spring Batch, I've written my own StAX event item reader implementation that operates a bit more specifically than the ones previously mentioned. Basically, I just stuff elements into a map and then pass them into the ItemProcessor. From there, you're free to transform the "GatheredElement" into a single object (see CompositeItemProcessor). Apologies for the bit of copy/paste from the StaxEventItemReader, but I don't think it's avoidable.

From here, you're free to use whatever OXM marshaller you'd like; I happen to use JAXB as well.

public class ElementGatheringStaxEventItemReader<T> extends StaxEventItemReader<T> {
    private Map<String, String> gatheredElements;
    private Set<String> elementsToGather;
    ...
    @Override
    protected boolean moveCursorToNextFragment(XMLEventReader reader) throws NonTransientResourceException {
        try { 
            while (true) {
                while (reader.peek() != null && !reader.peek().isStartElement()) {
                    reader.nextEvent();
                }
                if (reader.peek() == null) {
                    return false;
                }
                QName startElementName = ((StartElement) reader.peek()).getName();
                if(elementsToGather.contains(startElementName.getLocalPart())) {
                    reader.nextEvent(); // move past the actual start element
                    XMLEvent dataEvent = reader.nextEvent();
                    gatheredElements.put(startElementName.getLocalPart(), dataEvent.asCharacters().getData());
                    continue;
                }
                if (startElementName.getLocalPart().equals(fragmentRootElementName)) {
                    if (fragmentRootElementNameSpace == null || startElementName.getNamespaceURI().equals(fragmentRootElementNameSpace)) {
                        return true;
                    }
                }
                reader.nextEvent();

            }
        } catch (XMLStreamException e) {
            throw new NonTransientResourceException("Error while reading from event reader", e);
        }
    }

    @SuppressWarnings("unchecked")
    @Override
    protected T doRead() throws Exception {
        T item = super.doRead();
        if(null == item)
            return null;
        T result = (T) new GatheredElementItem<T>(item, new HashedMap(gatheredElements));
        if(log.isDebugEnabled())
            log.debug("Read GatheredElementItem: " + result);
        return result; 
    }
}

The gathered element class is pretty basic:

public class GatheredElementItem<T> {
    private final T item;
    private final Map<String, String> gatheredElements;
    ...
}

Upvotes: 1

Otto

Reputation: 3294

Try an ETL tool such as

Pentaho Data Integration (AKA Kettle)

Upvotes: 0

Weber

Reputation: 89

In the end, I implemented a customized StaxEventItemReader.

  1. Configure fragmentRootElementName

  2. Configure my own manualHandleElement

    <property name="manualHandleElement">
    <list>
        <map>
            <entry>
                <key><value>startElementName</value></key>
                <value>GroupA</value>
            </entry>
            <entry>
                <key><value>endElementName</value></key>
                <value>GroupAHeader</value>
            </entry>
            <entry>
                <key><value>elementNameList</value></key>
                    <list>
                            <value>/GroupAHeader/Info1</value>
                            <value>/GroupAHeader/Info2</value>
                    </list>
            </entry>
        </map>
    </list>
    </property>

  3. Add the following fragment to MyStaxEventItemReader.doRead()

    while(true){
        if(reader.peek() != null && reader.peek().isStartElement()){
            pathList.add("/"+((StartElement) reader.peek()).getName().getLocalPart());
            reader.nextEvent();
            continue;
        }
        if(reader.peek() != null && reader.peek().isEndElement()){
            pathList.remove("/"+((EndElement) reader.peek()).getName().getLocalPart());
            if(isManualHandleEndElement(((EndElement) reader.peek()).getName().getLocalPart())){
                pathList.clear();
                reader.nextEvent();
                break;
            }
            reader.nextEvent();
            continue;
        }
        if(reader.peek() != null && reader.peek().isCharacters()){
            CharacterEvent charEvent = (CharacterEvent)reader.nextEvent();
            String currentPath = getCurrentPath(pathList);
            String startElementName = (String)currentManualHandleStartElement.get(MANUAL_HANDLE_START_ELEMENT_NAME);
            for(Object s : (List)currentManualHandleStartElement.get(MANUAL_HANDLE_ELEMENT_NAME_LIST)){
                if(("/"+startElementName+s).equals(currentPath)){
                    map.put(getCurrentPath(pathList), charEvent.getData());
                    break;
                }
            }
            continue;
        }

        reader.nextEvent();
    }
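
The path-tracking idea in the fragment above can be tried outside Spring Batch with a plain XMLEventReader. The following is a minimal, self-contained sketch and not the actual MyStaxEventItemReader: the class HeaderGatherer and its gather method are hypothetical names, and the paths here start at the document root (/Msg/...) rather than at the configured start element. Info1 and Info2 are taken from the config above; their text content in the sample is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class HeaderGatherer {
    /** Reads events until endElementName closes and returns the text found at the wanted paths. */
    public static Map<String, String> gather(XMLEventReader reader, String endElementName,
                                             Set<String> wantedPaths) throws Exception {
        List<String> pathList = new ArrayList<>();
        Map<String, String> values = new LinkedHashMap<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                pathList.add("/" + event.asStartElement().getName().getLocalPart());
            } else if (event.isEndElement()) {
                // Elements close in reverse order of opening, so drop the last entry
                if (!pathList.isEmpty()) {
                    pathList.remove(pathList.size() - 1);
                }
                if (event.asEndElement().getName().getLocalPart().equals(endElementName)) {
                    break; // header is complete; the record elements are left unread
                }
            } else if (event.isCharacters()) {
                String currentPath = String.join("", pathList);
                if (wantedPaths.contains(currentPath)) {
                    // Text may arrive in several events, so concatenate the pieces
                    values.merge(currentPath, event.asCharacters().getData(), String::concat);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Msg><GroupA><GroupAHeader><Info1>x</Info1><Info2>y</Info2>"
                + "<Other>z</Other></GroupAHeader><GroupAMsg>a1</GroupAMsg></GroupA></Msg>";
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        Set<String> wanted = new HashSet<>(Arrays.asList(
                "/Msg/GroupA/GroupAHeader/Info1", "/Msg/GroupA/GroupAHeader/Info2"));
        System.out.println(gather(reader, "GroupAHeader", wanted));
    }
}
```

After gather returns, the reader is positioned just behind the closing GroupAHeader tag, so the GroupAMsg records can be read from the same reader afterwards.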

Upvotes: 0

Waleed Almadanat

Reputation: 1037

I haven't dealt with such huge file sizes, but since you want to parse the XML and write it to a flat file, I would suggest a combination of XML pull parsing and careful code for writing the flat file (this might help), so that you don't exhaust the Java heap. A quick Google search will turn up tutorials and sample code on XML pull parsing.
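
As a rough illustration of the pull-parsing approach, the sketch below uses the StAX XMLStreamReader from the JDK and writes one line per record as soon as it is read, so only the current record is held in memory. The class name PullToFlatFile and the convert method are made up for this example, and it assumes that each record element (for example GroupBMsg) contains only text; getElementText() throws an exception if it meets child elements.

```java
import java.io.BufferedWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullToFlatFile {
    /** Writes one line per record element and returns the number of lines written. */
    public static int convert(Reader xml, Writer out, String recordElement) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        BufferedWriter writer = new BufferedWriter(out);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // The application pulls the next event instead of being called back
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(recordElement)) {
                    // Assumption: the record holds only text, no child elements
                    writer.write(reader.getElementText().trim());
                    writer.write('\n');
                    count++;
                }
            }
        } finally {
            reader.close();
            writer.flush();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Msg><GroupB><GroupBHeader>h</GroupBHeader>"
                + "<GroupBMsg>b1</GroupBMsg><GroupBMsg>b2</GroupBMsg></GroupB></Msg>";
        StringWriter out = new StringWriter();
        int written = convert(new StringReader(xml), out, "GroupBMsg");
        System.out.print(out);
        System.out.println(written + " records written");
    }
}
```

For the real file, you would pass a reader on the XML file and a writer on the target flat file instead of the in-memory StringReader and StringWriter.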

Upvotes: 0
