Reputation: 89
I need to parse a large, complex XML file and write it to a flat file. Can you give me some advice?
File size: 500 MB. Record count: 100K. XML structure:
<Msg>
    <MsgHeader>
        <!--Some of the fields in the MsgHeader need to be mapped to a Java object-->
    </MsgHeader>
    <GroupA>
        <GroupAHeader/>
        <!--Some of the fields in the GroupAHeader need to be mapped to a Java object-->
        <GroupAMsg/>
        <!--50K records-->
        <GroupAMsg/>
        <GroupAMsg/>
        <GroupAMsg/>
    </GroupA>
    <GroupB>
        <GroupBHeader/>
        <GroupBMsg/>
        <!--50K records-->
        <GroupBMsg/>
        <GroupBMsg/>
        <GroupBMsg/>
    </GroupB>
</Msg>
Upvotes: 5
Views: 6017
Reputation: 1208
You can use the Declarative Stream Mapping (DSM) stream parsing library. It can process both JSON and XML, and it does not load the whole XML file into memory. DSM only processes the data that you define in a YAML or JSON config.
You can call a method while the XML is being read, which lets you process the XML piece by piece and deserialize each partially read piece into a Java object.
You can even use it to read in multiple threads.
You can find good examples in these answers:
Unmarshalling XML to three lists of different objects using STAX Parser
JAVA - Best approach to parse huge (extra large) JSON file (same for XML)
Upvotes: 0
Reputation: 792
If you accept a solution other than JAXB/Spring Batch, you may want to have a look at the SAX parser.
SAX is a more event-oriented way of parsing XML files and may be a good approach when you want to write directly into the target file while parsing. The SAX parser does not read the whole XML content into memory; instead it triggers callback methods when it encounters elements in the input stream. In my experience, this is a very memory-efficient way of processing.
In comparison to your StAX solution, SAX 'pushes' the data into your application. This means that you have to maintain the state yourself, such as which tag you are currently in. I'm not sure if that is something you really require.
The following example reads an XML file with your structure and prints out all text within GroupBMsg tags:
import java.io.FileReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class SaxExample implements ContentHandler
{
    // text of the GroupBMsg element currently being read
    private final StringBuilder currentValue = new StringBuilder();
    // true while the parser is between <GroupBMsg> and </GroupBMsg>
    private boolean insideGroupBMsg;

    public static void main(final String[] args) throws Exception
    {
        // XMLReaderFactory is deprecated since Java 9 (SAXParserFactory is the suggested replacement) but still works
        final XMLReader xmlReader = XMLReaderFactory.createXMLReader();
        xmlReader.setContentHandler(new SaxExample());
        // close the file once parsing has finished
        try (FileReader reader = new FileReader("datasource.xml"))
        {
            xmlReader.parse(new InputSource(reader));
        }
    }

    @Override
    public void characters(final char[] ch, final int start, final int length) throws SAXException
    {
        // the parser may deliver the text of one element in several chunks, so append instead of overwriting
        if (insideGroupBMsg)
        {
            currentValue.append(ch, start, length);
        }
    }

    @Override
    public void startElement(final String uri, final String localName, final String qName, final Attributes atts) throws SAXException
    {
        // react on the beginning of tag "GroupBMsg" <GroupBMsg>
        if (localName.equals("GroupBMsg"))
        {
            insideGroupBMsg = true;
            currentValue.setLength(0);
        }
    }

    @Override
    public void endElement(final String uri, final String localName, final String qName) throws SAXException
    {
        // react on the ending of tag "GroupBMsg" </GroupBMsg>
        if (localName.equals("GroupBMsg"))
        {
            insideGroupBMsg = false;
            // TODO: write into file
            System.out.println(currentValue);
        }
    }

    // the rest is boilerplate code for sax
    @Override
    public void endDocument() throws SAXException {}

    @Override
    public void endPrefixMapping(final String prefix) throws SAXException {}

    @Override
    public void ignorableWhitespace(final char[] ch, final int start, final int length)
        throws SAXException {}

    @Override
    public void processingInstruction(final String target, final String data)
        throws SAXException {}

    @Override
    public void setDocumentLocator(final Locator locator) { }

    @Override
    public void skippedEntity(final String name) throws SAXException {}

    @Override
    public void startDocument() throws SAXException {}

    @Override
    public void startPrefixMapping(final String prefix, final String uri)
        throws SAXException {}
}
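To fill in the write-to-file step marked TODO in the example, here is a minimal sketch. It assumes that each GroupBMsg contains only text and that one line per record is an acceptable flat-file layout. The class name SaxToFlatFile and the method convert are made up for illustration, and it uses DefaultHandler only to avoid the boilerplate callbacks.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original answer
final class SaxToFlatFile {

    /** Streams the XML from in and writes the text of every GroupBMsg element as one line to out. */
    static int convert(Reader in, Writer out) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        int[] count = {0};
        factory.newSAXParser().parse(new InputSource(in), new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inRecord;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("GroupBMsg".equals(localName)) {
                    inRecord = true;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inRecord) {
                    text.append(ch, start, length); // may be called several times per element
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if ("GroupBMsg".equals(localName)) {
                    inRecord = false;
                    try {
                        out.write(text.toString().trim());
                        out.write('\n');
                        count[0]++;
                    } catch (IOException e) {
                        throw new SAXException(e); // SAX callbacks can only throw SAXException
                    }
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Msg><GroupB><GroupBHeader/><GroupBMsg>b1</GroupBMsg>"
                + "<GroupBMsg>b2</GroupBMsg></GroupB></Msg>";
        StringWriter out = new StringWriter();
        int written = convert(new StringReader(xml), out);
        System.out.print(out);
        System.out.println(written + " records written");
    }
}
```

For the real 500 MB file you would pass a buffered file reader and a buffered file writer instead of the in-memory strings, so that only one record is held in memory at a time.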
Upvotes: 0
Reputation: 61
Within Spring Batch, I've written my own StAX event item reader implementation that operates a bit more specifically than the ones previously mentioned. Basically, I just stuff elements into a map and then pass them into the ItemProcessor. From there, you're free to transform the "GatheredElement" into a single object (see CompositeItemProcessor). Apologies for the bit of copy/paste from the StaxEventItemReader, but I don't think it's avoidable.
From here, you're free to use whatever OXM marshaller you'd like; I happen to use JAXB as well.
public class ElementGatheringStaxEventItemReader<T> extends StaxEventItemReader<T> {
    private Map<String, String> gatheredElements;
    private Set<String> elementsToGather;
    ...
    @Override
    protected boolean moveCursorToNextFragment(XMLEventReader reader) throws NonTransientResourceException {
        try {
            while (true) {
                // skip everything that is not a start element
                while (reader.peek() != null && !reader.peek().isStartElement()) {
                    reader.nextEvent();
                }
                if (reader.peek() == null) {
                    return false;
                }
                QName startElementName = ((StartElement) reader.peek()).getName();
                if (elementsToGather.contains(startElementName.getLocalPart())) {
                    reader.nextEvent(); // move past the actual start element
                    XMLEvent dataEvent = reader.nextEvent();
                    // an empty or nested element has no text event here, so check before casting
                    if (dataEvent.isCharacters()) {
                        gatheredElements.put(startElementName.getLocalPart(), dataEvent.asCharacters().getData());
                    }
                    continue;
                }
                if (startElementName.getLocalPart().equals(fragmentRootElementName)) {
                    if (fragmentRootElementNameSpace == null || startElementName.getNamespaceURI().equals(fragmentRootElementNameSpace)) {
                        return true;
                    }
                }
                reader.nextEvent();
            }
        } catch (XMLStreamException e) {
            throw new NonTransientResourceException("Error while reading from event reader", e);
        }
    }

    @SuppressWarnings("unchecked")
    @Override
    protected T doRead() throws Exception {
        T item = super.doRead();
        if (null == item)
            return null;
        // copy the map so that elements gathered later do not change this item
        T result = (T) new GatheredElementItem<T>(item, new HashMap<>(gatheredElements));
        if (log.isDebugEnabled())
            log.debug("Read GatheredElementItem: " + result);
        return result;
    }
}
The gathered element class is pretty basic:
public class GatheredElementItem<T> {
private final T item;
private final Map<String, String> gatheredElements;
...
}
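For readers who want to try the gathering idea outside Spring Batch, here is a rough sketch using only the JDK's StAX event API. It is not the reader above: the names GatherSketch, Gathered and forEachRecord are invented, and it assumes that both the gathered elements and the records contain text only. Header values seen so far are kept in a map, and each record is handed to a callback together with a snapshot of that map.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical stand-alone sketch of the "gather elements, then emit items" idea
final class GatherSketch {

    /** A record's text plus a snapshot of the header values gathered before it. */
    static final class Gathered {
        final String item;
        final Map<String, String> gatheredElements;

        Gathered(String item, Map<String, String> gatheredElements) {
            this.item = item;
            this.gatheredElements = gatheredElements;
        }
    }

    static void forEachRecord(Reader in, String recordName, Set<String> elementsToGather,
                              Consumer<Gathered> sink) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        Map<String, String> gathered = new HashMap<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue;
                }
                String name = event.asStartElement().getName().getLocalPart();
                if (elementsToGather.contains(name)) {
                    // getElementText() reads up to the matching end tag of a text-only element
                    gathered.put(name, reader.getElementText());
                } else if (name.equals(recordName)) {
                    // copy the map so later header values do not change earlier items
                    sink.accept(new Gathered(reader.getElementText(), new HashMap<>(gathered)));
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Msg><GroupA><GroupAHeader><Info1>x</Info1></GroupAHeader>"
                + "<GroupAMsg>a1</GroupAMsg><GroupAMsg>a2</GroupAMsg></GroupA></Msg>";
        forEachRecord(new StringReader(xml), "GroupAMsg", Set.of("Info1"),
                g -> System.out.println(g.item + " " + g.gatheredElements));
    }
}
```

Passing a callback instead of returning a list keeps memory use flat, since each record can be written out as soon as it is read.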
Upvotes: 1
Reputation: 3294
Give an ETL tool a try, for example Pentaho Data Integration (AKA Kettle).
Upvotes: 0
Reputation: 89
In the end, I implemented a customized StaxEventItemReader.
Configure fragmentRootElementName.
Configure my own manualHandleElement property:
<property name="manualHandleElement">
    <list>
        <map>
            <entry>
                <key><value>startElementName</value></key>
                <value>GroupA</value>
            </entry>
            <entry>
                <key><value>endElementName</value></key>
                <value>GroupAHeader</value>
            </entry>
            <entry>
                <key><value>elementNameList</value></key>
                <list>
                    <value>/GroupAHeader/Info1</value>
                    <value>/GroupAHeader/Info2</value>
                </list>
            </entry>
        </map>
    </list>
</property>
Add the following fragment to MyStaxEventItemReader.doRead():
while (true) {
    XMLEvent next = reader.peek();
    // stop at the end of the stream; calling nextEvent() there would throw
    if (next == null) {
        break;
    }
    if (next.isStartElement()) {
        pathList.add("/" + next.asStartElement().getName().getLocalPart());
        reader.nextEvent();
        continue;
    }
    if (next.isEndElement()) {
        String endName = next.asEndElement().getName().getLocalPart();
        // remove the most recently opened element with this name, not the first match
        int last = pathList.lastIndexOf("/" + endName);
        if (last >= 0) {
            pathList.remove(last);
        }
        if (isManualHandleEndElement(endName)) {
            pathList.clear();
            reader.nextEvent();
            break;
        }
        reader.nextEvent();
        continue;
    }
    if (next.isCharacters()) {
        // Characters is the public StAX type for text events
        Characters charEvent = reader.nextEvent().asCharacters();
        String currentPath = getCurrentPath(pathList);
        String startElementName = (String) currentManualHandleStartElement.get(MANUAL_HANDLE_START_ELEMENT_NAME);
        for (Object s : (List<?>) currentManualHandleStartElement.get(MANUAL_HANDLE_ELEMENT_NAME_LIST)) {
            if (("/" + startElementName + s).equals(currentPath)) {
                map.put(currentPath, charEvent.getData());
                break;
            }
        }
        continue;
    }
    reader.nextEvent();
}
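The path bookkeeping can also be kept on a stack (a Deque), where closing a tag always pops the most recently opened element. Below is a small self-contained sketch of that variant. The class name PathTrackingSketch and the method collect are hypothetical, and unlike my configuration above it takes full paths from the document root, such as /Msg/GroupA/GroupAHeader/Info1.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch: track the current element path with a stack
final class PathTrackingSketch {

    /** Reads until the end tag named stopAtEndElement and returns the text found at the wanted paths. */
    static Map<String, String> collect(Reader in, Set<String> wantedPaths, String stopAtEndElement)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        Deque<String> path = new ArrayDeque<>();
        Map<String, String> values = new LinkedHashMap<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                path.addLast(event.asStartElement().getName().getLocalPart());
            } else if (event.isEndElement()) {
                // in well-formed XML the closing tag always matches the top of the stack
                String closed = path.removeLast();
                if (closed.equals(stopAtEndElement)) {
                    break;
                }
            } else if (event.isCharacters() && !event.asCharacters().isWhiteSpace()) {
                String currentPath = "/" + String.join("/", path);
                if (wantedPaths.contains(currentPath)) {
                    // text may arrive in several events, so concatenate
                    values.merge(currentPath, event.asCharacters().getData(), String::concat);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Msg><GroupA><GroupAHeader><Info1>x</Info1><Info2>y</Info2></GroupAHeader>"
                + "<GroupAMsg>a1</GroupAMsg></GroupA></Msg>";
        Map<String, String> header = collect(new StringReader(xml),
                Set.of("/Msg/GroupA/GroupAHeader/Info1", "/Msg/GroupA/GroupAHeader/Info2"),
                "GroupAHeader");
        System.out.println(header);
    }
}
```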
Upvotes: 0
Reputation: 1037
I haven't dealt with XML files this large, but since you want to parse the XML and write to a flat file, I would combine XML pull parsing with code that writes to the flat file as it reads (this might help), so that the Java heap is not exhausted. A quick Google search will turn up tutorials and sample code on XML pull parsing.
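As a starting point, here is a minimal sketch of pull parsing with the JDK's StAX cursor API (XMLStreamReader), writing one line per record as it goes. Treat it as an illustration under assumptions: the records are taken to be text-only GroupAMsg/GroupBMsg elements, the name|text line layout is invented, and the class name PullToFlatFile is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: pull-parse the XML and write one flat-file line per record
final class PullToFlatFile {

    static int convert(Reader in, Writer out, Set<String> recordNames)
            throws XMLStreamException, IOException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (xml.hasNext()) {
                // the application asks for the next event; nothing is pushed at it
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && recordNames.contains(xml.getLocalName())) {
                    String name = xml.getLocalName(); // read before getElementText() moves the cursor
                    out.write(name + "|" + xml.getElementText().trim() + "\n");
                    count++;
                }
            }
        } finally {
            xml.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Msg><GroupA><GroupAHeader/><GroupAMsg>a1</GroupAMsg></GroupA>"
                + "<GroupB><GroupBHeader/><GroupBMsg>b1</GroupBMsg></GroupB></Msg>";
        StringWriter out = new StringWriter();
        int written = convert(new StringReader(xml), out, Set.of("GroupAMsg", "GroupBMsg"));
        System.out.print(out);
        System.out.println(written + " records written");
    }
}
```

With a real file you would pass Files.newBufferedReader(...) and Files.newBufferedWriter(...) instead of the in-memory strings; only the current record is held in memory, so the heap stays small regardless of file size.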
Upvotes: 0