Reputation: 346

Split XML stream by XML documents

I have a stream of schemaless XML documents like this:

<?xml version="1.0" encoding="UTF-8"?>
<message id="1">
    <text>aaaaaaa</text>
</message>
<kuku>
    bbbbb
</kuku>
<?xml version="1.0" encoding="UTF-8"?>
<other_message id="3">
    <text>ccccc</text>
</other_message>

I need to parse the documents in streaming mode. Wrapping the stream in a single root XML element doesn't work, because StAX fails when it meets an <?xml ... ?> declaration inside the document. Wrapping could still work if I were able to skip these declarations in the input stream.

All the documents can be different, so there is no common end_document XML element.

Upvotes: 0

Answers (2)

Progman

Reputation: 19555

It is possible, but it's somewhat ugly and hackish. You can use StAX and count the opening and closing element tags: increase the counter on an opening tag and decrease it on a closing tag. When the counter gets back to 0, you know you have completely read the root element. Use the getLocation() method of the XMLStreamReader to see how far you have read, especially its getCharacterOffset() method. With that offset into your original source/stream, you can build a new stream that starts at the next XML declaration. As a proof of concept, see the following code:

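import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

String content = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"+
            "<foobar>\n"+
            "    <bla />\n"+
            "</foobar>\n"+
            "<?xml version=\"1.0\" encoding=\"ASCII\" ?>\n"+
            "<first with=\"attributes\">\n"+
            "   <second>\n"+
            "       <third />\n"+
            "   </second>\n"+
            "</first>";
XMLInputFactory factory = XMLInputFactory.newFactory();
// use an explicit charset instead of the platform default
InputStream stream = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
XMLStreamReader xmlReader = factory.createXMLStreamReader(stream);
int nestingCounter = 0;
int characterOffset = 0;
while(xmlReader.hasNext()) {
    int event = xmlReader.next();
    // remember how far the parser has read in the original content
    characterOffset = xmlReader.getLocation().getCharacterOffset();
    if (event == XMLStreamConstants.START_ELEMENT) {
        nestingCounter++;
    }
    if (event == XMLStreamConstants.END_ELEMENT) {
        nestingCounter--;
    }
    // work with the event/data here
    System.out.println(event);
    // stop only at the end tag of the root element, so that a comment or
    // processing instruction before the root element does not end the loop early
    if (event == XMLStreamConstants.END_ELEMENT && nestingCounter == 0) {
        break;
    }
}

System.out.println("Second XML");

// build a new stream that starts at the next XML declaration
content = content.substring(characterOffset).trim();
xmlReader = factory.createXMLStreamReader(new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));

// now, again... (nestingCounter is back at 0 here)
while(xmlReader.hasNext()) {
    int event = xmlReader.next();
    if (event == XMLStreamConstants.START_ELEMENT) {
        nestingCounter++;
    }
    if (event == XMLStreamConstants.END_ELEMENT) {
        nestingCounter--;
    }
    // work with the event/data here
    System.out.println(event);
    if (event == XMLStreamConstants.END_ELEMENT && nestingCounter == 0) {
        break;
    }
}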

This will generate the following output (and not throw an exception):

1
4
1
2
4
2
Second XML
1
4
1
4
1
2
4
2
4
2

Obviously you should use a proper loop and close the streams and readers; this is only a proof of concept. Also, you might run into problems when there is other content between the closing tag of the previous root element and the next XML declaration, because such content (comments, processing instructions, whitespace) is allowed at the end of an XML document, but not before the XML declaration at its beginning:

2.1 Well-Formed XML Documents
document     ::=      prolog element Misc*
2.8 Prolog and Document Type Declaration
Misc         ::=      Comment | PI | S
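
The "proper loop" mentioned above could look roughly like the following. This is only a sketch under a few assumptions: the whole input is already available as a String (a real stream would need its own offset bookkeeping), the class name CountingSplitter and the method rootNames are made up for illustration, and the StAX implementation reports a character offset directly after the root's end tag, as the JDK's built-in one did in the proof of concept.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class CountingSplitter {

    // Parses concatenated XML documents one after another and returns
    // the name of each root element, in order of appearance.
    static List<String> rootNames(String content) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        List<String> roots = new ArrayList<>();
        String rest = content.trim();
        while (!rest.isEmpty()) {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rest));
            int depth = 0;
            int offset = -1;
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        if (depth == 0) {
                            roots.add(reader.getLocalName());
                        }
                        depth++;
                    } else if (event == XMLStreamConstants.END_ELEMENT && --depth == 0) {
                        // Root element is complete: remember where the parser stopped
                        // and do not read any further, so the next declaration is never seen.
                        offset = reader.getLocation().getCharacterOffset();
                        break;
                    }
                    // work with the event/data here
                }
            } finally {
                reader.close();
            }
            if (offset < 0) {
                break; // no complete root element was found
            }
            // Continue with whatever follows the root element just read.
            rest = rest.substring(offset).trim();
        }
        return roots;
    }

    public static void main(String[] args) throws XMLStreamException {
        String content = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<message id=\"1\">\n    <text>aaaaaaa</text>\n</message>\n"
                + "<kuku>\n    bbbbb\n</kuku>\n"
                + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<other_message id=\"3\">\n    <text>ccccc</text>\n</other_message>\n";
        System.out.println(rootNames(content));
    }
}
```

Because each reader is abandoned as soon as its root element ends, trailing comments or processing instructions of one document end up at the start of the next chunk, which is exactly the problem described above.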

Upvotes: 1

Michael Kay

Reputation: 163448

There is no way of doing this 100% reliably. You can't do it using an XML parser because it will report an error when it sees the second XML declaration, and there's no way of recovering from that error. So you have to do it using your own "pre-parsing", and there is always a risk that your pre-parsing will recognize something that looks like an XML declaration, but isn't, because (for example) it's within the body of an XML comment or CDATA section. But that's probably the best you can do. The elegant way of doing it is probably to write an implementation of InputStream that delivers an Iterable sequence of InputStreams, and then loop over this iterable passing each one to the XML parser in turn. Alternatively your suggestion of filtering out the XML declarations (and adding an outer wrapping start tag and end tag) would also work.
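
The second alternative (filtering out the XML declarations and adding an outer wrapper element) might be sketched as follows. The assumptions are mine: the input fits in a String, a simple regular expression is good enough to find the declarations, and the names WrappingSplitter, rootNames and wrapper are invented for this example. As said above, the regular expression would also remove anything that merely looks like an XML declaration inside a comment or CDATA section.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class WrappingSplitter {

    // Strips all XML declarations, wraps the remainder in one synthetic root
    // element and returns the names of the original root elements, which are
    // now the direct children of that wrapper.
    static List<String> rootNames(String content) throws XMLStreamException {
        // "<?xml" followed by whitespace, so "<?xml-stylesheet ...?>" is left alone.
        String stripped = content.replaceAll("<\\?xml\\s[^?]*\\?>", "");
        String wrapped = "<wrapper>" + stripped + "</wrapper>";

        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(wrapped));
        List<String> roots = new ArrayList<>();
        int depth = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        // Depth 1 is the wrapper; depth 2 is an original document root.
                        roots.add(reader.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
                // work with the event/data here
            }
        } finally {
            reader.close();
        }
        return roots;
    }

    public static void main(String[] args) throws XMLStreamException {
        String content = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<message id=\"1\">\n    <text>aaaaaaa</text>\n</message>\n"
                + "<kuku>\n    bbbbb\n</kuku>\n"
                + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<other_message id=\"3\">\n    <text>ccccc</text>\n</other_message>\n";
        System.out.println(rootNames(content));
    }
}
```

For a genuinely unbounded stream the same idea would need a filtering Reader in place of replaceAll, and documents that carry their own DOCTYPE would still break the wrapped parse.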

Better still, convince the person who supplied this data that they're doing it wrong.

Upvotes: 2
