Juha Syrjälä

Reputation: 34261

How to efficiently parse concatenated XML documents from a file

I have a file that consists of concatenated valid XML documents. I'd like to separate individual XML documents efficiently.

The contents of the concatenated file look like this, so the concatenated file is not itself a well-formed XML document.

<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>

Each individual XML document is around 1-4 KB, but there are potentially a few hundred of them. All XML documents conform to the same XML Schema.

Any suggestions or tools? I am working in the Java environment.

Edit: I am not sure if the xml-declaration will be present in documents or not.

Edit: Let's assume that the encoding for all the xml docs is UTF-8.

Upvotes: 9

Views: 2425

Answers (5)

Nadav Hury

Reputation: 679

This is my answer for the C# version. It is very ugly code, but it works :-\

public List<T> ParseMultipleDocumentsByType<T>(string documents)
    {
        var cleanParsedDocuments = new List<T>();
        // Assumes every document starts with an XML declaration and that the
        // root element is named after T and never nested inside itself.
        var endingString = "</" + typeof(T).Name + ">";
        var position = 0;
        while (true)
        {
            // Find the next declaration; stop when there are no more documents.
            var startingPoint = documents.IndexOf("<?xml", position);
            if (startingPoint < 0) break;
            // Search for the closing root tag only after this declaration.
            var endingIndex = documents.IndexOf(endingString, startingPoint);
            if (endingIndex < 0) break;
            var endingPoint = endingIndex + endingString.Length;
            var document = documents.Substring(startingPoint, endingPoint - startingPoint);
            cleanParsedDocuments.Add((T)XmlDeserializeFromString(document, typeof(T)));
            // Continue scanning after the document just parsed.
            position = endingPoint;
        }

        return cleanParsedDocuments;
    }

    public static object XmlDeserializeFromString(string objectData, Type type)
    {
        var serializer = new XmlSerializer(type);
        object result;

        using (TextReader reader = new StringReader(objectData))
        {
            result = serializer.Deserialize(reader);
        }

        return result;
    }

Upvotes: 1

Ferruccio

Reputation: 100638

I don't have a Java answer, but here's how I solved this problem with C#.

I created a class named XmlFileStreams to scan the source document for the XML document declaration and break it up logically into multiple documents:

class XmlFileStreams {

    List<int> positions = new List<int>();
    byte[] bytes;

    public XmlFileStreams(string filename) {
        bytes = File.ReadAllBytes(filename);
        for (int pos = 0; pos < bytes.Length - 5; ++pos)
            if (bytes[pos] == '<' && bytes[pos + 1] == '?' && bytes[pos + 2] == 'x' && bytes[pos + 3] == 'm' && bytes[pos + 4] == 'l')
                positions.Add(pos);
        positions.Add(bytes.Length);
    }

    public IEnumerable<Stream> Streams {
        get {
            if (positions.Count > 1)
                for (int i = 0; i < positions.Count - 1; ++i)
                    yield return new MemoryStream(bytes, positions[i], positions[i + 1] - positions[i]);
        }
    }

}

To use XmlFileStreams:

foreach (Stream stream in new XmlFileStreams(@"c:\tmp\test.xml").Streams) {
    using (var xr = XmlReader.Create(stream, new XmlReaderSettings() { XmlResolver = null, ProhibitDtd = false })) {
        // parse file using xr
    }
}

There are a couple of caveats.

  1. It reads the entire file into memory for processing. This could be a problem if the file is really big.
  2. It uses a simple brute force search to look for the XML document boundaries.
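
Since the question asks about Java, here is a minimal sketch of the same boundary scan in Java. It is not a port of the class above: the class name DeclarationSplitter and its split method are hypothetical, it works on an already decoded String rather than on bytes, and it assumes every document starts with a declaration and that the text "<?xml " never occurs inside comments or CDATA sections. Anything before the first declaration is ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits concatenated XML text at each "<?xml " declaration. */
final class DeclarationSplitter {

    // The trailing space avoids matching processing instructions such as <?xml-stylesheet ...?>.
    private static final String DECL = "<?xml ";

    /** Returns one string per document, each starting with its own declaration. */
    static List<String> split(String concatenated) {
        List<String> docs = new ArrayList<>();
        int start = concatenated.indexOf(DECL);
        while (start >= 0) {
            // The next declaration (if any) marks the end of the current document.
            int next = concatenated.indexOf(DECL, start + DECL.length());
            int end = (next >= 0) ? next : concatenated.length();
            docs.add(concatenated.substring(start, end).trim());
            start = next;
        }
        return docs;
    }
}
```

The input string could come from something like new String(Files.readAllBytes(Paths.get("concatenated.xml")), StandardCharsets.UTF_8), where the file name is only a placeholder. Like the C# version, this holds the whole file in memory, which should be fine for a few hundred documents of 1-4 KB each.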

Upvotes: 0

Eamon Nerbonne

Reputation: 48066

Since you're not sure the declaration will always be present, you can strip all declarations (a non-greedy regex such as <\?xml version.*?\?> can find these; a greedy .* would swallow everything between the first and last declaration on a line), prepend <doc-collection>, append </doc-collection>, such that the resultant string will be a valid xml document. In it, you can retrieve the separate documents using (for instance) the XPath query /doc-collection/*. If the combined file can be large enough that memory consumption becomes an issue, you may need to use a streaming parser such as SAX, but the principle remains the same.

In a similar scenario which I encountered, I simply read the concatenated document directly using an xml-parser: Although the concatenated file may not be a valid xml document, it is a valid xml fragment (barring the repeated declarations) - so, once you strip the declarations, if your parser supports parsing fragments, then you can also just read the result directly. All top-level elements will then be the root elements of the concatenated documents.

In short, if you strip all declarations, you'll have a valid xml fragment which is trivially parseable either directly or by surrounding it with some tag.
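
As a rough Java sketch of the strip-and-wrap idea, using only the JDK's built-in DOM and XPath APIs: the class name WrappedCollectionParser and the method roots are made up for illustration, and the code assumes the whole file fits in memory as a String and that none of the documents carries a DOCTYPE (which would not be allowed inside the wrapper element).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: strips declarations, wraps everything in one root, then parses. */
final class WrappedCollectionParser {

    // Non-greedy and DOTALL, so each declaration is matched on its own,
    // even when several documents share a line.
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml\\s.*?\\?>", Pattern.DOTALL);

    /** Returns the root element of every concatenated document, in file order. */
    static List<Element> roots(String concatenated) {
        String body = XML_DECL.matcher(concatenated).replaceAll("");
        String wrapped = "<doc-collection>" + body + "</doc-collection>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            // Every child element of the wrapper was the root of one original document.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/doc-collection/*", doc, XPathConstants.NODESET);
            List<Element> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add((Element) nodes.item(i));
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Input is not a sequence of well-formed XML documents", e);
        }
    }
}
```

Parsing from a String sidesteps the encoding attribute in the removed declarations, since the text has already been decoded by the time the parser sees it. For files too large for a DOM, the same wrapped input could be fed to a SAX or StAX parser instead.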

Upvotes: 3

Jay

Reputation: 27464

As Eamon says, if you know the <?xml> thing will always be there, just break on that.

Failing that, look for the ending document-level tag. That is, scan the text counting how many levels deep you are. Every time you see a tag that begins with "<" but not "</" and that does not end with "/>", add 1 to the depth count. Every time you see a tag that begins "</", subtract 1. Every time you subtract 1, check if you are now at zero. If so, you've reached the end of an XML document.
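
The depth counting described above could be sketched in Java roughly as follows. DepthSplitter is a hypothetical name, and the sketch goes slightly beyond the description: it skips declarations, processing instructions, comments and CDATA sections so that a "<" inside them is not counted, and it treats a self-closing root such as <empty/> as a complete document. It assumes attribute values contain no unescaped ">" and that any DOCTYPE has no internal subset. It is a plain text scan, not a real XML parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits concatenated XML by tracking element nesting depth. */
final class DepthSplitter {

    /** Returns the text of each top-level element (without any preceding declaration). */
    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int depth = 0;      // current element nesting level
        int docStart = -1;  // index where the current root element began
        int i = 0;
        while (i < text.length()) {
            int lt = text.indexOf('<', i);
            if (lt < 0) break;
            if (text.startsWith("<?", lt)) {                 // declaration or processing instruction
                i = skipPast(text, "?>", lt + 2);
            } else if (text.startsWith("<!--", lt)) {        // comment
                i = skipPast(text, "-->", lt + 4);
            } else if (text.startsWith("<![CDATA[", lt)) {   // CDATA section
                i = skipPast(text, "]]>", lt + 9);
            } else if (text.startsWith("<!", lt)) {          // simple DOCTYPE
                i = skipPast(text, ">", lt + 2);
            } else {
                int gt = text.indexOf('>', lt);
                if (gt < 0) break;                           // truncated tag: stop scanning
                boolean closing = text.startsWith("</", lt);
                boolean selfClosing = text.charAt(gt - 1) == '/';
                if (closing) {
                    depth--;
                } else {
                    if (depth == 0) docStart = lt;           // a new root element starts here
                    if (!selfClosing) depth++;
                }
                if (depth == 0 && docStart >= 0) {           // back at top level: document complete
                    docs.add(text.substring(docStart, gt + 1));
                    docStart = -1;
                }
                i = gt + 1;
            }
        }
        return docs;
    }

    // Returns the index just after the terminator, or the end of the text if it is missing.
    private static int skipPast(String text, String terminator, int from) {
        int end = text.indexOf(terminator, from);
        return end < 0 ? text.length() : end + terminator.length();
    }
}
```

Because this does not rely on the declaration at all, it works whether or not each document has one. Each returned string can then be handed to an ordinary XML parser.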

Upvotes: 3

Wim ten Brink

Reputation: 26682

Don't split! Add one big tag around it! Then it becomes one XML file again:

<BIGTAG>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
</BIGTAG>

Now, using /BIGTAG/someData (XPath is case-sensitive) would give you all the XML roots.


The <?xml ...?> declarations will be in the way, because a declaration is only allowed at the very start of a document, but you can always use a RegEx to remove them. It's easier to just remove all processing instructions than to use a RegEx to find all root nodes. If the declared encoding differs between documents, remember this: the whole file must itself have been written in some encoding, so all the XML documents it contains use that same encoding, no matter what each header tells you. If the big file is encoded as UTF-16, it doesn't matter that the XML processing instructions say the XML itself is UTF-8. It won't be UTF-8, since the whole file is UTF-16. The encoding in those XML processing instructions is therefore invalid.

By merging them into one file, you've altered the encoding...


By RegEx, I mean regular expressions. You just have to remove all text that's between a <? and a ?> which should not be too difficult with a regular expression and slightly more complicated if you're trying other string manipulation techniques.

Upvotes: 4
