Reputation: 34261
I have a file that consists of concatenated valid XML documents. I'd like to separate individual XML documents efficiently.
The contents of the concatenated file look like this, so the file is not itself a valid XML document:
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
Each individual XML document is around 1-4 KB, but there are potentially a few hundred of them. All XML documents conform to the same XML Schema.
Any suggestions or tools? I am working in the Java environment.
Edit: I am not sure if the xml-declaration will be present in documents or not.
Edit: Let's assume that the encoding for all the xml docs is UTF-8.
Upvotes: 9
Views: 2425
Reputation: 679
This is my answer for the C# version. Very ugly code, but it works :-\
// Requires: using System.Collections.Generic; using System.IO; using System.Xml.Serialization;
public List<T> ParseMultipleDocumentsByType<T>(string documents)
{
    var cleanParsedDocuments = new List<T>();
    // Assumes the root element of every document is named after the type T.
    var endingString = "</" + typeof(T).Name + ">";
    while (true)
    {
        var startingPoint = documents.IndexOf("<?xml");
        var endIndex = documents.IndexOf(endingString);
        // Stop when no further declaration or closing root tag is found.
        if (startingPoint < 0 || endIndex < startingPoint)
            break;
        var endingPoint = endIndex + endingString.Length;
        var document = documents.Substring(startingPoint, endingPoint - startingPoint);
        cleanParsedDocuments.Add((T)XmlDeserializeFromString(document, typeof(T)));
        // Remove the document just parsed so the next iteration finds the following one.
        documents = documents.Remove(startingPoint, endingPoint - startingPoint);
    }
    return cleanParsedDocuments;
}

public static object XmlDeserializeFromString(string objectData, Type type)
{
    var serializer = new XmlSerializer(type);
    using (TextReader reader = new StringReader(objectData))
    {
        return serializer.Deserialize(reader);
    }
}
Upvotes: 1
Reputation: 100638
I don't have a Java answer, but here's how I solved this problem with C#.
I created a class named XmlFileStreams to scan the source document for the XML document declaration and break it up logically into multiple documents:
class XmlFileStreams {
    List<int> positions = new List<int>();
    byte[] bytes;

    public XmlFileStreams(string filename) {
        bytes = File.ReadAllBytes(filename);
        for (int pos = 0; pos < bytes.Length - 5; ++pos)
            if (bytes[pos] == '<' && bytes[pos + 1] == '?' && bytes[pos + 2] == 'x' && bytes[pos + 3] == 'm' && bytes[pos + 4] == 'l')
                positions.Add(pos);
        positions.Add(bytes.Length);
    }

    public IEnumerable<Stream> Streams {
        get {
            if (positions.Count > 1)
                for (int i = 0; i < positions.Count - 1; ++i)
                    yield return new MemoryStream(bytes, positions[i], positions[i + 1] - positions[i]);
        }
    }
}
To use XmlFileStreams:
foreach (Stream stream in new XmlFileStreams(@"c:\tmp\test.xml").Streams) {
    using (var xr = XmlReader.Create(stream, new XmlReaderSettings() { XmlResolver = null, ProhibitDtd = false })) {
        // parse file using xr
    }
}
There are a couple of caveats.
Upvotes: 0
Reputation: 48066
Since you're not sure the declaration will always be present, you can strip all declarations (a regex such as <\?xml\s[^>]*\?> can find these), prepend <doc-collection>, and append </doc-collection>, so that the resulting string is a valid XML document. In it, you can retrieve the separate documents using (for instance) the XPath query /doc-collection/*. If the combined file can be large enough that memory consumption becomes an issue, you may need to use a streaming parser such as SAX, but the principle remains the same.
In a similar scenario, I simply read the concatenated document directly with an XML parser. Although the concatenated file is not a valid XML document, it is a valid XML fragment (barring the repeated declarations), so once you strip the declarations you can read the result directly if your parser supports parsing fragments. All top-level elements will then be the root elements of the concatenated documents.
In short, if you strip all declarations, you'll have a valid xml fragment which is trivially parseable either directly or by surrounding it with some tag.
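Since the question is about Java, here is a minimal sketch of the strip-and-wrap idea using the JDK's built-in DOM parser. It assumes the whole file fits in memory as a String, that none of the documents carries a DOCTYPE (which would not be allowed inside the wrapper element), and that no document contains text that looks like a declaration. The class and method names (WrappedParse, parseAll) are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WrappedParse {
    // Returns the root element of each concatenated document, in file order.
    static List<Element> parseAll(String concatenated) {
        // Drop every XML declaration; [^>]* keeps each match inside one declaration.
        String stripped = concatenated.replaceAll("<\\?xml\\s[^>]*\\?>", "");
        // "doc-collection" is the wrapper name used in the answer; any unused name works.
        String wrapped = "<doc-collection>" + stripped + "</doc-collection>";

        Element root;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            root = builder.parse(new InputSource(new StringReader(wrapped))).getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse wrapped input", e);
        }

        // Every element child of the wrapper is the root of one original document;
        // whitespace text nodes between documents are skipped.
        List<Element> docs = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                docs.add((Element) child);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        String input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>a</someData>\n"
                     + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>b</someData>\n";
        for (Element doc : parseAll(input)) {
            System.out.println(doc.getTagName() + ": " + doc.getTextContent());
        }
    }
}
```

For a few hundred documents of 1-4 KB each, building one DOM like this should be cheap. For much larger inputs, the same wrapping could be fed to a SAX or StAX parser instead.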
Upvotes: 3
Reputation: 27464
As Eamon says, if you know the <?xml> thing will always be there, just break on that.
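Breaking on the declaration could look roughly like the following Java sketch. It assumes every document starts with a declaration, that the file has already been read into a String, and that the text "<?xml " never appears inside a comment or CDATA section. DeclarationSplitter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class DeclarationSplitter {
    // Cuts the text at every "<?xml " marker. The trailing space keeps
    // processing instructions such as <?xml-stylesheet ...?> from matching.
    static List<String> split(String concatenated) {
        List<String> docs = new ArrayList<>();
        String marker = "<?xml ";
        int start = concatenated.indexOf(marker);
        while (start >= 0) {
            int next = concatenated.indexOf(marker, start + marker.length());
            String doc = next >= 0
                    ? concatenated.substring(start, next)
                    : concatenated.substring(start);
            docs.add(doc.trim());
            start = next;
        }
        return docs;
    }

    public static void main(String[] args) {
        String input = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>a</someData>\n"
                     + "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<someData>b</someData>\n";
        for (String doc : split(input)) {
            System.out.println("--- document ---");
            System.out.println(doc);
        }
    }
}
```

Each returned string is a complete document, declaration included, so it can be handed to any parser unchanged.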
Failing that, look for the ending document-level tag. That is, scan the text counting how many levels deep you are. Every time you see a tag that begins with "<" but not "</" and that does not end with "/>", add 1 to the depth count. Every time you see a tag that begins "</", subtract 1. Every time you subtract 1, check if you are now at zero. If so, you've reached the end of an XML document.
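The depth-counting scan can be sketched in Java as below. Beyond what is described above, this sketch also skips comments, CDATA sections, processing instructions and simple DOCTYPE declarations so that a "<" inside them is not counted, and it tolerates ">" inside quoted attribute values. Those additions are my assumptions about what real input needs. It also assumes the input is well-formed and that any DOCTYPE has no internal subset. DepthSplitter and its helpers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class DepthSplitter {
    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int depth = 0;
        int docStart = -1; // where the current document began, -1 if none yet
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) != '<') { // character data or whitespace
                i++;
                continue;
            }
            if (docStart < 0) docStart = i;
            if (text.startsWith("<?", i)) {              // declaration or PI
                i = skipPast(text, "?>", i);
            } else if (text.startsWith("<!--", i)) {     // comment
                i = skipPast(text, "-->", i);
            } else if (text.startsWith("<![CDATA[", i)) { // CDATA section
                i = skipPast(text, "]]>", i);
            } else if (text.startsWith("<!", i)) {       // DOCTYPE, no internal subset
                i = skipPast(text, ">", i);
            } else {
                boolean closing = text.startsWith("</", i);
                int end = endOfTag(text, i);
                boolean selfClosing = end >= 2 && text.charAt(end - 2) == '/';
                if (closing) depth--;
                else if (!selfClosing) depth++;
                i = end;
                if (depth == 0) { // root closed, or the root itself was self-closing
                    docs.add(text.substring(docStart, i));
                    docStart = -1;
                }
            }
        }
        return docs;
    }

    // Index just after the next occurrence of terminator, or end of text.
    static int skipPast(String text, String terminator, int from) {
        int at = text.indexOf(terminator, from);
        return at < 0 ? text.length() : at + terminator.length();
    }

    // Index just after the '>' that ends the tag starting at from,
    // ignoring any '>' inside quoted attribute values.
    static int endOfTag(String text, int from) {
        char quote = 0;
        for (int i = from + 1; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i + 1;
            }
        }
        return text.length();
    }

    public static void main(String[] args) {
        String input = "<someData><a x=\"1>2\"/><b>t</b></someData>\n<someData/>\n";
        for (String doc : split(input)) {
            System.out.println(doc);
        }
    }
}
```

Because this does not rely on the declaration, it works whether or not the documents include one. A declaration or comment that precedes a root element ends up attached to that document.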
Upvotes: 3
Reputation: 26682
Don't split! Add one big tag around it! Then it becomes one XML file again:
<BIGTAG>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
</BIGTAG>
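Now, using /BIGTAG/someData would give you all the XML roots. Note that the inner <?xml ...?> declarations have to be stripped first, because a declaration is only allowed at the very start of a document.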
By merging them into one file, you've altered the encoding...
Upvotes: 4