Reputation: 169
I have a pseudo-XML file with 5 small XML documents in it, like so:
I am trying to split these apart and create a new file for each XML document using MemoryStream,
with this code:
int flag = 0;
byte[] arr = Encoding.ASCII.GetBytes(File.ReadAllText(@"C:\\Users\\Aleksa\\Desktop\\testTxt.xml"));
for (int i = 0; i <= 5; i++)
{
    MemoryStream mem = new MemoryStream(arr);
    mem.Position = flag;
    StreamReader rdr = new StreamReader(mem);
    string st = rdr.ReadToEnd();
    if (st.IndexOf("<TestNode") != -1 && (st.IndexOf("</TestNode>") != -1 || st.IndexOf("/>") != -1))
    {
        int curr = st.IndexOf("<TestNode");
        int end = st.IndexOf("\r");
        string toWrite = st.Substring(st.IndexOf("<TestNode"), end);
        File.WriteAllText(@"C:\\Users\\Aleksa\\Desktop\\" + i.ToString() + ".xml", toWrite);
        flag += end;
    }
    Console.WriteLine(st);
}
The first XML from the image is separated correctly, but the rest are empty files. While debugging, I noticed that even though I set the position to the end variable, it still streams from the top. Also, on every iteration after the first, the end variable is zero!
I have tried changing the IndexOf parameter to </TestNode> + 11, which does the same as the code above, except that the remaining files are not empty but incomplete, leaving me with <TestNode a. How can I fix the logic here and split my stream of XML document(s) apart?
Upvotes: 1
Views: 391
Reputation: 117284
Your input stream consists of XML document fragments -- i.e. a series of XML root elements concatenated together.
You can read such a stream by using an XmlReader created with XmlReaderSettings.ConformanceLevel == ConformanceLevel.Fragment. From the docs:
Fragment
Ensures that the XML data conforms to the rules for a well-formed XML 1.0 document fragment.
This setting accepts XML data with multiple root elements, or text nodes at the top-level.
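To make the quoted setting concrete, here is a minimal sketch that only lists the names of the top-level elements in a fragment string. The class name FragmentDemo and the method RootNames are hypothetical, made up for this illustration. With the default conformance level (Document), the same input would fail with an XmlException as soon as the second root is reached.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class FragmentDemo
{
    // Returns the names of the top-level (depth 0) elements in a fragment string.
    public static List<string> RootNames(string fragments)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var names = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(fragments), settings))
        {
            while (reader.Read())
            {
                // Nested elements have Depth > 0 and are skipped here.
                if (reader.NodeType == XmlNodeType.Element && reader.Depth == 0)
                    names.Add(reader.Name);
            }
        }
        return names;
    }
}
```

Whitespace between the roots is reported as separate top-level nodes, which is why the loop filters on XmlNodeType.Element.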
The following extension methods can be used for this task:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlReaderExtensions
{
    // Enumerates every root element as a nested reader restricted to that element.
    public static IEnumerable<XmlReader> ReadRoots(this XmlReader reader)
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                using (var subReader = reader.ReadSubtree())
                    yield return subReader;
            }
        }
    }

    public static void SplitDocumentFragments(Stream stream, Func<int, string> makeFileName, Action<string, IXmlLineInfo> onFileWriting, Action<string, IXmlLineInfo> onFileWritten)
    {
        // Detect the encoding from the BOM (defaulting to UTF-8) and leave the caller's stream open.
        using (var textReader = new StreamReader(stream, Encoding.UTF8, true, 4096, true))
        {
            SplitDocumentFragments(textReader, makeFileName, onFileWriting, onFileWritten);
        }
    }

    public static void SplitDocumentFragments(TextReader textReader, Func<int, string> makeFileName, Action<string, IXmlLineInfo> onFileWriting, Action<string, IXmlLineInfo> onFileWritten)
    {
        if (textReader == null || makeFileName == null)
            throw new ArgumentNullException();
        // Fragment conformance allows multiple root elements in one input.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment, CloseInput = false };
        using (var xmlReader = XmlReader.Create(textReader, settings))
        {
            var lineInfo = xmlReader as IXmlLineInfo;
            var index = 0;
            foreach (var reader in xmlReader.ReadRoots())
            {
                var outputName = makeFileName(index);
                reader.MoveToContent();
                if (onFileWriting != null)
                    onFileWriting(outputName, lineInfo);
                // Copy the current root, including its attributes and descendants, to its own file.
                using (var writer = XmlWriter.Create(outputName))
                {
                    writer.WriteNode(reader, true);
                }
                index++;
                if (onFileWritten != null)
                    onFileWritten(outputName, lineInfo);
            }
        }
    }
}
Then you would use it as follows:
var fileName = @"C:\Users\Aleksa\Desktop\testTxt.xml"; // Verbatim string: single backslashes.
var outputPath = ""; // The directory in which to create your XML files.
using (var stream = File.OpenRead(fileName))
{
    XmlReaderExtensions.SplitDocumentFragments(stream,
        index => Path.Combine(outputPath, index.ToString() + ".xml"),
        (name, lineInfo) =>
        {
            Console.WriteLine("Writing {0}, starting line info: LineNumber = {1}, LinePosition = {2}...",
                name, lineInfo?.LineNumber, lineInfo?.LinePosition);
        },
        (name, lineInfo) =>
        {
            Console.WriteLine(" Done. Result: ");
            Console.Write(" ");
            Console.WriteLine(File.ReadAllText(name));
        });
}
And the output will look something like:
Writing 0.xml, starting line info: LineNumber = 1, LinePosition = 2...
 Done. Result: 
 <?xml version="1.0" encoding="utf-8"?><TestNode active="1" lastName="l"><Foo /> </TestNode>
Writing 1.xml, starting line info: LineNumber = 2, LinePosition = 2...
 Done. Result: 
 <?xml version="1.0" encoding="utf-8"?><TestNode active="2" lastName="l" />
Writing 2.xml, starting line info: LineNumber = 3, LinePosition = 2...
 Done. Result: 
 <?xml version="1.0" encoding="utf-8"?><TestNode active="3" lastName="l"><Foo /> </TestNode>
... (others omitted).
Notes:
The method ReadRoots() reads through all the root elements of the XML fragment stream and, for each one, returns a nested reader restricted to just that root by using XmlReader.ReadSubtree():
Returns a new XmlReader instance that can be used to read the current node, and all its descendants. ... When the new XML reader has been closed, the original reader is positioned on the EndElement node of the sub-tree.
This allows callers of the method to parse each root individually without worrying about reading past the end of the root and into the next one. Then the contents of each root node can be copied to an output XmlWriter using XmlWriter.WriteNode(XmlReader, true).
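The ReadSubtree() and WriteNode() combination described above can also be tried without touching the file system. The following is a rough sketch, not part of the original answer: InMemorySplitter and SplitToStrings are hypothetical names, and the XML declaration is omitted (OmitXmlDeclaration) only to keep the returned strings short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class InMemorySplitter
{
    // Splits a string of concatenated root elements into one string per root.
    public static List<string> SplitToStrings(string fragments)
    {
        var readerSettings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var writerSettings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var results = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(fragments), readerSettings))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element)
                    continue;
                // The nested reader cannot read past the end of the current root.
                using (var subReader = reader.ReadSubtree())
                {
                    var sb = new StringBuilder();
                    using (var writer = XmlWriter.Create(sb, writerSettings))
                    {
                        subReader.MoveToContent();
                        writer.WriteNode(subReader, true);
                    }
                    results.Add(sb.ToString());
                }
                // Disposing subReader leaves the outer reader at the end of this root,
                // so the next Read() call moves on to whatever follows it.
            }
        }
        return results;
    }
}
```

Each returned string is a complete, standalone element, which makes it easy to check the splitting logic in a unit test before writing any files.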
You can track approximate position in the file using the IXmlLineInfo interface which is implemented by XmlReader subclasses that parse text streams. If your document fragment stream is truncated for some reason, this can help identify where the error occurs.
See: getting the current position from an XmlReader and C# how can I debug a deserialization exception? for details.
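As a sketch of how that position tracking might be used on a truncated stream: the helper below (TruncationProbe and FindFirstError are hypothetical names) reads through the whole input, remembers the line on which the last root element started via IXmlLineInfo, and reports the position carried by the XmlException if parsing fails. It returns an empty string when the input parses cleanly.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class TruncationProbe
{
    // Returns "" if the fragment stream parses cleanly, otherwise a description of the failure.
    public static string FindFirstError(string fragments)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var lastRootLine = 0;
        try
        {
            using (var reader = XmlReader.Create(new StringReader(fragments), settings))
            {
                // Text-based readers implement IXmlLineInfo; the cast yields null otherwise.
                var lineInfo = reader as IXmlLineInfo;
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Depth == 0
                        && lineInfo != null && lineInfo.HasLineInfo())
                    {
                        lastRootLine = lineInfo.LineNumber;
                    }
                }
            }
            return "";
        }
        catch (XmlException ex)
        {
            return $"Last root started on line {lastRootLine}; error at line {ex.LineNumber}, " +
                   $"position {ex.LinePosition}: {ex.Message}";
        }
    }
}
```

Calling FindFirstError on input that ends in an unfinished tag such as <TestNode a should produce a non-empty message pointing near the end of the input.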
If you are parsing a string st containing your XML fragments rather than reading directly from a file, you can pass a StringReader to SplitDocumentFragments():
using (var textReader = new StringReader(st))
{
XmlReaderExtensions.SplitDocumentFragments(textReader,
// Remainder as before
Do not read an XML stream using Encoding.ASCII. It is lossy: every non-ASCII character in the file is replaced with '?'. Instead, use Encoding.UTF8 and/or detect the encoding from the BOM or XML declaration.
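A quick way to see the data loss is to round-trip a string through each encoding. This is a small sketch with hypothetical names (EncodingDemo, RoundTrip); the sample text is arbitrary.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncodingDemo
{
    // Encodes the text to bytes and decodes it again with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }
}
```

RoundTrip("\u017Darko", Encoding.ASCII) returns "?arko", because the first character (Ž) has no ASCII representation, while the same call with Encoding.UTF8 returns the original string unchanged.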
Demo fiddle here.
Upvotes: 2