MicroDev92

Reputation: 169

How to set MemoryStream position based on IndexOf, to split apart a sequence of XML documents?

I have a pseudo-XML file containing 5 small XML documents, like so:

XML document fragment stream

I am trying to split these apart and write each XML document to its own file, using a MemoryStream with this code:

int flag = 0;

byte[] arr = Encoding.ASCII.GetBytes(File.ReadAllText(@"C:\Users\Aleksa\Desktop\testTxt.xml"));

for (int i = 0; i <= 5; i++)
{
    MemoryStream mem = new MemoryStream(arr);
    mem.Position = flag;
    StreamReader rdr = new StreamReader(mem);

    string st = rdr.ReadToEnd();

    if (st.IndexOf("<TestNode") != -1 && (st.IndexOf("</TestNode>") != -1 || st.IndexOf("/>") != -1))
    {
        int curr = st.IndexOf("<TestNode");
        int end = st.IndexOf("\r");
        string toWrite = st.Substring(st.IndexOf("<TestNode"), end);
        File.WriteAllText(@"C:\Users\Aleksa\Desktop\" + i.ToString() + ".xml", toWrite);
        flag += end;
    }
    Console.WriteLine(st);
}

The first XML document from the image is split out correctly, but the remaining files are empty. While debugging I noticed that even though I set the position from the end variable, the stream still reads from the top. In every iteration after the first, the end variable is also equal to zero!

I have tried changing the IndexOf parameter to </TestNode> + 11, which behaves like the code above except that the remaining files are no longer empty but are incomplete, leaving me with <TestNode a. How can I fix the logic here and split my stream of XML documents apart?

Upvotes: 1

Views: 391

Answers (1)

dbc

Reputation: 117284

Your input stream consists of XML document fragments -- i.e. a series of XML root elements concatenated together.

You can read such a stream by using an XmlReader created with XmlReaderSettings.ConformanceLevel == ConformanceLevel.Fragment. From the docs:

Fragment

Ensures that the XML data conforms to the rules for a well-formed XML 1.0 document fragment.

This setting accepts XML data with multiple root elements, or text nodes at the top-level.
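
As a minimal sketch of what this setting enables (the class FragmentDemo and the method RootNames are hypothetical names used only for illustration), the following reads a string holding two concatenated roots and collects the name of each top-level element. With the default ConformanceLevel.Document, the same input would throw an XmlException at the second root.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical demo class, not part of the extension methods in this answer.
public static class FragmentDemo
{
    // Returns the names of the top-level elements found in a fragment string.
    public static List<string> RootNames(string xml)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var names = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                // Elements at depth 0 are the roots of the individual fragments.
                if (reader.NodeType == XmlNodeType.Element && reader.Depth == 0)
                    names.Add(reader.Name);
            }
        }
        return names;
    }

    public static void Main()
    {
        var xml = "<TestNode active=\"1\"><Foo /></TestNode>\r\n<TestNode active=\"2\" />";
        Console.WriteLine(string.Join(",", RootNames(xml)));
    }
}
```

The nested Foo element is skipped because it sits at depth 1, so only the two TestNode roots are collected.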

The following extension methods can be used for this task:

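using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlReaderExtensions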
{
    public static IEnumerable<XmlReader> ReadRoots(this XmlReader reader)
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                using (var subReader = reader.ReadSubtree())
                    yield return subReader;
            }
        }
    }

    public static void SplitDocumentFragments(Stream stream, Func<int, string> makeFileName, Action<string, IXmlLineInfo> onFileWriting, Action<string, IXmlLineInfo> onFileWritten)
    {
        using (var textReader = new StreamReader(stream, Encoding.UTF8, true, 4096, true))
        {
            SplitDocumentFragments(textReader, makeFileName, onFileWriting, onFileWritten);
        }
    }

    public static void SplitDocumentFragments(TextReader textReader, Func<int, string> makeFileName, Action<string, IXmlLineInfo> onFileWriting, Action<string, IXmlLineInfo> onFileWritten)
    {
        if (textReader == null || makeFileName == null)
            throw new ArgumentNullException();
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment, CloseInput = false };
        using (var xmlReader = XmlReader.Create(textReader, settings))
        {
            var lineInfo = xmlReader as IXmlLineInfo;
            var index = 0;

            foreach (var reader in xmlReader.ReadRoots())
            {
                var outputName = makeFileName(index);
                reader.MoveToContent();
                if (onFileWriting != null)
                    onFileWriting(outputName, lineInfo);
                using(var writer = XmlWriter.Create(outputName))
                {
                    writer.WriteNode(reader, true);
                }
                index++;
                if (onFileWritten != null)
                    onFileWritten(outputName, lineInfo);
            }
        }
    }
}

Then you would use it as follows:

var fileName = @"C:\Users\Aleksa\Desktop\testTxt.xml";
var outputPath = ""; // The directory in which to create your XML files.
using (var stream = File.OpenRead(fileName))
{
    XmlReaderExtensions.SplitDocumentFragments(stream,
                                               index => Path.Combine(outputPath, index.ToString() + ".xml"),
                                               (name, lineInfo) => 
                                               {
                                                   Console.WriteLine("Writing {0}, starting line info: LineNumber = {1}, LinePosition = {2}...", 
                                                                     name, lineInfo?.LineNumber, lineInfo?.LinePosition);
                                               },
                                               (name, lineInfo) => 
                                               {
                                                   Console.WriteLine("   Done.  Result: ");
                                                   Console.Write("   ");
                                                   Console.WriteLine(File.ReadAllText(name));
                                               });
}

And the output will look something like:

Writing 0.xml, starting line info: LineNumber = 1, LinePosition = 2...
   Done.  Result: 
   <?xml version="1.0" encoding="utf-8"?><TestNode active="1" lastName="l"><Foo /> </TestNode>
Writing 1.xml, starting line info: LineNumber = 2, LinePosition = 2...
   Done.  Result: 
   <?xml version="1.0" encoding="utf-8"?><TestNode active="2" lastName="l" />
Writing 2.xml, starting line info: LineNumber = 3, LinePosition = 2...
   Done.  Result: 
   <?xml version="1.0" encoding="utf-8"?><TestNode active="3" lastName="l"><Foo />  </TestNode>

... (others omitted).

Notes:

  • The method ReadRoots() reads through all the root elements of the XML fragment stream and, for each one, returns a nested reader restricted to just that specific root, by using XmlReader.ReadSubtree():

    Returns a new XmlReader instance that can be used to read the current node, and all its descendants. ... When the new XML reader has been closed, the original reader is positioned on the EndElement node of the sub-tree.

    This allows callers of the method to parse each root individually without worrying about reading past the end of the root and into the next one. Then the contents of each root node can be copied to an output XmlWriter using XmlWriter.WriteNode(XmlReader, true).

  • You can track approximate position in the file using the IXmlLineInfo interface which is implemented by XmlReader subclasses that parse text streams. If your document fragment stream is truncated for some reason, this can help identify where the error occurs.

    See: getting the current position from an XmlReader and C# how can I debug a deserialization exception? for details.

  • If you are parsing a string st containing your XML fragments rather than reading directly from a file, you can pass a StringReader to SplitDocumentFragments():

    using (var textReader = new StringReader(st))
    {
        XmlReaderExtensions.SplitDocumentFragments(textReader,
            // Remainder as before
    
  • Do not read an XML stream using Encoding.ASCII; any character outside the ASCII range is replaced with ? and lost. Instead, use Encoding.UTF8 and/or detect the encoding from the BOM or XML declaration.
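
    To illustrate the encoding note above, here is a small sketch (the class AsciiLossDemo, the method RoundTrip and the sample attribute value are hypothetical, chosen only for illustration) that encodes a string to bytes and decodes it again with a given encoding. With Encoding.ASCII the non-ASCII character comes back as ?, while Encoding.UTF8 preserves it.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the extension methods in this answer.
public static class AsciiLossDemo
{
    // Encodes text to bytes with the given encoding, then decodes it back.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        var xml = "<TestNode lastName=\"M\u00FCller\" />";
        Console.WriteLine(RoundTrip(xml, Encoding.ASCII)); // the u-umlaut does not survive
        Console.WriteLine(RoundTrip(xml, Encoding.UTF8));  // round-trips unchanged
    }
}
```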

Demo fiddle here.

Upvotes: 2
