3x071c

Reputation: 1016

C# XmlReader reads XML incorrectly and differently depending on how I invoke the reader's methods

My current understanding of how the C# XmlReader works is that it takes a given XML file and reads it node by node when I wrap it in the following construct:

using System.Xml;
using System;
using System.Diagnostics;
...
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreComments = true;
settings.IgnoreWhitespace = true;
settings.IgnoreProcessingInstructions = true;
using (XmlReader reader = XmlReader.Create(path, settings))
{
    while (reader.Read())
    {
        // All reader methods I call here will reference the current node
        // until I move the pointer to some further node by calling methods like
        // reader.Read(), reader.MoveToContent(), reader.MoveToElement() etc
    }
}

Why will the following two snippets (within the above construct) produce two very different results, even though they both call the same methods?

I used this example file for testing.

Debug.WriteLine(new string(' ', reader.Depth * 2) + "<" + reader.NodeType.ToString() + "|" + reader.Name + ">" + reader.ReadString() + "</>");

(Snippet 1)
vs
(Snippet 2)

string xmlcontent = reader.ReadString();
string xmlname = reader.Name.ToString();
string xmltype = reader.NodeType.ToString();
int xmldepth = reader.Depth;
Debug.WriteLine(new string(' ', xmldepth * 2) + "<" + xmltype + "|" + xmlname + ">" + xmlcontent + "</>");

Output of Snippet 1:

<XmlDeclaration|xml></>
<Element|rss></>
    <Element|head></>
        <Text|>Test Xml File</>
      <Element|description>This will test my xml reader</>
    <EndElement|head></>
    <Element|body></>
        <Element|g:id>1QBX23</>
        <Element|g:title>Example Title</>
        <Element|g:description>Example Description</>
      <EndElement|item></>
      <Element|item></>
          <Text|>2QXB32</>
        <Element|g:title>Example Title</>
        <Element|g:description>Example Description</>
      <EndElement|item></>
    <EndElement|body></>
  <EndElement|xml></>
<EndElement|rss></>

Yes, this is formatted exactly as it appeared in my output window. As you can see, it skipped certain elements and printed the wrong depth for a few others. The NodeTypes, however, are correct, unlike in Snippet 2, which outputs:

<XmlDeclaration|xml></>
  <Element|xml></>
      <Element|title></>
      <EndElement|title>Test Xml File</>
      <EndElement|description>This will test my xml reader</>
    <EndElement|head></>
      <Element|item></>
        <EndElement|g:id>1QBX23</>
        <EndElement|g:title>Example Title</>
        <EndElement|g:description>Example Description</>
      <EndElement|item></>
        <Element|g:id></>
        <EndElement|g:id>2QXB32</>
        <EndElement|g:title>Example Title</>
        <EndElement|g:description>Example Description</>
      <EndElement|item></>
    <EndElement|body></>
  <EndElement|xml></>
<EndElement|rss></>

Once again, the depth is wrong, though less severely than with Snippet 1. It also skipped some elements and reported wrong NodeTypes.

Why can't it output the expected result? And why do these two snippets produce two totally different outputs with different depths, NodeTypes and skipped nodes?
I'd appreciate any help on this. I searched a lot for answers, but it seems like I'm the only one experiencing these issues. I'm using .NET Framework 4.6.2 with ASP.NET Web Forms in Visual Studio 2017.

Upvotes: 2

Views: 1900

Answers (1)

dbc

Reputation: 116805

Firstly, you are using XmlReader.ReadString(), a method that is deprecated:

XmlReader.ReadString Method

... reads the contents of an element or text node as a string. However, we recommend that you use the ReadElementContentAsString method instead, because it provides a more straightforward way to handle this operation.

However, beyond warning us off the method, the documentation doesn't precisely specify what it actually does. To determine that, we need to go to the reference source:

public virtual string ReadString() {
    if (this.ReadState != ReadState.Interactive) {
        return string.Empty;
    }
    this.MoveToElement();
    if (this.NodeType == XmlNodeType.Element) {
        if (this.IsEmptyElement) {
            return string.Empty;
        }
        else if (!this.Read()) {
            throw new InvalidOperationException(Res.GetString(Res.Xml_InvalidOperation));
        }
        if (this.NodeType == XmlNodeType.EndElement) {
            return string.Empty;
        }
    }
    string result = string.Empty;
    while (IsTextualNode(this.NodeType)) {
        result += this.Value;
        if (!this.Read()) {
            break;
        }
    }
    return result;
}

This method does the following:

  1. If the current node is an empty element node, return an empty string.

  2. If the current node is an element that is not empty, advance the reader.

    If the now-current node is the end of the element, return an empty string.

  3. While the current node is a textual node (text, CDATA, whitespace or significant whitespace), append its value to a string and advance the reader. As soon as the current node is not a textual node, or the input ends, return the accumulated string.

Thus we can see that this method is designed to advance the reader. We can also see that, given mixed-content XML like <head>text <b>BOLD</b> more text</head>, ReadString() will only partially read the <head> element, leaving the reader positioned on <b>. This oddity is likely why Microsoft deprecated the method.
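The partial read can be reproduced with a small sketch. The class name ReadStringDemo and the helper Run are made up for illustration; the XML is the mixed-content fragment from the paragraph above, loaded from a string rather than a file:

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReadStringDemo
{
    // Returns the text ReadString() collected and the node the reader stopped on.
    public static string Run()
    {
        const string xml = "<head>text <b>BOLD</b> more text</head>";
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();            // position the reader on <head>
            string text = reader.ReadString(); // collects only the leading text node
            return "[" + text + "] then " + reader.NodeType + " " + reader.Name;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

On the runtimes I would expect this to be used with, the returned string shows that only the text before <b> was collected and that the reader stopped on the <b> element, so " more text" has not been read yet.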

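We can also see why your two snippets behave differently. In the first, C# evaluates the operands of the string concatenation from left to right, so reader.Depth, reader.NodeType and reader.Name are read before ReadString() is called and advances the reader. In the second, you call ReadString() first and read these properties afterwards, so they describe the node the reader ended up on rather than the node it started from.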
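A minimal sketch of that difference, assuming a single <title> element as input (the class name EvaluationOrderDemo and the helper Run are illustrative only):

```csharp
using System;
using System.IO;
using System.Xml;

public static class EvaluationOrderDemo
{
    // Captures NodeType before and after ReadString() on the same element.
    public static string Run()
    {
        const string xml = "<title>Test Xml File</title>";
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.Read();                         // reader is now on <title>
            XmlNodeType before = reader.NodeType;  // what Snippet 1 sees
            string text = reader.ReadString();     // advances past the text node
            XmlNodeType after = reader.NodeType;   // what Snippet 2 sees
            return before + " -> " + after + " (" + text + ")";
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

The node type changes from Element to EndElement across the call, which matches the <EndElement|title>Test Xml File</> line in the output of Snippet 2.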

Since your intent is to iterate through the nodes and get the value of each, rather than ReadString() or ReadElementContentAsString() you should just use XmlReader.Value:

gets the text value of the current node.

Thus your corrected code should look like:

string xmlcontent = reader.Value;
string xmlname = reader.Name;
string xmltype = reader.NodeType.ToString();
int xmldepth = reader.Depth;
Console.WriteLine(new string(' ', xmldepth * 2) + "<" + xmltype + "|" + xmlname + ">" + xmlcontent + "</>");
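For completeness, here is a sketch of the whole loop with the same settings as in the question. The class name ValueLoopDemo, the helper Dump and the sample XML are assumptions for illustration (the original test file is not reproduced here), and the lines are collected in a StringBuilder so the result is easy to inspect:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ValueLoopDemo
{
    // Visits every node exactly once; the only call that moves the reader is Read().
    public static string Dump(string xml)
    {
        var settings = new XmlReaderSettings
        {
            IgnoreComments = true,
            IgnoreWhitespace = true,
            IgnoreProcessingInstructions = true
        };
        var sb = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                sb.Append(' ', reader.Depth * 2)   // two spaces per nesting level
                  .Append('<').Append(reader.NodeType).Append('|').Append(reader.Name).Append('>')
                  .Append(reader.Value)            // empty for Element and EndElement nodes
                  .Append("</>").Append('\n');
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical sample input standing in for the question's test file.
        Console.Write(Dump("<head><title>Test Xml File</title></head>"));
    }
}
```

Because nothing inside the loop advances the reader, every node is reported once, with the text showing up as its own Text node one level below its element.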

XmlReader is tricky to work with. You always need to check the documentation to determine exactly where a given method positions the reader. For instance, XmlReader.ReadElementContentAsString() moves the reader past the end of the element, whereas XmlReader.ReadSubtree() returns a new reader and leaves the original reader on the end of the element once that new reader has been closed. As a general rule, any method whose name begins with Read is going to advance the reader, so you need to be careful when using a Read method inside an outer while (reader.Read()) loop.
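The two positioning behaviours can be compared with a short sketch. The class name PositioningDemo, both helper methods and the sample XML are illustrative assumptions, not part of the original code:

```csharp
using System;
using System.IO;
using System.Xml;

public static class PositioningDemo
{
    // Hypothetical sample input with no whitespace between elements.
    private const string SampleXml = "<item><id>1QBX23</id><title>Example Title</title></item>";

    // Reports where the reader is after ReadElementContentAsString().
    public static string AfterReadElementContentAsString()
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(SampleXml)))
        {
            reader.ReadToFollowing("id");
            string id = reader.ReadElementContentAsString(); // consumes </id> as well
            return id + " then " + reader.NodeType + " " + reader.Name;
        }
    }

    // Reports where the original reader is after a subtree reader has been drained and closed.
    public static string AfterReadSubtree()
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(SampleXml)))
        {
            reader.ReadToFollowing("id");
            using (XmlReader subtree = reader.ReadSubtree())
            {
                while (subtree.Read()) { } // read the <id> subtree to its end
            }
            return reader.NodeType + " " + reader.Name;
        }
    }

    public static void Main()
    {
        Console.WriteLine(AfterReadElementContentAsString());
        Console.WriteLine(AfterReadSubtree());
    }
}
```

In the first case the reader is already on the next element, so an enclosing while (reader.Read()) loop would skip it; in the second the reader is still on the end tag, and the next Read() moves on normally.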


Upvotes: 2
