Reputation: 1841
I'd be surprised if anyone can explain this, but it'd be interesting to know if others can reproduce the weirdness I'm experiencing...
We've got a thing based on InfoPath that processes a lot of forms. Form data should conform to an XSD, but InfoPath keeps adding its own metadata in the form of so-called "my-fields". We would like to remove the my-fields, and I wrote this simple method:
string StripMyFields(string xml)
{
var doc = new XmlDocument();
doc.LoadXml(xml);
var matches = doc.SelectNodes("//node()").Cast<XmlNode>().Where(n => n.NamespaceURI.StartsWith("http://schemas.microsoft.com/office/infopath/"));
Dbug("Found {0} nodes to remove.", matches.Count());
foreach (var m in matches)
m.ParentNode.RemoveChild(m);
return doc.OuterXml;
}
Now comes the really weird stuff! When I run this code it behaves as I expect it to, removing any nodes that are in InfoPath namespaces. However, if I comment out the call to Dbug, the code completes, but one "my-field" remains in the XML.
I've even commented out the content of the convenient Dbug method, and it still behaves this same way:
void Dbug(string s, params object[] args)
{
//if (args.Length > 0)
// s = string.Format(s, args);
//Debug.WriteLine(s);
}
Input XML:
<?xml version="1.0" encoding="UTF-8"?>
<skjema xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2008-03-03T22:25:25" xml:lang="en-us">
<Field-1643 orid="1643">data.</Field-1643>
<my:myFields>
<my:field1>Al</my:field1>
<my:group1>
<my:group2>
<my:field2 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">2009-01-01</my:field2>
<Field-1611 orid="1611">More data.</Field-1611>
<my:field3>true</my:field3>
</my:group2>
<my:group2>
<my:field2>2009-01-31</my:field2>
<my:field3>false</my:field3>
</my:group2>
</my:group1>
</my:myFields>
<Field-1612 orid="1612">Even more data.</Field-1612>
<my:field3>Blah blah</my:field3>
</skjema>
The "my:field3" element (at the bottom, text "Blah blah") is not removed unless I invoke Dbug.
Clearly the universe is not supposed to be like this, but I would be interested to know if others are able to reproduce.
I'm using VS2012 Premium (11.0.50727.1 RTMREL) and FW 4.5.50709 on Win8 Enterprise 6.2.9200.
Upvotes: 2
Views: 348
Reputation: 39
jimmy_keen's solution worked for me. I had just a simple
//d is an XmlDocument
XmlNodeList t = d.SelectNodes(xpath);
foreach (XmlNode x in t)
{
x.ParentNode.RemoveChild(x);
}
d.Save(outputpath);
this would remove only 3 nodes while stepping through in debug mode would remove 1000+ nodes.
Just adding a Count before the foreach solved the problem:
var count = t.Count;
Upvotes: 0
Reputation: 31454
First things first. LINQ uses concept known as deferred execution. This means no results are fetched until you actually materialize query (for example via enumeration).
Why would it matter with your nodes removal issue? Let's see what happens in your code:
SelectNodes
creates XPathNodeIterator
, which is used by XPathNavigator
which feeds data to XmlNodeList
returned by SelectNodes
XPathNodeIterator
walks xml document tree basing on XPath expression providedCast
and Where
simply decide whether node returned by XPathNodeIterator
should participate in final resultWe arrive right before DBug
method call. For a moment, assume it's not there. At this point, nothing have actually happened just yet. We only got unmaterialized LINQ query.
Things change when we start iterating. All the iterators (Cast
and Where
got their own iterators too) start rolling. WhereIterator
asks CastIterator
for item, which then asks XPathNodeIterator
which finally returns first node (Field-1643
). Unfortunately, this one fails the Where
test, so we ask for next one. More luck with my:myFields
, it is a match - we remove it.
We quickly proceed to my:field1
(again, WhereIterator → CastIterator → XPathNodeIterator), which is also removed. Stop here for a moment. Removing my:field1
detaches it from its parent, which results in setting its (my:field1
) siblings to null
(there's no other nodes before/after removed node).
What's the current state of things? XPathNodeIterator
knows its current element is my:field1
node, which just got removed. Removed as in detached from parent, but iterator still holds reference. Sounds great, let's ask it for next node. What XPathNodeIterator
does? Checks its Current
item, and asks for NextSibling
(since it has no children to walk first) - which is null
, given we just performed detachment. And this means iteration is over. Job done.
As a result, by altering collection structure during iteration, you only removed two nodes from your document (while in reality only one, as the second removed node was child of the one already removed).
Same behavior can be observed with much simpler XML:
<Root>
<James>Bond</James>
<Jason>Bourne</Jason>
<Jimmy>Keen</Jimmy>
<Tom />
<Bob />
</Root>
Suppose we want to get rid of nodes starting with J
, resulting in document containing only honest man names:
var doc = new XmlDocument();
doc.LoadXml(xml);
var matches = doc
.SelectNodes("//node()")
.Cast<XmlNode>()
.Where(n => n.Name.StartsWith("J"));
foreach (var node in matches)
{
node.ParentNode.RemoveChild(node);
}
Console.WriteLine(doc.InnerXml);
Unfortunately, Jason and Jimmy remain. James' next sibling (the one to be returned by iterator) was originally meant to be Jason, but as soon as we detached James from tree there's no siblings and iteration ends.
Now, why it works with DBug
? Count
call materializes query. Iterators have run, we got access to all nodes we need when we start looping. The same things happens with ToList
called right after Where
or if you inspect results during debug (VS even notifies you inspecting results will enumerate collection).
Upvotes: 3
Reputation: 1859
Very strange, its only when you actually view the results while debugging that it removes the last node. Incidentally, converting the result to a List and then looping through it also works.
List<XmlNode> matches = doc.SelectNodes("//node()").Cast<XmlNode>().Where(n => n.NamespaceURI.StartsWith("http://schemas.microsoft.com/office/infopath/")).ToList();
foreach (var m in matches)
{
m.ParentNode.RemoveChild(m);
}
Upvotes: 0
Reputation: 3388
I think this is down to the schrodinger's cat problem that Where will not actually compile the results of the query until you view or act upon it. Meaning, until you call Count() (or any other function for getting the results) or view it in debugger, the results don't exist. As a test, try put it as such:
if (matches.Any())
foreach (var m in matches)
m.ParentNode.RemoveChild(m);
Upvotes: 0