Ashish Gupta
Ashish Gupta

Reputation: 15139

C# - XML - Removing the nodes without using recursion

I have the following recursive method which takes the an XHTML document and marks nodes based on certain conditions and It is called like below for a number of HTML contents:-

XmlDocument document = new XmlDocument();
document.LoadXml(xmlAsString);
PrepNodesForDeletion(document.DocumentElement, document.DocumentElement);

The method definition is below

/// <summary>
/// Recursive function to identify and mark all unnecessary nodes so that they can be removed from the document.
/// </summary>
/// <param name="nodeToCompareAgainst">The node that we are recursively comparing all of its descendant nodes against</param>
/// <param name="nodeInQuestion">The node whose children we are comparing against the "nodeToCompareAgainst" node</param>
static void PrepNodesForDeletion(XmlNode nodeToCompareAgainst, XmlNode nodeInQuestion)
{
    if (infinityIndex++ > 100000)
    {
        throw;
    }
    foreach (XmlNode childNode in nodeInQuestion.ChildNodes)
    {
        // make sure we compare all of the childNodes descendants to the nodeToCompareAgainst
        PrepNodesForDeletion(nodeToCompareAgainst, childNode);

        if (AreNamesSame(nodeToCompareAgainst, childNode) && AllAttributesPresent(nodeToCompareAgainst, childNode))
        {
            // the function AnyAttributesWithDifferingValues assumes that all attributes are present between the two nodes
            if (AnyAttributesWithDifferingValues(nodeToCompareAgainst, childNode) && InnerTextIsSame(nodeToCompareAgainst, childNode))
            {
                MarkNodeForDeletion(nodeToCompareAgainst);
            }
            else if (!AnyAttributesWithDifferingValues(nodeToCompareAgainst, childNode))
            {
                MarkNodeForDeletion(childNode);
            }
        }

        // make sure we compare all of the childNodes descendants to the childNode
        PrepNodesForDeletion(childNode, childNode);
    }
}

And then the following method which would delete the marked node:-

static void RemoveMarkedNodes(XmlDocument document)
{
    // in order for us to make sure we remove everything we meant to remove, we need to do this in a while loop
    // for instance, if the original xml is = <a><a><b><a/></b></a><a/></a>
    // this should result in the xml being passed into this function as:
    // <a><b><a DeleteNode="TRUE" /></b><a DeleteNode="TRUE"><b><a DeleteNode="TRUE" /></b></a><a DeleteNode="TRUE" /></a>
    // then this function (without the while) will not delete the last <a/>, even though it is marked for deletion
    // if we incorporate a while loop, then we can insure all nodes marked for deletion are removed
    // TODO: understand the reason for this -- see http://groups.google.com/group/microsoft.public.dotnet.xml/browse_thread/thread/25df058a4efb5698/7dd0a8b71739216c?lnk=st&q=xmlnode+removechild+recursive&rnum=2&hl=en#7dd0a8b71739216c
    XmlNodeList nodesToDelete = document.SelectNodes("//*[@DeleteNode='TRUE']");

    while (nodesToDelete.Count > 0)
    {
        foreach (XmlNode nodeToDelete in nodesToDelete)
        {
            nodeToDelete.ParentNode.RemoveChild(nodeToDelete);
        }

        nodesToDelete = document.SelectNodes("//*[@DeleteNode='TRUE']");
    }
}

When I use the PrepNodesForDeletion method without the infinityIndex counter, I get OutOfMemoryException for few HTML contents. However, If I use infinityIndex counter, It may not be deleting nodes for some HTML contents.

Could anybody suggest any way to remove recursion. Also I am not familiar with the HtmlAgility pack. So, If this can be done using that, could somebody provide some code sample.

Upvotes: 0

Views: 1341

Answers (3)

Yaur
Yaur

Reputation: 7452

Your problem is that you have badly formed XML and as a direct result your DOM is a mess. What I think you are going to have to do is to use a SAX parser (which must exist for .net) and implement the logic to fix the DOM yourself which appears to be what you re trying to do.

This method isn't recursive but is going to require you to do some work that you didn't realize that you needed to do.

also note that you are getting an out of memory exception and not a stack overflow exception which reinforces the idea that too much recursion is not your problem per se.

Upvotes: 0

ChrisWue
ChrisWue

Reputation: 19020

Well, if I understand your algorithm correctly you want to do this: For each node in the tree compare it against all its child nodes in a non-recursive fashion, correct?

    // walk the tree in DFS
    public void XmlTreeWalk(XmlNode root, Action<XmlNode, XmlNode> action)
    {
        var nodesToCompare = new Stack<XmlNode>();
        foreach (XmlNode child in root.ChildNodes)
        {
            nodesToCompare.Push(child);
        }
        while (nodesToCompare.Count > 0)
        {
            var top = nodesToCompare.Pop();
            action(root, top);
            foreach (XmlNode child in top.ChildNodes)
            {
                nodesToCompare.Push(child);
            }
        }
    }

    // for each node: prepare all its children for deletion
    public void PrepareForDeletion(XmlNode root)
    {
        XmlTreeWalk(root, (r, c) => PrepareSubtreeForDeletion(r, c));
    }

    // for each node, compare all its children against the toCompare node
    private void PrepareSubtreeForDeletion(XmlNode toCompare, XmlNode root)
    {
        XmlTreeWalk(root, (unused, current) => MarkNodeForDeletion(toCompare, current));
    }

    // your delete logic
    public void MarkNodeForDeletion(XmlNode toCompare, XmlNode toCompareAgains)
    {
       ...
    }

What this should do is: Walk the tree top to bottom and for each node walk the subtree of that node comparing all children against this node.

I haven't tested it so it might contain bugs but the idea should be clear. Apparently this algorithm is O(n^2).

Upvotes: 1

Chuck Savage
Chuck Savage

Reputation: 11955

To remove recursion, the childs and parents must know about each other.

Then you can traverse say down the right leg from the root parent, until you reach the right most bottom leg.

And then from there, go up one, then down left one, and then down right until bottom. Repeat up one, down left, and then right as far as possible, etc. until you have looped over the entire tree structure.

I'm not sure on what you are attempting to do, to suggest how to use this method on your problem.

Upvotes: 0

Related Questions