DataWrangler

Reputation: 2165

Parse the Nodes of XML files

How can I parse all the XML files under a given directory (the directory is the input to the application) and write the output to a text file?

Note: The XML is not always the same. The nodes can vary from file to file, and each node can have any number of child nodes.

Any help or guidance in this regard would be really appreciated :)

XML File Sample

<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>
<CNT>USA</CNT>
<CODE>3456</CODE>
</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Hide your heart</TITLE>
<ARTIST>Bonnie Tyler</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>CBS Records</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1988</YEAR>
</CD>
</CATALOG>

C# Code

using System;
using System.Collections.Generic;
using System.Windows.Forms;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Data;
using System.Xml;
using System.Xml.Linq;

namespace XMLTagParser
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Please Enter the Location of the file");

            // get the location we want to get the sitemaps from 
            string dirLoc = Console.ReadLine();

            // get all the sitemaps 
            string[] sitemaps = Directory.GetFiles(dirLoc);
            StreamWriter sw = new StreamWriter(Application.StartupPath + @"\locs.txt", true);

            // loop through each file 
            foreach (string sitemap in sitemaps)
            {
                try
                {
                    // new xdoc instance 
                    XmlDocument xDoc = new XmlDocument();

                    //load up the xml from the location 
                    xDoc.Load(sitemap);

                    // cycle through each child node 
                    foreach (XmlNode node in xDoc.DocumentElement.ChildNodes)
                    {
                        // first node is the url ... have to go to nested loc node 
                        foreach (XmlNode locNode in node)
                        {
                            string loc = locNode.Name;

                            // write it to the console so you can see it's working 
                            Console.WriteLine(loc + Environment.NewLine);

                            // write it to the file 
                            sw.Write(loc + Environment.NewLine);
                        }
                    }
                }
                catch
                {
                    Console.WriteLine("Error :-(");
                }
            }
            Console.WriteLine("All Done :-)");
            Console.ReadLine();
        }
    }
}

Preferred Output:

CATALOG/CD/TITLE
CATALOG/CD/ARTIST
CATALOG/CD/COUNTRY/CNT
CATALOG/CD/COUNTRY/CODE
CATALOG/CD/COMPANY
CATALOG/CD/PRICE
CATALOG/CD/YEAR

CATALOG/CD/TITLE
CATALOG/CD/ARTIST
CATALOG/CD/COUNTRY
CATALOG/CD/COMPANY
CATALOG/CD/PRICE
CATALOG/CD/YEAR

Upvotes: 2

Views: 137

Answers (1)

Ste Griffiths

Reputation: 318

This is a recursive problem, and what you are looking for is called 'tree traversal'. This means that for each child node you look into its children, then into those nodes' children (if they have any), and so on, recording the 'path' as you go along but only printing the paths of the 'leaf' nodes.

You will need a function like this to 'traverse' the tree:

static void traverse(XmlNodeList nodes, string parentPath)
{
    foreach (XmlNode node in nodes)
    {
        string thisPath = parentPath;
        if (node.NodeType != XmlNodeType.Text)
        {
            //Prevent adding "#text" at the end of every chain
            thisPath += "/" + node.Name;
        }

        if (!node.HasChildNodes)
        {
            //Only print out this path if it is at the end of a chain
            Console.WriteLine(thisPath);
        }

        //Look into the child nodes using this function recursively
        traverse(node.ChildNodes, thisPath);
    }
}

And then here is how I would add it into your program (within your foreach sitemap loop):

try
{
    // new xdoc instance 
    XmlDocument xDoc = new XmlDocument();

    //load up the xml from the location 
    xDoc.Load(sitemap);

    // start traversing from the children of the root element;
    // DocumentElement skips any XML declaration or comment before the root
    var rootNode = xDoc.DocumentElement;
    traverse(rootNode.ChildNodes, rootNode.Name);
}
catch
{
    Console.WriteLine("Error :-(");
}
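
The snippets above print to the console, while the question asks for the paths to end up in a text file. Here is a rough sketch of one way that could be done, not a definitive implementation. The names `LeafPaths`, `Collect` and `WriteDirectory` are made up for illustration, and the `"*.xml"` filter is an assumption about how the files are named. It also differs from `traverse` in one respect: it only follows element nodes, so comments or CDATA sections do not show up as path segments.

```csharp
using System;
using System.IO;
using System.Xml;

static class LeafPaths
{
    // Writes the path of every leaf element at or below 'element'.
    // 'path' is the path of 'element' itself, e.g. "CATALOG/CD".
    public static void Collect(XmlElement element, string path, TextWriter output)
    {
        bool hasChildElements = false;

        foreach (XmlNode child in element.ChildNodes)
        {
            // Ignore text, comments, CDATA etc.; only elements extend the path
            if (child.NodeType == XmlNodeType.Element)
            {
                hasChildElements = true;
                Collect((XmlElement)child, path + "/" + child.Name, output);
            }
        }

        // An element without child elements is the end of a chain
        if (!hasChildElements)
        {
            output.WriteLine(path);
        }
    }

    // Processes every *.xml file in 'directory' and appends the paths to 'outputFile'.
    public static void WriteDirectory(string directory, string outputFile)
    {
        // 'using' flushes and closes the file when the block is left
        using (var writer = new StreamWriter(outputFile, true))
        {
            foreach (string file in Directory.GetFiles(directory, "*.xml"))
            {
                try
                {
                    var doc = new XmlDocument();
                    doc.Load(file);
                    XmlElement root = doc.DocumentElement;
                    Collect(root, root.Name, writer);
                }
                catch (XmlException ex)
                {
                    // Malformed XML: report it and carry on with the next file
                    Console.WriteLine("Skipping " + file + ": " + ex.Message);
                }
            }
        }
    }
}
```

From `Main` this could be called as `LeafPaths.WriteDirectory(dirLoc, "locs.txt")`. Passing a `TextWriter` into `Collect` means the same method works with `Console.Out` if you want to watch the output while testing.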

I made use of this other helpful answer: Traverse a XML using Recursive function

Hope this helps! :)

Upvotes: 2
