CARASS

Reputation: 245

Extract information from a large XML file

I need to get some URLs from a large XML file.

The XML file has the structure below.

<Main>
 <Product>
  <Images>
   <URL>image1.jpg</URL>
   <URL>image2.jpg</URL>
   <URL>image3.jpg</URL>
   <URL>image4.jpg</URL>
  </Images>
 </Product>

......

I need to extract all the links into a text file. Any ideas on how to do this?

Upvotes: 3

Views: 1030

Answers (4)

mirod

Reputation: 16136

If you have Perl installed (or can install it), you can use xml_grep, which comes with XML::Twig (available in ActiveState Perl, in Strawberry Perl, and of course on CentOS).

xml_grep --text_only URL product_file.xml > url.txt

It can deal with very large files, since it works in stream mode.

Upvotes: 3

Rastus7

Reputation: 416

How about using XPath to retrieve the nodes you need? You could then write the contents of that list into a text file. Here's some code in C# that should do the job for you:

// Requires: using System.IO; using System.Xml;
public static void Main(string[] Arguments)
{
    XmlDocument oXmlDocument = new XmlDocument();
    oXmlDocument.Load(@"XmlFile.xml");

    // new StreamWriter(path) truncates an existing file; File.OpenWrite would
    // leave stale bytes at the end if the old file was longer.
    using (StreamWriter oStreamWriter = new StreamWriter(@"Output.txt"))
    {
        // "//URL" selects every <URL> element at any depth.
        foreach (XmlNode oXmlNode in oXmlDocument.SelectNodes("//URL"))
        {
            oStreamWriter.WriteLine(oXmlNode.InnerText);
        }
    }
}

If the document is huge, it might be better to consider a streaming, SAX-style approach (in .NET, XmlReader) rather than loading everything into the DOM.
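For what it's worth, here is a minimal sketch of that streaming idea using XmlReader. The class and method names (UrlStreamer, ReadUrls) are made up for illustration, and it assumes the elements are named exactly URL with no XML namespace:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class UrlStreamer
{
    // Lazily yields the text of every <URL> element, one node at a time,
    // so the whole document is never held in memory.
    public static IEnumerable<string> ReadUrls(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "URL")
                {
                    // Reads the element's text and moves past its end tag,
                    // so we must not call Read() again here or we could skip
                    // an adjacent <URL> element.
                    yield return reader.ReadElementContentAsString();
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    public static void Main()
    {
        // Sample input taken from the question; swap in a StreamReader
        // over the real file for actual use.
        string sample = "<Main><Product><Images>" +
                        "<URL>image1.jpg</URL><URL>image2.jpg</URL>" +
                        "</Images></Product></Main>";

        foreach (string url in ReadUrls(new StringReader(sample)))
        {
            Console.WriteLine(url);
        }
    }
}
```

To process the real file, you would pass new StreamReader(@"XmlFile.xml") instead of the StringReader and write each URL to a StreamWriter rather than the console.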

I hope that helps.

Upvotes: 1

David Schwartz

Reputation: 2006

The following is an example that should load the XML you've pasted. You'll need to add a reference to System.Xml.Linq because it uses LINQ to XML. First we load the XML document using XDocument.Load(...):

// Load the raw XML into an XDocument.
var doc = XDocument.Load(new StringReader(@"<Main>
 <Product>
  <Images>
   <URL>image1.jpg</URL>
   <URL>image2.jpg</URL>
   <URL>image3.jpg</URL>
   <URL>image4.jpg</URL>
  </Images>
 </Product>
</Main>"));

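I use a StringReader and the example string, but you should change it to something that loads your XML file. For example, XDocument.Load(@"C:\folder\file.xml") will load a file (see XDocument.Load(string)). Note the @ prefix: without it, the \f sequences in the path would be treated as form-feed escape characters.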

// Create a list to store the URLs in.
var urls = new List<string>();

// Get the <Main> element.
var mainNode = doc.Element("Main");

// Loop through the <Product> elements...
foreach (var productNode in mainNode.Elements("Product"))
{
    // Loop through the <Images> elements (if there's more than one).
    foreach (var imagesNode in productNode.Elements("Images"))
    {
        // Loop through the <URL> elements...
        foreach (var urlNode in imagesNode.Elements("URL"))
        {
            // The "Value" property is the text within the element.
            urls.Add(urlNode.Value);
        }
    }
}

// Write the URLs out to the Debug output.
foreach (var url in urls)
    Debug.WriteLine(url);

At this point, you'll have a list of URLs. If you want to write them to a file, here's an example:

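// Create an output file.
using (var outputFile = File.Create("output.txt"))
using (var writer = new StreamWriter(outputFile))
{
    // Disposing the writer flushes its buffer; without that,
    // the file can end up empty or truncated.
    foreach (var url in urls)
        writer.WriteLine(url);
}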

You don't necessarily have to create the list and then write the list to the file like I did; you could just write the URLs to the text file as you read them.
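As a rough sketch of that shortcut, the nested loops and the intermediate list can be collapsed into a single lazy query that is written straight to the file. UrlExporter and GetUrls are names I made up here. Note that Descendants("URL") matches URL elements at any depth, which is slightly broader than the Main/Product/Images path the loops above follow:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class UrlExporter
{
    // Returns the text of every <URL> element in the document, lazily.
    public static IEnumerable<string> GetUrls(XDocument doc)
    {
        return doc.Descendants("URL").Select(urlNode => urlNode.Value);
    }

    public static void Main()
    {
        // Sample document from the question; use XDocument.Load(path) for a real file.
        var doc = XDocument.Parse(
            "<Main><Product><Images>" +
            "<URL>image1.jpg</URL><URL>image2.jpg</URL>" +
            "</Images></Product></Main>");

        // WriteAllLines enumerates the query and writes one URL per line.
        File.WriteAllLines("output.txt", GetUrls(doc));
    }
}
```

This still loads the whole document into memory through XDocument, so it only removes the extra list, not the memory cost of parsing a very large file.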

Let me know if there's anything else I can do to help.

Upvotes: 0

Florian Eck

Reputation: 495

Do you only need the URLs?

From the given structure, it looks like the URLs are associated with the image/product data. If you don't care about the other data and only need all the URLs, a regex should be the way to go.
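A minimal sketch of the regex route, assuming each URL element opens and closes on the same line and carries no attributes, CDATA sections, or XML entities (a regex will not decode things like &amp;amp;). The names UrlRegexExtractor and ExtractUrls are just placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlRegexExtractor
{
    // Captures whatever sits between <URL> and </URL>, ignoring surrounding whitespace.
    private static readonly Regex UrlElement =
        new Regex(@"<URL>\s*(.*?)\s*</URL>", RegexOptions.Compiled);

    // Scans the input line by line and yields each captured URL.
    public static IEnumerable<string> ExtractUrls(IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            foreach (Match match in UrlElement.Matches(line))
            {
                yield return match.Groups[1].Value;
            }
        }
    }

    public static void Main()
    {
        // Sample lines from the question's XML.
        var sample = new[]
        {
            "   <URL>image1.jpg</URL>",
            "   <URL>image2.jpg</URL>"
        };

        foreach (string url in ExtractUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

For a large file you could pass File.ReadLines(path), which reads lazily, and write the results out with File.WriteAllLines. If the file's formatting is not that regular, a real XML parser is the safer choice.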

Upvotes: 0
