BernardG

Reputation: 1966

Better way to use LINQ To XML for an HTML Page

I am looking for specific items on a web page. What I have so far (as a test) works fine, but it looks really ugly to me. I would like suggestions for doing this more concisely, ideally with ONE LINQ query instead of the two I have now.

        document.GetXDocument();
        string xmlns = "{http://www.w3.org/1999/xhtml}";
        var AllElements = from AnyElement in document.fullPage.Descendants(xmlns + "div")
                          where AnyElement.Attribute("id") != null && AnyElement.Attribute("id").Value == "maincolumn"
                          select AnyElement;
        // This first query returns only one LARGE element.

        XDocument subdocument = new XDocument(AllElements);

        var myElements = from item in subdocument.Descendants(xmlns + "img")
                         where String.IsNullOrEmpty(item.Attribute("src").Value.Trim()) != true
                         select item;

        foreach (var element in myElements)
        {   
            Console.WriteLine(element.Attribute("src").Value.Trim());                                                          
        }
        Assert.IsNotNull(myElements.Count());

I know I could directly look for "img", but I want to be able to get other types of items in those pages, like links and some text.

I strongly doubt this is the best way!

Upvotes: 3

Views: 1504

Answers (2)

Zev Spitz

Reputation: 15357

If you insist on parsing the web page as XML, try this:

var elements =
    from element in document.Descendants(xmlns + "div")
    where (string)element.Attribute("id") == "maincolumn"
    from element2 in element.Descendants(xmlns + "img")
    // The cast yields null when src is missing; ?? "" keeps Trim() from throwing.
    let src = ((string)element2.Attribute("src") ?? "").Trim()
    where !String.IsNullOrEmpty(src)
    select new {
        element2,
        src
    };

foreach (var item in elements) {
    Console.WriteLine(item.src);
}

Notes:

  • What is the type of document? I am assuming it's an XDocument. If so, you can call Descendants directly on it. (OTOH, if document is an XDocument, where does that fullPage property come from?)
  • Cast the XAttribute to a string. If the attribute is missing, the result of the cast is null, so the comparison simply fails and you don't need the separate null check. (This doesn't offer any performance benefit.)
  • Use let to "save" a value for later reuse, in this case for use in the foreach. If all you need is that final Assert, it might be more efficient to use Any instead of Count: Any only has to iterate as far as the first result in order to return a value, while Count has to iterate over all of them.
  • Why is subdocument of type XDocument? Wouldn't XElement be the appropriate type?
  • You can also use String.IsNullOrWhiteSpace instead of String.IsNullOrEmpty to reject a whitespace-only src, if you want to process src as is (untrimmed), with any whitespace it might have.
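
The cast, let, and Any points above can be sketched as a small self-contained program. The sample markup and the names Sample, Page, and ImageSources are made up for illustration; they are not from the question, and this is only one way to arrange the query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class Sample
{
    // Hypothetical XHTML fragment standing in for the real page.
    public const string Page = @"<html xmlns='http://www.w3.org/1999/xhtml'><body>
<div id='sidebar'><img src='ignored.png'/></div>
<div id='maincolumn'>
  <img src=' a.png '/>
  <img src='   '/>
  <img alt='no src attribute'/>
  <p><img src='b.png'/></p>
</div></body></html>";

    public static IEnumerable<string> ImageSources(XDocument doc)
    {
        XNamespace ns = "http://www.w3.org/1999/xhtml";
        return from div in doc.Descendants(ns + "div")
               // Casting a missing attribute gives null, so no separate null check.
               where (string)div.Attribute("id") == "maincolumn"
               from img in div.Descendants(ns + "img")
               let src = (string)img.Attribute("src")
               // Skips both a missing src (null) and a whitespace-only src.
               where !String.IsNullOrWhiteSpace(src)
               select src.Trim();
    }

    public static void Main()
    {
        var sources = ImageSources(XDocument.Parse(Page));
        foreach (var s in sources) Console.WriteLine(s);
        // Any() stops at the first match; Count() would walk every result.
        Console.WriteLine(sources.Any());
    }
}
```

With this sample input, only the two img elements inside the maincolumn div that have a non-blank src are returned, trimmed.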

Upvotes: 0

tukaef

Reputation: 9214

The same logic in a single query:

var myElements = from element in document.fullPage.Descendants(xmlns + "div")
                 where element.Attribute("id") != null
                       && element.Attribute("id").Value == "maincolumn"
                 from item in new XDocument(element).Descendants(xmlns + "img")
                 where !String.IsNullOrEmpty(item.Attribute("src").Value.Trim())
                 select item;
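
Since the question also mentions wanting links and other items, here is a rough sketch of a variant that skips the new XDocument(element) copy and takes the element names as a parameter. The names MainColumn and Find and the sample markup are hypothetical, and it assumes the page is available as an XContainer (for example an XDocument or XElement).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class MainColumn
{
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns descendants of the div with id "maincolumn" whose local name is in `names`.
    public static IEnumerable<XElement> Find(XContainer page, params string[] names)
    {
        return from div in page.Descendants(Xhtml + "div")
               where (string)div.Attribute("id") == "maincolumn"
               // Descendants() works on the element itself; no copy into a new XDocument.
               from item in div.Descendants()
               where item.Name.Namespace == Xhtml && names.Contains(item.Name.LocalName)
               select item;
    }

    public static void Main()
    {
        // Made-up markup for illustration only.
        var page = XDocument.Parse(
            "<html xmlns='http://www.w3.org/1999/xhtml'><body><div id='maincolumn'>" +
            "<a href='/next'>Next</a><img src='a.png'/></div></body></html>");

        foreach (var e in Find(page, "img", "a"))
        {
            var target = (string)e.Attribute("src") ?? (string)e.Attribute("href");
            Console.WriteLine(e.Name.LocalName + ": " + target);
        }
    }
}
```

Results come back in document order, so a caller that needs images and links separately can group on Name.LocalName afterwards.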

Upvotes: 1
