User987

Reputation: 3825

HtmlAgilityPack filtering HTML based on a query

I have two blocks of HTML. The first one looks like this:

<div class="a-row">
    <a class="a-size-small a-link-normal a-text-normal" href="/Chemical-Guys-CWS-107-Extreme-Synthetic/dp/B003U4P3U0/ref=sr_1_1_sns?s=automotive&amp;ie=UTF8&amp;qid=1504525216&amp;sr=1-1">
        <span aria-label="$19.51" class="a-color-base sx-zero-spacing">
            <span class="sx-price sx-price-large">
                <sup class="sx-price-currency">$</sup>
                <span class="sx-price-whole">19</span>
                <sup class="sx-price-fractional">51</sup>
            </span>
        </span>
        <span class="a-letter-space"></span>Subscribe &amp; Save
    </a>
</div>

And the second block:

<div class="a-row a-spacing-none">
    <a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B003U4P3U0" rel="nofollow noreferrer">
        <span aria-label="$22.95" class="a-color-base sx-zero-spacing">
            <span class="sx-price sx-price-large">
                <sup class="sx-price-currency">$</sup>
                <span class="sx-price-whole">22</span>
                <sup class="sx-price-fractional">95</sup>
            </span>
         </span>
    </a>
    <span class="a-letter-space"></span>
    <i class="a-icon a-icon-prime a-icon-small s-align-text-bottom" aria-label="Prime">
        <span class="a-icon-alt">Prime</span>
    </i>
</div>

Both of these blocks are quite similar in structure, but the trick is that I want to extract the price from the block whose link is followed by an i element with the attribute aria-label="Prime".

This is how I currently extract the price, but it doesn't do what I need:

if (htmlDoc.DocumentNode.SelectNodes("//span[@class='a-color-base sx-zero-spacing']") != null)
{
    var span = htmlDoc.DocumentNode.SelectSingleNode("//span[@class='a-color-base sx-zero-spacing']");
    price = span.Attributes["aria-label"].Value;
}

This basically selects the first matching element, since there is more than one. What I'd like instead is to select the span that belongs to the Prime block, as in the second piece of HTML above. If no such element exists, I would simply fall back to the first method shown above.

Can someone help me out with this?

I've also tried something like this:

 var pr = htmlDoc.DocumentNode.SelectNodes("//a[@class='a-link-normal a-text-normal']")
    .Where(x => x.SelectSingleNode("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']") != null)
    .Select(x => x.SelectSingleNode("//span[@class='a-color-base sx-zero-spacing']").Attributes["aria-label"].Value);

But it's still returning the first element.

New version guys:

var pr = htmlDoc.DocumentNode.SelectNodes("//a[@class='a-link-normal a-text-normal']");
string prrrrrr = "";
for (int i = 0; i < pr.Count; i++)
{
    if (pr.ElementAt(i).SelectNodes("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']").ElementAt(i) != null)
    {
        prrrrrr = pr.ElementAt(i).SelectNodes("//span[@class='a-color-base sx-zero-spacing']").ElementAt(i).Attributes["aria-label"].Value;
    }
}

So the idea is that I collect all the "a" elements from the HTML document into a node collection, then loop through them to see which one actually contains the element I'm looking for, and take the price from that one...?

The problem here is that this if statement always passes:

 if (pr.ElementAt(i).SelectNodes("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']").ElementAt(i) != null)

How can I loop through each individual element in the node collection?

Upvotes: 2

Views: 1450

Answers (1)

krlzlx

Reputation: 5832

I think you should start at the div level, with the divs whose class starts with a-row. Then loop over them and check whether the div contains an i element whose aria-label attribute equals 'Prime'. Finally, get the span with the class a-color-base sx-zero-spacing and read the value of its aria-label attribute, like this:

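HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//div[starts-with(@class,'a-row')]");

// SelectNodes returns null (not an empty collection) when nothing matches.
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // Relative XPath: only a direct <i> child of this div is considered.
        HtmlNode i = node.SelectSingleNode("i[@aria-label='Prime']");

        if (i != null)
        {
            // The leading "." keeps the search inside the current div.
            HtmlNode span = node.SelectSingleNode(".//span[@class='a-color-base sx-zero-spacing']");

            if (span != null)
            {
                string currentValue = span.Attributes["aria-label"].Value;
            }
        }
    }
}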
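
As for why the attempts in the question always return the first element: an XPath expression that starts with // is evaluated from the document root, even when SelectSingleNode or SelectNodes is called on a child node. To search only inside the current node, start the expression with .// (or use a relative path such as i[...]).

The question also asks for a fallback to the first price when there is no Prime block. A minimal sketch of that, assuming the markup shown in the question, could look like the following. PriceExtractor and GetPrice are hypothetical names used only for illustration, and the Prime check is folded into a single XPath predicate instead of a loop:

```csharp
using HtmlAgilityPack;

static class PriceExtractor
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the Prime
    // price if one exists, otherwise the first price found, otherwise null.
    public static string GetPrice(HtmlDocument htmlDoc)
    {
        const string priceSpan = "span[@class='a-color-base sx-zero-spacing']";

        // Prefer the price span inside an a-row div that has a direct
        // <i aria-label="Prime"> child.
        HtmlNode span = htmlDoc.DocumentNode.SelectSingleNode(
            "//div[starts-with(@class,'a-row')][i[@aria-label='Prime']]//" + priceSpan);

        // No Prime block: fall back to the first price span in the document.
        if (span == null)
        {
            span = htmlDoc.DocumentNode.SelectSingleNode("//" + priceSpan);
        }

        // The null-conditional operators cover both "no span at all" and
        // "span without an aria-label attribute".
        return span?.Attributes["aria-label"]?.Value;
    }
}
```

It would be called as string price = PriceExtractor.GetPrice(htmlDoc); after loading the page into htmlDoc. A null result means no price span was found at all.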

Upvotes: 1
