ElektroStudios

Reputation: 20494

Html nodes issue with HtmlAgilityPack

I'm having trouble parsing this HTML with the HtmlAgilityPack library.

In the markup below, I would like to retrieve only the URL (href) that refers to uploaded.net, but I can't determine which link belongs to which host.

<div class='downloads' id='download_block'>

    <h5 style='text-align:center'>FREE DOWNLOAD LINKS</h5>

    <h4>uploadable.ch</h4>
    <ul class='parts'>
        <li>
            <a href="http://url/..." target="_blank"> text here</a>
        </li>
    </ul>

    <h4>uploaded.net</h4>
    <ul class='parts'>
        <li>
            <a href="http://url/..." target="_blank"> text here</a>
        </li>
    </ul>

    <h4>novafile.com</h4>
    <ul class='parts'>
        <li>
            <a href="http://url/..." target="_blank"> text here</a>
        </li>
    </ul>

</div>

[Screenshot: how the download block looks on the web page]

And this is what I have:

nodes = myHrmlDoc.DocumentNode.SelectNodes(".//div[@class='downloads']/ul[@class='parts']")

I can't just use an array index to determine the position, like:

nodes(0) = uploadable.ch node
nodes(1) = uploaded.net node
nodes(2) = novafile.com node

...because the site could change the number of nodes and the order of the hosts.

Note also that the URLs do not contain the host names; they are redirections like:

http://xxxxxx/r/YEHUgL44xONfQAnCNUVw_aYfY5JYAy0DT-i--

What could I do, in C# or VB.NET?

Upvotes: 2

Views: 382

Answers (3)

Xi Sigma

Reputation: 2372

This should do it, though it is untested:

doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'uploaded.net')]/following-sibling::ul//a").Attributes["href"].Value

Use contains rather than an exact match, because the heading text may include surrounding whitespace.
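A slightly more defensive version of the same XPath idea could look like the sketch below. The class and method names (HostLinks, GetLinks) are hypothetical, and it assumes the host name contains no single quote, since it is spliced into the XPath string. The ul[1] step limits the match to the list directly after the heading, so the links of later hosts are not picked up.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HostLinks
{
    // Hypothetical helper: returns every href listed under the <h4> that names `host`.
    public static List<string> GetLinks(string html, string host)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ul[1] = nearest following <ul> sibling of the matching heading.
        // Assumption: `host` contains no single quote.
        string xpath = "//h4[contains(text(), '" + host + "')]" +
                       "/following-sibling::ul[1]//a[@href]";

        var links = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(xpath);
        if (anchors == null)
            return links; // SelectNodes returns null when nothing matches

        foreach (HtmlNode a in anchors)
            links.Add(a.GetAttributeValue("href", string.Empty));
        return links;
    }
}
```

Calling HostLinks.GetLinks(html, "uploaded.net") returns an empty list instead of throwing when the host is missing from the page.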

Upvotes: 2

Matt

Reputation: 3680

Given the snippet you supplied, this will help you get started.

var page = "<div class=\"downloads\" id=\"download_block\">    <h5 style=\"text-align:center\">FREE DOWNLOAD LINKS</h5>    <h4>uploadable.ch</h4>    <ul class=\"parts\">        <li>            <a href=\"http://url/...\" target=\"_blank\"> text here</a>        </li>    </ul>    <h4>uploaded.net</h4>    <ul class=\"parts\">        <li>            <a href=\"http://url/...\" target=\"_blank\"> text here</a>        </li>    </ul>    <h4>novafile.com</h4>    <ul class=\"parts\">        <li>            <a href=\"http://url/...\" target=\"_blank\"> text here</a>        </li>    </ul></div>";

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);

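// Find the <h4> heading that names the host the question asks about.
var nodes = doc.DocumentNode.Descendants("h4").Where(n => n.InnerText.Contains("uploaded.net"));
foreach (var node in nodes)
{
    // The first NextSibling is the whitespace text node after </h4>; the second is the <ul>.
    var attr = node.NextSibling.NextSibling.Descendants().Where(x=> x.Name == "a").FirstOrDefault().Attributes["href"];
    attr.Value.Dump(); // Dump() is a LINQPad extension; use Console.WriteLine outside LINQPad
}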
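Because the number and order of hosts can change, one option is to index every heading once and then look hosts up by name. The sketch below is only an illustration of that idea: DownloadIndex and Build are hypothetical names, and it assumes each h4 is followed (possibly after whitespace) by the ul that holds its links. It skips text nodes explicitly rather than relying on NextSibling.NextSibling.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DownloadIndex
{
    // Hypothetical helper: maps each <h4> host name to the hrefs in the <ul> after it.
    public static Dictionary<string, List<string>> Build(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var map = new Dictionary<string, List<string>>();
        foreach (HtmlNode h4 in doc.DocumentNode.Descendants("h4"))
        {
            // Skip whitespace text nodes (and comments) between the heading and the list.
            HtmlNode next = h4.NextSibling;
            while (next != null && next.NodeType != HtmlNodeType.Element)
                next = next.NextSibling;

            // Ignore headings that are not directly followed by a <ul>.
            if (next == null || next.Name != "ul")
                continue;

            map[h4.InnerText.Trim()] = next.Descendants("a")
                .Select(a => a.GetAttributeValue("href", string.Empty))
                .Where(href => href.Length > 0)
                .ToList();
        }
        return map;
    }
}
```

A lookup such as DownloadIndex.Build(page).TryGetValue("uploaded.net", out var links) then works regardless of where the host appears in the list.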

Upvotes: 1

TyCobb

Reputation: 9089

The only way I see this working is a two-step approach. Sorry, I don't have HtmlAgilityPack at hand, but here is an example using the standard XmlDocument. Even though you said you can't use array indexes, this approach lets you do so by looking up the correct index dynamically.

void Main()
{
    var xml = @"
<div class=""downloads"" id=""download_block"">
    <h5 style=""text-align:center"">FREE DOWNLOAD LINKS</h5>
    <h4>uploadable.ch</h4>
    <ul class=""parts"">
        <li>
            <a href=""http://url/..."" target=""_blank""> text here</a>
        </li>
    </ul>
    <h4>uploaded.net</h4>
    <ul class=""parts"">
        <li>
            <a href=""http://upload.net/..."" target=""_blank""> text here</a>
        </li>
    </ul>
    <h4>novafile.com</h4>
    <ul class=""parts"">
        <li>
            <a href=""http://url/..."" target=""_blank""> text here</a>
        </li>
    </ul>
</div>";

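    var xmlDocument = new XmlDocument();
    xmlDocument.LoadXml(xml);

    var nav = xmlDocument.CreateNavigator();
    // Step 1: the 1-based position of the uploaded.net <h4> among its sibling <h4> elements.
    var index = nav.Evaluate("count(//h4[text()='uploaded.net']/preceding-sibling::h4)+1").ToString();
    // Step 2: pick the <ul> at the same position and read the href of its anchor.
    var text = xmlDocument.SelectSingleNode("//ul["+index +"]//a/@href").InnerText;

    Console.WriteLine(text);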
}

Basically, it gets the index of the uploaded.net h4, then uses that index to select the matching ul tag and read the URL from the underlying anchor tag.

Sorry for the not-so-clean and error-prone code, but it should point you in the right direction.
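The index lookup relies on every h4 having exactly one matching ul. As a possible alternative, the two steps can be folded into a single XPath that goes from the heading straight to the nearest following ul. This is a minimal sketch using the same standard XmlDocument; HostLinkXml and FindHref are hypothetical names, the input must be well-formed XML, and the host name is assumed to contain no single quote.

```csharp
using System;
using System.Xml;

public static class HostLinkXml
{
    // Hypothetical helper: returns the first href under the <h4> whose text equals
    // `host`, or null when there is no such heading or link.
    public static string FindHref(string xml, string host)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the markup is not well-formed

        // normalize-space(.) trims whitespace around the heading text;
        // following-sibling::ul[1] is the closest <ul> after that heading.
        XmlNode attr = doc.SelectSingleNode(
            "//h4[normalize-space(.)='" + host + "']/following-sibling::ul[1]//a/@href");

        return attr?.Value;
    }
}
```

With the sample markup above, HostLinkXml.FindHref(xml, "uploaded.net") would return the href of the anchor in the second list, without counting headings first.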

Upvotes: 1
