ssd

Reputation: 2442

How to parse an HTML node containing multiple tags?

I can reach the node I want to extract from, but I can't figure out how to separate the different tags within it.

P.S. I'm OK with regular expressions; I'm just curious whether a simpler way exists with Html Agility Pack.

Code:

...
...
HtmlNodeCollection nodes = webContent.DocumentNode.SelectNodes("//*[@id='node-name']/ul/li");

foreach (HtmlNode node in nodes) {
    String link = ???; // extract the http link here (href)
    String text = ???; // extract the inner text here
    String nums = ???; // extract the content of <small> tag here
    ...
}

HTML sample:

...
...
<ul class="some-class-name">
  <li>
    <a href="http://link-1.com">text for link 1<small>1</small></a>
  </li>
  <li>
    <a href="http://link-2.org">text for link 2<small>2</small></a>
  </li>
  <li>
    <a href="http://link-3.net">text for link 3<small>3</small></a>
  </li>
</ul>
...
...

Upvotes: 0

Views: 380

Answers (1)

Veverke

Reputation: 11408

You can use either Element/Elements or Descendants from Html Agility Pack's native API. Element(s) looks only at the direct children of a node, while Descendants searches the whole subtree.
Keep in mind that extensions exist that add CSS selector querying, which in my understanding is the preferred (and easiest) way.
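To illustrate the CSS selector route, here is a minimal sketch. It assumes the Fizzler.Systems.HtmlAgilityPack NuGet package, which is one such extension and not necessarily the one this answer had in mind. The class name CssSelectorSketch and the method HrefAndSmall are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // assumption: Fizzler extension package is installed

public static class CssSelectorSketch
{
    // Hypothetical helper: returns (href, <small> text) for every matching anchor.
    public static List<(string Href, string Small)> HrefAndSmall(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // QuerySelectorAll is the extension method Fizzler adds to HtmlNode.
        return doc.DocumentNode
                  .QuerySelectorAll("ul.some-class-name > li > a")
                  .Select(a => (
                      a.GetAttributeValue("href", ""),
                      a.QuerySelector("small")?.InnerText ?? ""))
                  .ToList();
    }

    public static void Main()
    {
        var html = @"
            <ul class='some-class-name'>
              <li><a href='http://link-1.com'>text for link 1<small>1</small></a></li>
              <li><a href='http://link-2.org'>text for link 2<small>2</small></a></li>
            </ul>";

        foreach (var (href, small) in HrefAndSmall(html))
        {
            Console.WriteLine($"a href: [{href}] small: [{small}]");
        }
    }
}
```

The selector replaces the chain of Element calls with a single query string, which is the main appeal of this approach.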

Here is a code snippet using the native API:

    //https://stackoverflow.com/q/70203208/1219280
    var doc = new HtmlDocument();
    doc.LoadHtml(@"
        <ul class='some-class-name'>
          <li>
            <a href='http://link-1.com'>text for link 1<small>1</small></a>
          </li>
          <li>
            <a href='http://link-2.org'>text for link 2<small>2</small></a>
          </li>
          <li>
            <a href='http://link-3.net'>text for link 3<small>3</small></a>
          </li>
        </ul>
    ");

    Console.WriteLine("-------------------- Using Element(s) -------------------------");

    //using Element(s), queries children in the next level only
    var ul = doc.DocumentNode.Element("ul");
    var lis = ul.Elements("li");
    foreach(var li in lis)
    {
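        var a = li.Element("a");
        //the second argument is the default returned when the attribute is missing
        var href = a?.GetAttributeValue("href", "");
        //null-conditional on a as well, so an <li> without an <a> does not throw
        var smallText = a?.Element("small")?.InnerText;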

        Console.WriteLine($"a href: [{href}] small: [{smallText}]");
    }

    Console.WriteLine("-------------------- Using Descendants -------------------------");

    //using Descendants, searches the whole subtree at any depth
    var anchors = doc.DocumentNode.Descendants("a");
    foreach(var a in anchors)
    {
        //a is never null inside this loop, so no null-conditional is needed here
        var href = a.GetAttributeValue("href", "");
        var smallText = a.Element("small")?.InnerText;

        Console.WriteLine($"a href: [{href}] small: [{smallText}]");
    }

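Output (each loop prints the same three lines after its header):

    a href: [http://link-1.com] small: [1]
    a href: [http://link-2.org] small: [2]
    a href: [http://link-3.net] small: [3]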
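The snippet above prints the href and the small content, but not the link text the question also asks for. Here is a minimal sketch of one way to get all three values inside the question's own XPath loop. The wrapping div with id 'node-name' is an assumption (the question elides the surrounding markup), and the names LinkExtractor and ExtractLinks are made up for the example. InnerText on the anchor would also include the text of the nested small tag, so the sketch concatenates only the anchor's direct text nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns (link, text, nums) for each <li> matched by the question's XPath.
    public static List<(string Link, string Text, string Nums)> ExtractLinks(string html)
    {
        var result = new List<(string Link, string Text, string Nums)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@id='node-name']/ul/li");
        if (nodes == null) return result;

        foreach (HtmlNode node in nodes)
        {
            HtmlNode a = node.Element("a");
            if (a == null) continue;

            string link = a.GetAttributeValue("href", "");
            string nums = a.Element("small")?.InnerText ?? "";

            // Keep only the direct text children of <a>, skipping the <small> element.
            string text = string.Concat(
                a.ChildNodes
                 .Where(n => n.NodeType == HtmlNodeType.Text)
                 .Select(n => n.InnerText)).Trim();

            result.Add((link, text, nums));
        }
        return result;
    }

    public static void Main()
    {
        // Assumed wrapper: the question only shows the <ul>, not the element carrying the id.
        var html = @"
            <div id='node-name'>
              <ul class='some-class-name'>
                <li><a href='http://link-1.com'>text for link 1<small>1</small></a></li>
                <li><a href='http://link-2.org'>text for link 2<small>2</small></a></li>
              </ul>
            </div>";

        foreach (var (link, text, nums) in ExtractLinks(html))
        {
            Console.WriteLine($"link: [{link}] text: [{text}] nums: [{nums}]");
        }
    }
}
```

If the link text can contain HTML entities, passing it through HtmlEntity.DeEntitize before use is probably worthwhile.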

Upvotes: 1
