Shiwers

Reputation: 27

How to extract data from webpage using C#

I'm trying to extract the text from this HTML tag:

<span id="example1">sometext</span>

And I have this code:

using System;
using System.Net;
using HtmlAgilityPack;

namespace GC_data_console
{
    class Program
    {
        public static void Main(string[] args)
        {           
            using (var client = new WebClient())
            {
                // Download the HTML
                string html =
                    client.DownloadString("https://www.requestedwebsite.com");
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);
                foreach(HtmlNode link in
                        doc.DocumentNode.SelectNodes("//span"))
                {
                    HtmlAttribute href = link.Attributes["id='example1'"];
                    if (href != null)
                    {
                        Console.WriteLine(href.Value.ToString());
                        Console.ReadLine();
                    }
                }
            }
        }
    }
}

But I still don't get the text sometext. If I use this line instead:

HtmlAttribute href = link.Attributes["id"];

I get all the ID values. What am I doing wrong?

Upvotes: 1

Views: 2513

Answers (1)

Chetan

Reputation: 6891

You first need to understand the difference between an HtmlNode and an HtmlAttribute. Your code is nowhere near solving the problem.

An HtmlNode represents a tag used in HTML, such as span, div, p, a and many others. An HtmlAttribute represents an attribute set on a node: for example, href is used on a, while style, class, id, name and so on can be used on almost any HTML tag.

In the HTML below:

<span id="firstName" style="color:#232323">Some Firstname</span>

span is the HtmlNode, while id and style are its HtmlAttributes. You can get the value Some Firstname by using the HtmlNode.InnerText property.
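To make the distinction concrete, here is a minimal sketch that parses the snippet above from a string (nothing is downloaded) and prints the node's tag name, its attributes, and its inner text. It assumes the HtmlAgilityPack package is referenced; the class name NodeVsAttributeDemo is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class NodeVsAttributeDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<span id=\"firstName\" style=\"color:#232323\">Some Firstname</span>");

        // The span element itself is an HtmlNode.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");

        // Tag name of the node.
        Console.WriteLine(span.Name);

        // Each entry in span.Attributes is an HtmlAttribute (a name/value pair).
        foreach (HtmlAttribute attr in span.Attributes)
        {
            Console.WriteLine(attr.Name + " = " + attr.Value);
        }

        // The text between the opening and closing tags.
        Console.WriteLine(span.InnerText);
    }
}
```

The attribute loop only ever yields names and values such as id and style; the text you are after lives on the node, not on any attribute.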

Also, selecting HtmlNodes from an HtmlDocument is not that straightforward. You need to provide a proper XPath expression to select the node you want.

Now, if you want to get the text inside <span id="ctl00_ContentBody_CacheName">SliverCup Studios East</span>, which is part of the HTML of someurl.com, you need to write the following code.

// Requires: using System; using System.Linq; using System.Net; using HtmlAgilityPack;
using (var client = new WebClient())
{
    string html = client.DownloadString("https://www.someurl.com");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard before filtering.
    var spans = doc.DocumentNode.SelectNodes("//span");
    if (spans != null)
    {
        // Keep only the span nodes whose id is "ctl00_ContentBody_CacheName".
        var nodes = spans
            .Where(d => d.Attributes.Contains("id"))
            .Where(d => d.Attributes["id"].Value == "ctl00_ContentBody_CacheName");

        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(node.InnerText);
        }
    }
}

The XPath //span in the above code selects every span tag in the document, at any depth; the Where clauses then narrow the result down to the span with the matching id. If you want to search only a specific part of the hierarchy, you need to use a more specific XPath.
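As an alternative to filtering with Where, the id test can be moved into the XPath expression itself. The sketch below is one possible way to write that, not part of the code above: it parses an inline HTML string rather than downloading a page, and the class name XPathPredicateDemo and the sample markup are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

class XPathPredicateDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div><p><span id=\"other\">ignored</span></p>" +
            "<span id=\"ctl00_ContentBody_CacheName\">SliverCup Studios East</span></div>");

        // The predicate [@id='...'] keeps only span elements whose id
        // attribute has exactly this value, wherever they sit in the tree.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//span[@id='ctl00_ContentBody_CacheName']");

        // SelectSingleNode returns null when nothing matches.
        if (node != null)
        {
            Console.WriteLine(node.InnerText);
        }
    }
}
```

This keeps the selection logic in one place and avoids the extra attribute checks, at the cost of embedding the id value in a string.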

This should help you resolve your issue.

Upvotes: 1
