Reputation: 27
I'm trying to extract the text from this HTML tag:
<span id="example1">sometext</span>
And I have this code:
using System;
using System.Net;
using HtmlAgilityPack;

namespace GC_data_console
{
    class Program
    {
        public static void Main(string[] args)
        {
            using (var client = new WebClient())
            {
                // Download the HTML
                string html = client.DownloadString("https://www.requestedwebsite.com");
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);
                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//span"))
                {
                    HtmlAttribute href = link.Attributes["id='example1'"];
                    if (href != null)
                    {
                        Console.WriteLine(href.Value.ToString());
                        Console.ReadLine();
                    }
                }
            }
        }
    }
}
But I am still not getting the text sometext.
But if I insert:
HtmlAttribute href = link.Attributes["id"];
I get all the id values instead. What am I doing wrong?
Upvotes: 1
Views: 2513
Reputation: 6891
You first need to understand the difference between an HtmlNode and an HtmlAttribute. Your code is nowhere near solving the problem.
An HtmlNode represents a tag used in HTML, such as span, div, p, a, and many others. An HtmlAttribute represents an attribute set on such a node: for example, href is used on a, while style, class, id, name and so on can appear on almost any HTML tag.
In the HTML below,
<span id="firstName" style="color:#232323">Some Firstname</span>
span is the HtmlNode, while id and style are its HtmlAttributes. You can get the value Some Firstname through the HtmlNode.InnerText property.
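To make the distinction concrete, here is a minimal sketch that loads the span fragment above and prints the node's tag name, its attributes, and its inner text. It assumes the HtmlAgilityPack NuGet package is referenced; the class name NodeVersusAttribute is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical demo class, not part of the original question.
public static class NodeVersusAttribute
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<span id=\"firstName\" style=\"color:#232323\">Some Firstname</span>");

        // The span element itself is an HtmlNode.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        Console.WriteLine("Node: " + span.Name);

        // id and style are HtmlAttribute objects that belong to that node.
        foreach (HtmlAttribute attribute in span.Attributes)
        {
            Console.WriteLine("Attribute: " + attribute.Name + " = " + attribute.Value);
        }

        // The text between the opening and closing tags comes from the node, not from an attribute.
        Console.WriteLine("Text: " + span.InnerText);
    }
}
```

Note that Attributes["id"] looks an attribute up by name only; a string such as "id='example1'" is not a valid attribute name, so that lookup returns null.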
Also, selecting HtmlNodes from an HtmlDocument is not that straightforward. You need to provide a proper XPath expression to select the node you want.
Now, if you want to get the text inside <span id="ctl00_ContentBody_CacheName">SliverCup Studios East</span>, which is part of the HTML of someurl.com, you need to write the following code.
// Requires "using System.Linq;" (for Where) in addition to System.Net and HtmlAgilityPack.
using (var client = new WebClient())
{
    string html = client.DownloadString("https://www.someurl.com");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard before filtering.
    var spans = doc.DocumentNode.SelectNodes("//span");
    if (spans != null)
    {
        // Keep only the span nodes whose id attribute equals "ctl00_ContentBody_CacheName".
        var nodes = spans
            .Where(d => d.Attributes.Contains("id"))
            .Where(d => d.Attributes["id"].Value == "ctl00_ContentBody_CacheName");

        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(node.InnerText);
        }
    }
}
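The XPath //span in the above code selects every span tag in the document, at any depth in the hierarchy; the Where calls then keep only those whose id matches. To narrow the selection further, use a more specific XPath. For example, //span[@id='ctl00_ContentBody_CacheName'] puts the id condition into the XPath itself.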
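As a sketch of selecting by id directly in the XPath, the helper below uses an attribute predicate with SelectSingleNode instead of filtering in C#. The class SpanLookup and the method GetSpanTextById are hypothetical names introduced for illustration, the HTML string stands in for a downloaded page, and the code assumes the id contains no single quote.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class SpanLookup
{
    // Returns the text of the span with the given id, or null when no such span exists.
    public static string GetSpanTextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The predicate [@id='...'] filters on the attribute value inside the XPath.
        // Assumption: id contains no single quote, otherwise the expression breaks.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[@id='" + id + "']");

        // SelectSingleNode returns null when nothing matches.
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        // Stand-in for the page downloaded with WebClient; the span is nested on purpose.
        string html = "<div><p><span id=\"example1\">sometext</span></p></div>";

        Console.WriteLine(GetSpanTextById(html, "example1"));
        Console.WriteLine(GetSpanTextById(html, "missing") ?? "(not found)");
    }
}
```

Because // matches descendants at any depth, the nested span is found without spelling out the full path from the document root.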
This should help you resolve your issue.
Upvotes: 1