Reputation: 95
I have an HTML document that contains multiple divs.
Example:
<div class="element">
<div class="title">
<a href="127.0.0.1" title="Test">Test</a>
</div>
</div>
Right now I'm using this code to extract the title attribute.
List<string> items = new List<string>();
var nodes = Web.DocumentNode.SelectNodes("//*[@title]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        foreach (var attribute in node.Attributes)
            if (attribute.Name == "title")
                items.Add(attribute.Value);
    }
}
I don't know how to adapt my code to extract the href and the title attributes at the same time.
Each div should become an object whose properties hold the attribute values of the a tag it contains.
public class CheckBoxListItem
{
    public string Text { get; set; }
    public string Href { get; set; }
}
Upvotes: 1
Views: 1623
Reputation: 4817
You can use the following XPath query to retrieve only the a tags that have both a title and an href attribute:
//a[@title and @href]
Then you can use your code like this:
List<CheckBoxListItem> items = new List<CheckBoxListItem>();
var nodes = Web.DocumentNode.SelectNodes("//a[@title and @href]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        items.Add(new CheckBoxListItem()
        {
            Text = node.Attributes["title"].Value,
            Href = node.Attributes["href"].Value
        });
    }
}
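If you need exactly one object per div, as the question asks, one option is to select the outer divs first and then run a relative XPath query inside each of them. The following is only a minimal sketch: the LinkExtractor class and its Extract method are hypothetical names, and it assumes the outer divs carry exactly class="element", as in the question's sample markup.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class CheckBoxListItem
{
    public string Text { get; set; }
    public string Href { get; set; }
}

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LinkExtractor
{
    public static List<CheckBoxListItem> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = new List<CheckBoxListItem>();

        // Assumption: the class attribute is exactly "element".
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var divs = doc.DocumentNode.SelectNodes("//div[@class='element']");
        if (divs == null)
            return items;

        foreach (var div in divs)
        {
            // The leading "." keeps the search inside the current div.
            var anchor = div.SelectSingleNode(".//a[@title and @href]");
            if (anchor == null)
                continue;

            items.Add(new CheckBoxListItem
            {
                Text = anchor.GetAttributeValue("title", string.Empty),
                Href = anchor.GetAttributeValue("href", string.Empty)
            });
        }

        return items;
    }
}
```

Calling LinkExtractor.Extract with the sample markup from the question should give a single item holding the title and href of the a tag. Divs that contain no matching a tag are skipped rather than added as empty objects.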
Upvotes: 1
Reputation: 11358
I often use the ScrapySharp package together with HtmlAgilityPack for CSS selection.
Add a using statement for ScrapySharp.Extensions so you can use the CssSelect method:
using System.Linq; // needed for FirstOrDefault
using HtmlAgilityPack;
using ScrapySharp.Extensions;
In your case, I would do:
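HtmlWeb w = new HtmlWeb();
var htmlDoc = w.Load("myUrl");
// Select every element with the "title" class.
var titles = htmlDoc.DocumentNode.CssSelect(".title");
foreach (var title in titles)
{
    string href = string.Empty;
    // FirstOrDefault (System.Linq) returns null when the element contains no a tag.
    var anchor = title.CssSelect("a").FirstOrDefault();
    if (anchor != null)
    {
        // Falls back to an empty string when the href attribute is missing.
        href = anchor.GetAttributeValue("href", string.Empty);
    }
}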
Upvotes: 1