Reputation: 342
I am trying to parse a website. I need the links in an HTML file whose URLs contain specific words. I know how to find all "href" attributes, but I don't need all of them. Is there any way to do that? For example, can I use a regex with HtmlAgilityPack?
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
// Select every anchor that has an href attribute.
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    this.dgvurl.Rows.Add(urls.Attributes["href"].Value);
}
This is what I use to find all the links in the HTML.
Upvotes: 1
Views: 668
Reputation: 5822
If you have an HTML file like this:
<div class="a">
<a href="http://www.website.com/"></a>
<a href="http://www.website.com/notfound"></a>
<a href="http://www.website.com/theword"></a>
<a href="http://www.website.com/sub/theword"></a>
<a href="http://www.website.com/theword.html"></a>
<a href="http://www.website.com/other"></a>
</div>
And you're searching for, say, the words "theword" and "other", you can define a regular expression, then use LINQ to get the links whose href attribute matches it, like this:
// Match any href that contains one of the words (case-insensitive).
Regex regex = new Regex("(theword|other)", RegexOptions.IgnoreCase);
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='a']");
// Select only anchors that have an href, so the attribute lookup is never null.
List<HtmlNode> nodeList = node.SelectNodes(".//a[@href]").Where(a => regex.IsMatch(a.Attributes["href"].Value)).ToList();
List<string> urls = new List<string>();
foreach (HtmlNode n in nodeList)
{
    urls.Add(n.Attributes["href"].Value);
}
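If the words come from a list rather than being hard-coded, the pattern can be built at run time. Here is a minimal sketch of that idea, working on plain URL strings so it runs without HtmlAgilityPack. The names LinkFilter, BuildWordRegex and FilterUrls are made up for illustration, and it assumes the word list is not empty (an empty list would produce a pattern that matches everything).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LinkFilter
{
    // Builds one case-insensitive alternation from plain words.
    // Regex.Escape makes characters such as '.' match literally.
    public static Regex BuildWordRegex(IEnumerable<string> words)
    {
        string pattern = string.Join("|", words.Select(w => Regex.Escape(w)));
        return new Regex("(" + pattern + ")", RegexOptions.IgnoreCase);
    }

    // Keeps only the URLs that contain at least one of the words.
    public static List<string> FilterUrls(IEnumerable<string> urls, IEnumerable<string> words)
    {
        Regex regex = BuildWordRegex(words);
        return urls.Where(u => regex.IsMatch(u)).ToList();
    }
}
```

With HtmlAgilityPack you would pass the regex returned by BuildWordRegex to the same Where clause as above instead of filtering strings.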
Note that XPath has a contains function, but you'll have to repeat the condition for each word you're searching for, like this:
node.SelectNodes(".//a[contains(@href,'theword') or contains(@href,'other')]")
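The repeated conditions can also be generated from a word list instead of typed by hand. This is only a rough sketch: XPathBuilder and BuildHrefContainsXPath are made-up names, and it assumes the words contain no single quotes, since those would break the XPath string literal. Keep in mind that contains in XPath 1.0 is case-sensitive, unlike the regex with RegexOptions.IgnoreCase.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class XPathBuilder
{
    // Produces e.g. .//a[contains(@href,'theword') or contains(@href,'other')]
    // Assumes no word contains a single quote.
    public static string BuildHrefContainsXPath(IEnumerable<string> words)
    {
        IEnumerable<string> conditions = words.Select(w => "contains(@href,'" + w + "')");
        return ".//a[" + string.Join(" or ", conditions) + "]";
    }
}
```

The resulting string can be passed straight to node.SelectNodes.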
There's also a matches function in XPath, but unfortunately it's only available in XPath 2.0, and HtmlAgilityPack uses XPath 1.0. With XPath 2.0, you could do something like this:
node.SelectNodes(".//a[matches(@href,'(theword|other)')]")
Upvotes: 1
Reputation: 342
I found this, and it works for me.
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
    // Read the href once, then keep it only if it contains the word.
    var temp = urls.Attributes["href"].Value;
    if (temp.Contains("some_word"))
    {
        dgv.Rows.Add(temp);
    }
}
Upvotes: 0