Get only the text of a webpage using HTML Agility Pack?

Question

I'm trying to scrape a web page to get just its text. I put each word into a dictionary and count how many times it appears on the page. I'm using HTML Agility Pack, as suggested in this post: How to get number of words on a web page?

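HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
int wordCount = 0;
Dictionary<string, int> dict = new Dictionary<string, int>();

// Note: "//text()" selects every text node, including those inside <script> and <style>
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    MatchCollection matches = Regex.Matches(node.InnerText, @"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);
    foreach (Match s in matches)
    {
        // Add the entry to the dictionary, or increment its count if already present
        string word = s.Value.ToLowerInvariant();
        int count;
        dict.TryGetValue(word, out count);
        dict[word] = count + 1;
        wordCount++;
    }
}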

However, with my current implementation, I'm still getting many results that come from the markup and should not be counted. It's close, but not quite there yet (I don't expect it to be perfect).

I'm using this page as an example. My results include many occurrences of the words "width" and "googletag", even though neither appears in the visible text of the page.

Any suggestions on how to fix this? Thanks!
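
For illustration, here is a minimal sketch of one possible direction. It assumes that the stray words ("width", "googletag") come from text nodes inside script and style elements, which "//text()" also selects. The class name WordCounter, the method CountWords, and the sample HTML are hypothetical and used only for this example; it requires the HtmlAgilityPack NuGet package and a compiler that supports C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class WordCounter // hypothetical helper, not part of the library
{
    public static Dictionary<string, int> CountWords(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: the unwanted words live inside <script> and <style>.
        // Copy to a list first, since nodes are removed while walking the tree.
        List<HtmlNode> nonVisible = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (HtmlNode node in nonVisible)
        {
            node.Remove();
        }

        var counts = new Dictionary<string, int>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
        {
            return counts;
        }

        foreach (HtmlNode node in textNodes)
        {
            // Decode entities such as &amp; before matching words.
            string text = HtmlEntity.DeEntitize(node.InnerText);
            foreach (Match m in Regex.Matches(text, @"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase))
            {
                string word = m.Value.ToLowerInvariant();
                counts.TryGetValue(word, out int count); // count stays 0 if the word is new
                counts[word] = count + 1;
            }
        }
        return counts;
    }

    public static void Main()
    {
        // Made-up sample page for demonstration only.
        string html = "<html><head><style>p { width: 10px; }</style>"
                    + "<script>googletag.cmd = [];</script></head>"
                    + "<body><p>A cat and a dog</p></body></html>";

        foreach (KeyValuePair<string, int> entry in CountWords(html))
        {
            Console.WriteLine(entry.Key + ": " + entry.Value);
        }
    }
}
```

This only strips script and style elements. Text that is hidden by CSS, or that sits inside elements such as noscript, would still be counted, so the result is an approximation of the visible text rather than an exact match.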
