iaacp

Reputation: 4845

Get only the text of a webpage using HTML Agility Pack?

I'm trying to scrape a web page to get just its text. I put each word into a dictionary and count how many times it appears on the page. I'm trying to use HTML Agility Pack, as suggested in this post: How to get number of words on a web page?

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
int wordCount = 0;
Dictionary<string, int> dict = new Dictionary<string, int>();

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    MatchCollection matches = Regex.Matches(node.InnerText, @"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);
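    foreach (Match s in matches)
    {
        // Add the entry to the dictionary, or bump its count if it is already there
        string word = s.Value.ToLowerInvariant();
        dict[word] = dict.TryGetValue(word, out int seen) ? seen + 1 : 1;
        wordCount++;
    }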
}

However, with my current implementation, I'm still getting lots of results that are from the markup that should not be counted. It's close, but not quite there yet (I don't expect it to be perfect).

I'm using this page as an example. My results show many occurrences of the words "width" and "googletag", even though neither appears in the visible text of the page.

Any suggestions on how to fix this? Thanks!

Upvotes: 2

Views: 1295

Answers (1)

Mert Akcakaya

Reputation: 3149

You can't be sure whether the word you are searching for is actually displayed to the user, because JavaScript execution and CSS rules both affect what ends up being rendered.

The following program finds 0 matches for "width" and "googletag", but it finds 126 matches for "html", whereas Chrome's Ctrl+F finds 106.

Note that the program does not match a word if its parent node is <script>.

using HtmlAgilityPack;
using System;

namespace WordCounter
{
    class Program
    {
        private static readonly Uri Uri = new Uri("https://www.w3schools.com/html/html_editors.asp");

        static void Main(string[] args)
        {
            var doc = new HtmlWeb().Load(Uri);
            var nodes = doc.DocumentNode.SelectSingleNode("//body").DescendantsAndSelf();
            var word = Console.ReadLine()?.ToLower();
            // Stop on "exit" or when input ends (ReadLine returns null).
            while (word != null && word != "exit")
            {
                var count = 0;
                foreach (var node in nodes)
                {
                    // Count text nodes that contain the word, skipping the contents of <script> elements.
                    if (node.NodeType == HtmlNodeType.Text && node.ParentNode.Name != "script" && node.InnerText.ToLower().Contains(word))
                    {
                        count++;
                    }
                }

                Console.WriteLine($"{word} is displayed {count} times.");
                word = Console.ReadLine()?.ToLower();
            }
        }
    }
}
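
To tie this back to the dictionary in the question, here is a minimal sketch of the same filtering applied to a word-frequency count. The names WordFrequency and Count are hypothetical, chosen only for illustration. Skipping <style> as well as <script>, and decoding entities with HtmlEntity.DeEntitize, are my additions rather than something the program above does. It takes the HTML as a string so it can be tried without a network call, and it still cannot account for text hidden by CSS or injected by JavaScript.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class WordFrequency
{
    // Same word pattern as in the question: words of 2+ letters, plus "a" and "i".
    private static readonly Regex WordPattern =
        new Regex(@"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);

    // Counts words in text nodes under <body>, skipping <script> and <style> content.
    public static Dictionary<string, int> Count(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Case-insensitive keys, so "Hello" and "hello" share one entry.
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        var body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null) return counts;

        foreach (var node in body.DescendantsAndSelf())
        {
            if (node.NodeType != HtmlNodeType.Text) continue;

            var parent = node.ParentNode?.Name;
            if (parent == "script" || parent == "style") continue;

            // Decode entities such as &amp; so they are not counted as words.
            var text = HtmlEntity.DeEntitize(node.InnerText);
            foreach (Match m in WordPattern.Matches(text))
            {
                counts.TryGetValue(m.Value, out var seen); // seen is 0 if the key is missing
                counts[m.Value] = seen + 1;
            }
        }
        return counts;
    }
}
```

To use it with a live page, pass in the markup you already loaded, for example WordFrequency.Count(new HtmlWeb().Load(url).DocumentNode.OuterHtml). Unlike the program above, this counts every occurrence of a word rather than the number of text nodes that contain it.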

Upvotes: 3
