Using C# and Regex to find and surround all words and numbers within some html text with a span

Question

I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...

1) Special html characters like &rdquo; &ldquo; are treated as words.

2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")

3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"

I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions. What I have so far is:

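string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
                  wordCnt++;
                  // Span markup reconstructed; the exact id format is an assumption.
                  return "<span id=\"w" + wordCnt + "\">" + m.Value + "</span>";
 });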

How can I fix/extend the above pattern to cater for these problems, or should I be using a different approach altogether?

Joe DeCock · Accepted Answer

A fundamental problem that you're up against here is that HTML is not a "regular language". In practice this means that, whatever regular expression you write, there will always be valid HTML that it handles incorrectly. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve in general.
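To make that concrete, here is a small sketch of one such failure. It uses a simplified version of the lookahead from the question, (?![^<]*?>), which tries to skip anything that looks like it sits inside a tag. The class and method names are made up for illustration. A bare > is allowed in HTML text, and it is enough to fool the heuristic:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Simplified stand-in for the question's pattern: a word counts as text
    // only if no '>' follows it before the next '<'.
    public static string[] WordsOutsideTags(string html) =>
        Regex.Matches(html, @"\b\w+\b(?![^<]*?>)")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    static void Main()
    {
        // The "2" is ordinary text, but the bare '>' after it makes the
        // lookahead treat it as part of a tag, so it is silently skipped.
        var words = WordsOutsideTags("<p>2 > 1 always</p>");
        Console.WriteLine(string.Join(",", words)); // prints "1,always"
    }
}
```

You can patch the pattern for this particular input, but there will be another input that breaks the patched version.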

What you need is a dedicated HTML parser. You could try the HtmlAgilityPack NuGet package. There are many others, but HtmlAgilityPack is quite popular.

Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node to represent the p tag, with a child text node to hold the "Hello World". So what you want to do is find all the text nodes in your document, and then, for each node, split the text into words and surround the words with spans.

You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
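If you want to see what that xpath selects without installing anything, here is a rough sketch that runs the same query through the XmlDocument class that ships with .NET. This is a stand-in, not HtmlAgilityPack: it only accepts well-formed XML, and the class name, method name and sample markup are made up for the demo.

```csharp
using System;
using System.Linq;
using System.Xml;

static class XPathDemo
{
    // Runs the xpath from the answer against well-formed XHTML and returns
    // the content of every text node it selects, in document order.
    public static string[] SelectText(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        return doc.SelectNodes("/html/body//*[not(self::script)]/text()")
                  .Cast<XmlNode>()
                  .Select(n => n.Value)
                  .ToArray();
    }

    static void Main()
    {
        var xhtml =
            "<html><head><title>Skipped</title></head><body>" +
            "<p>Hello <b>big</b> World</p>" +
            "<script>var skipped = 1;</script>" +
            "</body></html>";
        // Selects the text nodes under <p> and <b>; the <title> and
        // <script> contents are not selected.
        foreach (var text in SelectText(xhtml))
            Console.WriteLine($"[{text}]");
    }
}
```

One thing to be aware of: //* only matches elements below body, so text sitting directly inside body (not wrapped in another element) is not selected by this query.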

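using System;
using System.Linq;
using HtmlAgilityPack;
using static System.Console;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load(args[0]);
        var wordCount = 0;
        // Text nodes in the body, skipping text directly inside <script>.
        var nodes = doc.DocumentNode
                       .SelectNodes("/html/body//*[not(self::script)]/text()");
        // SelectNodes returns null (not an empty list) when nothing matches.
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                var words = node.InnerHtml.Split(' ');
                var surroundedWords = words.Select(word =>
                {
                    if (String.IsNullOrWhiteSpace(word))
                    {
                        return word;
                    }
                    else
                    {
                        // The id format is an assumption; any unique value works.
                        return $"<span id=\"word-{wordCount++}\">{word}</span>";
                    }
                });
                // Join with the same separator we split on so spacing is kept.
                var newInnerHtml = String.Join(" ", surroundedWords);
                node.InnerHtml = newInnerHtml;
            }
        }

        WriteLine(doc.DocumentNode.InnerHtml);
    }
}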
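Once you are only looking at the content of a text node, regex is a reasonable tool again, because there are no tags left to trip over. As a possible alternative to the Split(' ') call in the program above, this sketch wraps every run of non-whitespace characters, which also copes with tabs and newlines and keeps things like $2,500 and one-legged-man in one piece. WrapWords and the w0, w1, ... id scheme are made-up names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

static class WordWrapper
{
    // Wraps every run of non-whitespace characters in a numbered span and
    // leaves the whitespace between runs untouched.
    public static string WrapWords(string text, ref int counter)
    {
        // A ref parameter can't be captured by a lambda, so work on a copy.
        int n = counter;
        var result = Regex.Replace(text, @"\S+",
            m => $"<span id=\"w{n++}\">{m.Value}</span>");
        counter = n;
        return result;
    }

    static void Main()
    {
        int counter = 0;
        // "one-legged-man" and "$2,500" each end up in a single span.
        Console.WriteLine(WrapWords("A one-legged-man paid $2,500", ref counter));
    }
}
```

Note that punctuation stays attached to its word (so "man." is one token), and an entity such as &rdquo; is wrapped together with whatever it touches. If that matters for your use case, you would need a more specific pattern than \S+.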
