Jack

Reputation: 33

Use C# to reparse a string already containing asterisk character replacements

I received a very helpful response to a previous question raised here.

Use C# to surround phrases in a string with asterisk characters from a dictionary of phrases

I am now posting a follow-up question for a specific issue.

The basic premise of my original query was that I have an array of words and phrases (for example "Wheat Flour" and "Nuts").

I process a string of text such as the following.

"Salt, Water, Wheat Flour, Palm Oil, Nuts, Tree Nuts"

My goal is to have a string that looks as follows (i.e. the words and phrases from the dictionary are surrounded with asterisk characters, with the longest phrase given priority).

"Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, Tree *Nuts*"

The above is achievable by using the following Regex pattern kindly provided by Dmitry Bychenko.

  string pattern = @"\b(?<!\*)(?:" + string.Join("|", words
    .Distinct()
    .OrderByDescending(chunk => chunk.Length)
    .Select(chunk => Regex.Escape(chunk))) + @")(?!\*)\b";

My question concerns the case where the string I am dealing with has already been processed.

Imagine I have a string that has already been processed, such as the following.

"Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, Tree *Nuts*"

If the array of words I want to replace within the above string now contains a more specific phrase such as "Tree Nuts", is there a regular expression that can detect that the following phrase should be replaced?

"Tree *Nuts*"

That is, this section of the string should be updated to the following.

"*Tree Nuts*"

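To illustrate the problem, here is a minimal sketch of what happens when the same pattern is simply run again on the processed string. The dictionary below is assumed for illustration (with "Tree Nuts" as the newly added phrase), and the code is written as a top-level program, which needs C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed dictionary for illustration; "Tree Nuts" is the newly added phrase.
string[] updatedWords = { "Wheat Flour", "Nuts", "Tree Nuts" };

// Same pattern as above: longest phrases first, skipping anything next to '*'.
string markPattern = @"\b(?<!\*)(?:" + string.Join("|", updatedWords
    .Distinct()
    .OrderByDescending(chunk => chunk.Length)
    .Select(chunk => Regex.Escape(chunk))) + @")(?!\*)\b";

string processed = "Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, Tree *Nuts*";
string reprocessed = Regex.Replace(processed, markPattern, m => "*" + m.Value + "*");

// "Tree Nuts" cannot match the text "Tree *Nuts*", and the lookarounds skip
// the words that are already marked, so the string comes back unchanged.
Console.WriteLine(reprocessed == processed); // True
```

So a second pass with the existing pattern leaves "Tree *Nuts*" as it is.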
Upvotes: 0

Views: 773

Answers (1)

Dmitrii Bychenko

Reputation: 186748

As a quick solution, I suggest implementing a two-stage replacement.

First, let's remove the "erroneous" *, i.e. let's turn any *word* into word:

  string[] words = new string[] {
    "Flour",
    "Wheat Flour",
    "Nut",
    "Nuts",
    "Tree Nuts"
  };

  string removePattern = @"(?:" + string.Join("|", words
    .Distinct()
    .OrderByDescending(chunk => chunk.Length)
    .Select(chunk => $@"\*{Regex.Escape(chunk)}\*")) + @")";

So, given a text with *, we can clear it:

  string text = "Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, Tree *Nuts*";

  // unwanted * removed: 
  // "Salt, Water, Wheat Flour, Palm Oil, Nuts, Tree Nuts" 
  string cleared = Regex.Replace(text, removePattern, m => m.Value.Trim('*'));

Then (second stage) business as usual:

  string pattern = @"\b(?<!\*)(?:" + string.Join("|", words
    .Distinct()
    .OrderByDescending(chunk => chunk.Length)
    .Select(chunk => Regex.Escape(chunk))) + @")(?!\*)\b";

  string result = Regex.Replace(cleared, pattern, m => "*" + m.Value + "*");
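Putting both stages together, a minimal self-contained sketch might look like the following. The helper name `Remark`, its parameter names, and the empty-dictionary guard are illustrative additions and not part of the snippets above. The code is laid out as a top-level program, which assumes C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (the name is illustrative): strips existing markers
// around known phrases, then marks the longest matching phrases again.
static string Remark(string input, string[] phrases)
{
    // Guard added for this sketch: an empty dictionary leaves the text alone.
    if (phrases.Length == 0) return input;

    // Longest phrases first, so that "Tree Nuts" wins over "Nuts".
    string[] ordered = phrases
        .Distinct()
        .OrderByDescending(chunk => chunk.Length)
        .Select(chunk => Regex.Escape(chunk))
        .ToArray();

    // Stage 1: turn *phrase* back into phrase for every known phrase.
    string removePattern = string.Join("|", ordered.Select(chunk => $@"\*{chunk}\*"));
    string cleared = Regex.Replace(input, removePattern, m => m.Value.Trim('*'));

    // Stage 2: business as usual, surround each match with asterisks.
    string pattern = @"\b(?<!\*)(?:" + string.Join("|", ordered) + @")(?!\*)\b";
    return Regex.Replace(cleared, pattern, m => "*" + m.Value + "*");
}

string[] dictionary = { "Flour", "Wheat Flour", "Nut", "Nuts", "Tree Nuts" };
string source = "Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, Tree *Nuts*";

// Prints: Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, *Tree Nuts*
Console.WriteLine(Remark(source, dictionary));
```

Because stage 1 always clears the markers first, calling the helper again on its own output should give the same string, so it is safe to rerun whenever the dictionary changes.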

Upvotes: 1
