I received a very helpful response to a previous question raised here.
Use C# to surround phrases in a string with asterisk characters from a dictionary of phrases
I am now posting a follow-up question for a specific issue.
The basic premise for my original query was that I have an array of words and phrases such as the following.
After processing a string of text such as the following.
"Salt, Water, Wheat Flour, Palm Oil, Nuts, Tree Nuts"
My goal is to have a string that looks as follows (i.e. the words and phrases from the dictionary are surrounded with asterisk characters, with the longest phrase given priority).
"Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, Tree *Nuts*"
The above is achievable by using the following Regex pattern kindly provided by Dmitry Bychenko.
string pattern = @"\b(?<!\*)(?:" + string.Join("|", words
    .Distinct()
    .OrderByDescending(chunk => chunk.Length)
    .Select(chunk => Regex.Escape(chunk))) + @")(?!\*)\b";
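For context, here is a minimal sketch of how that pattern might be applied to the sample text. The class name PhraseMarker, the method name Mark, and the four-entry dictionary in the usage note are assumptions made for illustration; only the pattern itself comes from the earlier answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original post.
public static class PhraseMarker
{
    // Surrounds each dictionary phrase found in text with asterisks.
    // Longer phrases come first in the alternation, so they take priority.
    public static string Mark(string text, string[] words)
    {
        string pattern = @"\b(?<!\*)(?:" + string.Join("|", words
            .Distinct()
            .OrderByDescending(chunk => chunk.Length)
            .Select(chunk => Regex.Escape(chunk))) + @")(?!\*)\b";

        // The lookarounds skip phrases that already touch an asterisk,
        // so running Mark twice does not add a second pair of asterisks.
        return Regex.Replace(text, pattern, m => "*" + m.Value + "*");
    }
}
```

With an assumed dictionary of "Flour", "Wheat Flour", "Nut" and "Nuts", calling PhraseMarker.Mark on "Salt, Water, Wheat Flour, Palm Oil, Nuts, Tree Nuts" gives the starred string shown above.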
My follow-up question concerns a string that has already been processed, such as the following.
"Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, Tree *Nuts*"
If the array of words I want to replace within the above string now contains a more specific phrase, such as "Tree Nuts", is there a regular expression that can detect that the following phrase should be replaced?
"Tree *Nuts*"
That is, this section of the string should be updated to the following.
"*Tree Nuts*"
Reputation: 186748
As a quick solution, I suggest a two-stage replacement.
First, let's remove the "erroneous" * characters, i.e. let's turn any *word* into word:
string[] words = new string[] {
    "Flour",
    "Wheat Flour",
    "Nut",
    "Nuts",
    "Tree Nuts"
};

string removePattern = @"(?:" + string.Join("|", words
    .Distinct()
    .OrderByDescending(chunk => chunk.Length)
    .Select(chunk => $@"\*{Regex.Escape(chunk)}\*")) + @")";
So, given text with *, we can clear it:
string text = "Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, Tree *Nuts*";
// unwanted * removed:
// "Salt, Water, Wheat Flour, Palm Oil, Nuts, Tree Nuts"
string cleared = Regex.Replace(text, removePattern, m => m.Value.Trim('*'));
Then (second stage) business as usual:
string pattern = @"\b(?<!\*)(?:" + string.Join("|", words
    .Distinct()
    .OrderByDescending(chunk => chunk.Length)
    .Select(chunk => Regex.Escape(chunk))) + @")(?!\*)\b";

string result = Regex.Replace(cleared, pattern, m => "*" + m.Value + "*");
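The two stages can be folded into one helper, which may be convenient when the dictionary changes over time. This is only a sketch: the class name PhraseRemarker and the method name Remark are hypothetical, and it assumes that every *phrase* already in the text was produced by an earlier pass over the same kind of dictionary.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the two-stage approach; the names are illustrative.
public static class PhraseRemarker
{
    public static string Remark(string text, string[] words)
    {
        // Escape each phrase once and put the longest phrases first.
        string[] escaped = words
            .Distinct()
            .OrderByDescending(chunk => chunk.Length)
            .Select(chunk => Regex.Escape(chunk))
            .ToArray();

        // Stage 1: turn any *phrase* from the dictionary back into phrase.
        string removePattern = "(?:" + string.Join("|",
            escaped.Select(chunk => @"\*" + chunk + @"\*")) + ")";
        string cleared = Regex.Replace(text, removePattern, m => m.Value.Trim('*'));

        // Stage 2: mark the phrases again, longest phrase first.
        string pattern = @"\b(?<!\*)(?:" + string.Join("|", escaped) + @")(?!\*)\b";
        return Regex.Replace(cleared, pattern, m => "*" + m.Value + "*");
    }
}
```

For the words array above and the already processed text, PhraseRemarker.Remark should return "Salt, Water, *Wheat Flour*, Palm Oil, *Nuts*, *Tree Nuts*". Because stage one always clears the old markers first, calling it again on its own output leaves the string unchanged.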