MiP

Reputation: 6432

Count occurrences of substrings within a string without counting overlapping duplicates

For example, I have a list of terms and a string:

var terms = new[] { "programming language", "programming", "language" };

var content = "A programming language is a formal language that "
    + "specifies a set of instructions that can be used to "
    + "produce various kinds of output.";

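I can sum Regex.Matches(content, term).Count over the terms, which reports 4 matches in the string: 1 for "programming language", 1 for "programming" and 2 for "language".

But some of these overlap: the matches for "programming" and the first "language" lie inside the "programming language" match. There should be only 2 occurrences: "programming language" and the second "language".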

My current solution is to save the start and end index of each occurrence, then check each new match against the saved occurrences and skip it if it falls within a range that has already been counted. Is there a better way that doesn't require tracking start and end indexes?
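For reference, a minimal sketch of the index-based approach described above. The names (counted, overlaps) are illustrative, and processing longer terms first is an assumption made so that "programming language" claims its span before the shorter terms are checked:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var terms = new[] { "programming language", "programming", "language" };
var content = "A programming language is a formal language that "
    + "specifies a set of instructions that can be used to "
    + "produce various kinds of output.";

// Spans (start inclusive, end exclusive) that have already been counted.
var counted = new List<(int Start, int End)>();

// Assumption: longer terms go first so they take priority over their parts.
foreach (var term in terms.OrderByDescending(t => t.Length))
{
    foreach (Match m in Regex.Matches(content, Regex.Escape(term)))
    {
        int start = m.Index, end = m.Index + m.Length;

        // Skip this match if it overlaps any span counted earlier.
        bool overlaps = counted.Any(s => start < s.End && s.Start < end);
        if (!overlaps) counted.Add((start, end));
    }
}

Console.WriteLine(counted.Count); // 2 for the sample content
```

This works, but every match has to be compared against every saved span, which is the bookkeeping I would like to avoid.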

Upvotes: 0

Views: 259

Answers (1)

MiP

Reputation: 6432

Following suggestions from the comments, I have a simple solution using a single regex. It matches whole terms only, where a term must be delimited by whitespace or the start/end of the string, i.e. "programming language" is counted but "programming languages" is not:

var pattern = @"(?<!\S)(?:programming language|programming|language)(?!\S)";
var count = Regex.Matches(content, pattern).Count;

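Note: this pattern only works when "programming language" is placed before "programming" and "language" in the alternation. The regex engine tries alternatives from left to right and a match consumes its text, so a longer term must come before the shorter terms it contains. If anyone can contribute a better solution, please do so.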
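The ordering requirement can be handled in code rather than by hand. The following is a sketch, not a tested general solution: it assumes that sorting the terms by length (longest first) is enough to put every term before the shorter terms it contains, and it uses Regex.Escape so that terms with regex metacharacters are matched literally. The variable name alternatives is illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Terms deliberately given in the "wrong" order.
var terms = new[] { "language", "programming", "programming language" };
var content = "A programming language is a formal language that "
    + "specifies a set of instructions that can be used to "
    + "produce various kinds of output.";

// Longest terms first, so the alternation tries "programming language"
// before "programming" and "language".
var alternatives = terms
    .OrderByDescending(t => t.Length)
    .Select(t => Regex.Escape(t));

// Same whitespace-delimited boundaries as the hand-written pattern.
var pattern = @"(?<!\S)(?:" + string.Join("|", alternatives) + @")(?!\S)";
var count = Regex.Matches(content, pattern).Count;

Console.WriteLine(count); // 2 for the sample content
```

Because Regex.Matches returns non-overlapping matches, the text consumed by "programming language" is never counted again for the shorter terms.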

Upvotes: 1
