Reputation: 177
I am trying to figure out an efficient way to find duplicate phrases in a large string. The string will contain hundreds or thousands of words separated by an empty space. I've included code below that I am currently using but it is very inefficient in finding duplicate phrases.
public static string FindDuplicateSubstringFast(string s, string keyword, bool allowOverlap = true)
{
int matchPos = 0, maxLength = 0;
if (s.ToLower().Contains(keyword.ToLower()))
for (int shift = 1; shift < s.Length; shift++)
{
int matchCount = 0;
for (int i = 0; i < s.Length - shift; i++)
{
if (s[i] == s[i + shift])
{
matchCount++;
if (matchCount > maxLength)
{
maxLength = matchCount;
matchPos = i - matchCount + 1;
}
if (!allowOverlap && (matchCount == shift))
{
// we have found the largest allowable match
// for this shift.
break;
}
}
else matchCount = 0;
}
}
string newbs = s.Substring(matchPos, maxLength);
if (maxLength > 3) return s.Substring(matchPos, maxLength);
else return null;
}
I found the example code above @ Find duplicate content in string?
This method is going through every char and I would like to find a way to loop through each word. I'm not sure what would be the best way to do this. I was thinking I could split the string on the empty spaces and then put the words into a list. Iterating through a list should be way more efficient than iterating over every char like I am doing now. However, I don't know how I would iterate through the list and find duplicate phrases.
If anyone could help me figure out an algorithm to iterate through a list to find duplicate phrases, I would be very grateful. I would also be open to any other ideas or methods to find duplicate phrases within a large string.
Please let me know if any more info is needed.
EDIT: Here is an example of a large string {its small for this example}
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.
For example sake "Lorem Ipsum" would be the duplicate phrase. I need to return "Lorem Ipsum" and any other duplicate phrases that appear in the string more than once.
Upvotes: 3
Views: 3046
Reputation: 637
string[] split = BigString.Split(' ').ToLower();
var duplicates = new Dictionary<string, int>();
for (int i = 0;i<split.Length;i++)
{
int j=i;
string s = split[i] + " ";
while(i+j<split.Length)
{
j++;
s += split[j] + " ";
if (Regex.Matches(BigString.ToLower(), s).Count ==1) break;
duplicates[s] = Regex.Matches(BigString.ToLower(), s).Count;
}
}
Now, the dictionary will contain all the phrases and "subphrases" e.g. "Lorem Ipsum Dolor" will find "Lorem Ipsum" and "Lorem Ipsum Dolor". If that's not interesting to you, it's just a matter of looping through the Keys
Collection of duplicates
. If one key is a substring of another key, and their value is the same, remove said key.
Upvotes: 6