Reputation: 598
Im looking for a way to find all repeating sequences consisting of at least 3 characters on an input file and then print out the most frequent ones! it seems to require a lot of string processing and intense searching through the input file specially cause there is no upper bound on the max size of the patterns to look for!
Is there any efficient algorithm to do it with the least possible processing and messiness? Should I use string.h or I'd be better off with char arrays? Any tips/helpful snippets etc. on how to start?
tnx
Upvotes: 1
Views: 2188
Reputation: 258618
Finding the most frequent one is quite easy, if you realize that the most frequent sequence is 4 characters long. It can be done in O(n)
time, where n
is the size of the input file.
You can build a std::map<string,int>
, iterate character by character taking sequences of 4 characters at a time, and increment the value with the respective key in the map.
Upvotes: 1
Reputation: 70939
I would suggest that you do create a suffix tree from the file. This will have linear complexity with respect to the size of your file and will solve the problem. You can modify the algorithm just a little bit to store how many times is a string met apart from the string itself. Here is a great post explaining how to create a suffix tree.
Upvotes: 5