Thomas
Thomas

Reputation: 53

The most frequent substring of length X

We have a string of length N and the number X.

How to find the most frequent substring of length X in the string of length N in average O(N) time?

I think, here is a similar question: https://stackoverflow.com/questions/1597025?tab=votes#tab-top

I would like to ask you how to prove that the number of used hashing functions is only a constant.

Upvotes: 5

Views: 7799

Answers (2)

Aryabhatta
Aryabhatta

Reputation:

A suffix tree should give this in O(n) time worst case, with O(n) space usage.

In particular check the Functionality section of the above wiki page under the Properties of Strings sub section, which mentions

Find the most frequently occurring substrings of a minimum length in Θ(n) time.

Upvotes: 3

Mihran Hovsepyan
Mihran Hovsepyan

Reputation: 11088

I suggest this kind of hash function. Lets condider that each string is number in 256 base notation (instead of our 10 based). So for each X length substring we can get its value in 10 base notation this way:

#include <iostream>
#include <string>
#include <map>
#include <algorithm>


int main()
{
    std::string s;
    int x;
    std::cin >> s >> x;

    unsigned const int base = 256;
    unsigned long long xPowOfBase = 1;
    int i = 0;
    for(i = 1; i <= x; ++i)
        xPowOfBase *= base;

    unsigned long long firstXLengthSubString = 0;
    for(i = 0; i < x; ++i)
    {
        firstXLengthSubString *= base;
        firstXLengthSubString += s[i];
    }

    unsigned long long nextXLengthSubstring = firstXLengthSubString;

    std::map<unsigned long long, std::pair<int, int> > hashTable;
    for(;i <= s.size(); ++i)
    {
        if(hashTable.find(nextXLengthSubstring) != hashTable.end())
            ++hashTable[nextXLengthSubstring].first;
        else
            hashTable.insert(std::make_pair(nextXLengthSubstring, std::make_pair(1, i - x)));

        if(i != s.size())
        {
            nextXLengthSubstring *= base;
            nextXLengthSubstring += s[i];
            nextXLengthSubstring -= s[i - x] * xPowOfBase;
        }
    }

    std::map<unsigned long long, std::pair<int, int> >::iterator it = hashTable.begin();
    std::map<unsigned long long, std::pair<int, int> >::iterator end_it = hashTable.end();
    std::pair<int, int> maxCountAndFirstPosition = std::make_pair(0, -1);

    for(;it != end_it; ++it)
    {
        if(maxCountAndFirstPosition.first < it->second.first)
            maxCountAndFirstPosition = it->second;
    }

    std::cout << maxCountAndFirstPosition.first << std::endl;
    std::cout << s.substr(maxCountAndFirstPosition.second, x) << std::endl;
    return 0;
}

This will work on O(n * log(n)) , to make it O(n) just change std::map wiht any hash table.

Upvotes: 0

Related Questions