Find number of occurrences of substring

I have a small problem. I'm solving one programming task, but have a problem with it. It is simple one, but time limit make it a bit harder.

Find number of occurrences of substring. You will be given M - length of substring; substring to find, N - length of base string; base string.
M <= 100 000
N<= 200 000

Input

10
budsvabbud
79
uaahskuskamikrofonubudsvabbudnebudlabutkspkspkspmusimriesitbudsvabbudsvabbudnel

Output
3

I tried to use using build-in function find,but it wasn't fast enough:

#include<iostream>
#include<string>

using namespace std;

int main()
{
    int n;
    int occurrences = 0;
    string::size_type start = 0;
    string base_string, to_find;
    cin >> n >> to_find >> n >> base_string;
    while ((start = base_string.find(to_find, start)) != string::npos) {
        ++occurrences;
        start++;; // see the note
    }
    cout << occurrences << endl;
}

So I tried to write my own function, but it was even slower:

#include<iostream>
#include<cstdio>
#include<string>
#include<queue>

using namespace std;

int main()
{
    int n, m;
    string to_find;
    queue<int> rada;  
    int occurrences = 0;
    cin >> m >> to_find >> n;
    for (int i = 0; i < n; i++)
    {
        char c;
        scanf(" %c", &c);
        int max = rada.size();
        for (int j = 0; j < max; j++)
        {
            int index = rada.front();
            rada.pop();
            if (c == to_find[index])  
            {
                if (++index == m) {
                    occurrences++;
                }
                else
                    rada.push(index);
            }
        }
        if (c == to_find[0])
        {
            if (1 == m)
                n++;
            else
                rada.push(1);
        }
    }
    cout << occurrences << endl;

}

I know some people did this in 0 ms, but my first code needs more than 2000 ms and the second one a lot more than that. Have you any ideas how to solve this? Thanks.

EDIT: Limits of length:

M <= 100 000 - length of substring

N<= 200 000 - lenght of base string

Upvotes: 0

Answers (3)

Ari Hietanen

Reputation: 1769

The algorithm you presented is an O(M*N), where N is the length of the text and M is the length of the searched world. Usually, also the libraries implement the naive algorithm. However, there is an algorithm by Knuth, Morrison and Pratt, which does it in a O(M+N) time. See, e.g., Wikipedia Knuth-Morrison-Pratt Algorithm. It has some variations which might be easier to implement like Boyer-Moore-Horsepool.

Upvotes: 2

Neijwiert

Reputation: 1035

Safe version

static size_t findOccurences(const char * const aInput, const char * const aDelim)
{
    if (aInput == 0x0 || aDelim == 0x0)
    {
        throw std::runtime_error("Argument(s) null");
    }

    const size_t inputLength = strlen(aInput);
    const size_t delimLength = strlen(aDelim);

    size_t result = 0;

    if (delimLength <= inputLength && delimLength > 0)
    {
        size_t delimIndex = 0;

        for (size_t inputIndex = 0; inputIndex < inputLength; inputIndex++)
        {
            if (aInput[inputIndex] != aDelim[delimIndex])
            {
                delimIndex = 0;
            }
            else
            {
                delimIndex++;

                if (delimIndex == delimLength)
                {
                    delimIndex = 0;
                    result++;
                }
            }
        }
    }

    return result;
}

Unsafe version

static size_t unsafeFindOccurences(const char * const aInput, const char * const aDelim)
{
    const size_t inputLength = strlen(aInput);
    const size_t delimLength = strlen(aDelim);

    size_t result = 0;
    size_t delimIndex = 0;

    for (size_t inputIndex = 0; inputIndex < inputLength; inputIndex++)
    {
        if (aInput[inputIndex] != aDelim[delimIndex])
        {
            delimIndex = 0;
        }
        else
        {
            delimIndex++;

            if (delimIndex == delimLength)
            {
                delimIndex = 0;
                result++;
            }
        }
    }

    return result;
}

Results safe

          x86        x64
Debug     5501ms     5813ms
Release   3889ms     3998ms

Results unsafe

          x86        x64
Debug     5442ms     5564ms
Release   3074ms     3139ms

Compiled with Visual Studio 2015, Visual Studio 2015 (v140) toolset under Windows 10 x64 Pro.

Using this input. Searching for 'ad' and 1.000.000 iterations.

Upvotes: 1

Humam Helfawi

Reputation: 20311

I try this code in debug mode without any optimization and it took 11 mSec. VS.NET 2013 , Intel Core i7:

int main()
{
    int n;
    int occurrences = 0;
    string::size_type start = 0;
    string base_string, to_find;
    base_string.reserve(200000);
    to_find.reserve(100000);
    for (size_t i = 0; i < 100000; i++){
        base_string.push_back('a');
    }
    for (size_t i = 0; i < 100000; i++){
        base_string.push_back('b');
    }
    for (size_t i = 0; i < 100000; i++){
        to_find.push_back('b');
    }
    auto start_s = clock();
    while ((start = base_string.find(to_find, start)) != string::npos) {
        ++occurrences;
        start++;; // see the note
    }
    auto stop_s = clock();
    std::cout << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000;
    cout << occurrences << endl;
    std::getchar();
}

There is a problem in compiler, configuration, your machine, but in your code.