Reputation: 1251

How can I check for a reoccurring pattern in a string without knowing the actual pattern?

For example, I have a string, "fbrtfuifigfbrt". I want to find whether a sequence of characters reoccurs in the string, but I don't know in advance what that sequence is. In this case, it is "fbrt".

I thought about breaking the string into a bunch of individual words and then checking if the words are the same, but that quickly becomes inefficient when parsing a longer string.

For now, I implemented the above idea, but surely there's a better idea.

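// requires: import java.util.ArrayList;
String s = "fbrtfuifigfbrt";
ArrayList<String> words = new ArrayList<String>(s.length() * s.length());

// collect every non-empty substring s[outerLoop, nestedLoop)
for(int outerLoop = 0; outerLoop < s.length(); outerLoop++){
    // nestedLoop is the exclusive end index, so it has to start past outerLoop
    for(int nestedLoop = outerLoop + 1; nestedLoop <= s.length(); nestedLoop++){
        words.add(s.substring(outerLoop, nestedLoop));
    }
}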
//I could dump the ArrayList in a HashSet and check if they are the same size, 
//then find those elements, etc. 
//but that goes along with the above code, and I would prefer to use a more efficient method

Upvotes: 2

Answers (3)

walen

Reputation: 7273

Working solution in Java:

import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        String test1 = "fbrtfuifigfbrt";
        String test2 = "abcdabcd";
        String test3 = "fbrtxibrjkfbrt";
        System.out.println(findRepetitions(test1));
        System.out.println(findRepetitions(test2));
        System.out.println(findRepetitions(test3));
    }

    private static List<String> findRepetitions(String string) {
        List<String> patternsList = new ArrayList<>();
        int length = string.length();
        for (int i = 0; i < length; i++) { // try every start position
            int limit = (length - i) / 2; // candidates can't be longer than half the remaining length
            for (int j = 1; j <= limit; j++) {
                int candidateEndIndex = i + j;
                String candidate = string.substring(i, candidateEndIndex);
                if (string.substring(candidateEndIndex).contains(candidate)) {
                    patternsList.add(candidate);
                }
            }
        }
        return patternsList;
    }
}

Output:

[f, fb, fbr, fbrt, b, br, brt, r, rt, t, f, i, f]
[a, ab, abc, abcd, b, bc, bcd, c, cd, d]
[f, fb, fbr, fbrt, b, br, brt, r, rt, t, b, br, r]

As others already said, there's no easy optimization for this if you don't know the length of the pattern or any other applicable restriction.

If you wanted to naively discard subpatterns like f, fb, fbr, which are only being counted because they are substrings of the longest pattern fbrt, you could make the inner for loop count downwards, from limit down to 1. That way you find longer patterns first, and you can check whether each later candidate is a substring of an already found pattern before adding it to the list. Like this:

import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        String test1 = "fbrtfuifigfbrt";
        String test2 = "abcdabcd";
        String test3 = "fbrtxibrjkfbrt"; // "br" is a pattern but this version won't find it
        System.out.println(findRepetitions(test1));
        System.out.println(findRepetitions(test2));
        System.out.println(findRepetitions(test3));
    }

    private static List<String> findRepetitions(String string) {
        List<String> patternsList = new ArrayList<>();
        int length = string.length();
        for (int i = 0; i < length; i++) { // try every start position
            int limit = (length - i) / 2; // candidates can't be longer than half the remaining length
            for (int j = limit; j >= 1; j--) {
                int candidateEndIndex = i + j;
                String candidate = string.substring(i, candidateEndIndex);
                if (string.substring(candidateEndIndex).contains(candidate)) {
                    boolean notASubpattern = true;
                    for (String pattern : patternsList) {
                        if (pattern.contains(candidate)) {
                            notASubpattern = false;
                            break;
                        }
                    }
                    if (notASubpattern) {
                        patternsList.add(candidate);
                    }
                }
            }
        }
        return patternsList;
    }
}

This, however, would prevent you from finding br in fbrtxibrjkfbrt, as shown by the output (and it'd make the algorithm slower for strings with a lot of different patterns, too):

[fbrt, i]
[abcd]
[fbrt]

Hence the "naively" part. Of course, you could include more inner loops to make sure to-be-discarded candidates aren't found "on their own" in the original string before actually discarding them, etc. It depends on how exhaustive you want your search to be.
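One way to approximate that extra check, as a rough sketch rather than a definitive fix: before discarding a candidate, compare how often it occurs in the whole string with how many of those occurrences an already found pattern can account for. The class name RepetitionFilter and the helper countOccurrences are made up for this illustration, and the counting rule is a heuristic that assumes occurrences of the longer pattern do not overlap each other.

```java
import java.util.ArrayList;
import java.util.List;

class RepetitionFilter {
    // Hypothetical helper: counts (possibly overlapping) occurrences of needle in haystack.
    static int countOccurrences(String haystack, String needle) {
        int count = 0;
        for (int at = haystack.indexOf(needle); at >= 0; at = haystack.indexOf(needle, at + 1)) {
            count++;
        }
        return count;
    }

    static List<String> findRepetitions(String string) {
        List<String> patterns = new ArrayList<>();
        int length = string.length();
        for (int i = 0; i < length; i++) {
            int limit = (length - i) / 2;
            for (int j = limit; j >= 1; j--) { // longest candidates first
                String candidate = string.substring(i, i + j);
                if (!string.substring(i + j).contains(candidate)) {
                    continue; // candidate does not repeat later in the string
                }
                boolean explained = false;
                for (String pattern : patterns) {
                    int inside = countOccurrences(pattern, candidate);
                    // Heuristic: the candidate is "explained" when an already found
                    // pattern accounts for every occurrence of it in the string.
                    if (inside > 0 && countOccurrences(string, candidate)
                            <= inside * countOccurrences(string, pattern)) {
                        explained = true;
                        break;
                    }
                }
                if (!explained) {
                    patterns.add(candidate);
                }
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        System.out.println(findRepetitions("fbrtfuifigfbrt"));
        System.out.println(findRepetitions("abcdabcd"));
        System.out.println(findRepetitions("fbrtxibrjkfbrt"));
    }
}
```

With this check, br is kept for fbrtxibrjkfbrt because it occurs three times while fbrt only accounts for two of those. It also keeps f for fbrtfuifigfbrt, since f shows up outside of fbrt as well. The price is even more substring scanning per candidate.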

Upvotes: 1

Xiangyu Li

Reputation: 71

You need two iterators: a global iterator that walks over the entire string, and a local iterator that serves as the search pointer. Suppose the global iterator points to the char "f" in your example. We need to find every position of "f" after the global iterator. For each "f" found there, we compare characters one by one after both the global iterator and the local iterator (think of this as two pointers moving at the same speed until they point to different chars). Once the local iterator reaches the end of the string, you move the global iterator forward by one character (yes, you need to do this n times if your string has n characters).

I'm sorry that the code is in C++ but the logic is the same in Java.

Update: There is another way to perform the task. One popular solution is to use a suffix tree to store your text. You can then search the suffix tree for any given substring to find its occurrences in the whole text. Building the tree is O(n), and the cost of searching for a substring depends on the length of that substring and on the size of your alphabet, which is 26 if you are using only English letters. So if you want to find all reoccurring patterns, you only need to search for each substring of the given text, and there are only O(n^2) of those. That gives the suffix tree an overall advantage over the algorithm I propose. But if you don't need the performance, my algorithm will suit your needs, since it is simple and easy to implement.
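To give a feel for the suffix-based idea without building a full suffix tree, here is a minimal Java sketch of a simpler relative: sort all suffixes and compare neighbours. Any repeated substring is a common prefix of two suffixes, and the longest one shows up between two adjacent suffixes in sorted order. This is not a suffix tree and not O(n); it is a naive sorted-suffix version, and the names LongestRepeat and longestRepeatedSubstring are made up for the example. It also allows the two occurrences to overlap.

```java
import java.util.Arrays;

class LongestRepeat {
    // Returns the longest substring that occurs at least twice in s ("" if none).
    static String longestRepeatedSubstring(String s) {
        int n = s.length();
        Integer[] starts = new Integer[n];
        for (int i = 0; i < n; i++) {
            starts[i] = i;
        }
        // Sort suffix start positions by the text of the suffix (naive comparison).
        Arrays.sort(starts, (x, y) -> s.substring(x).compareTo(s.substring(y)));

        String best = "";
        for (int k = 1; k < n; k++) {
            int a = starts[k - 1];
            int b = starts[k];
            // Length of the common prefix of two neighbouring suffixes.
            int lcp = 0;
            while (a + lcp < n && b + lcp < n && s.charAt(a + lcp) == s.charAt(b + lcp)) {
                lcp++;
            }
            if (lcp > best.length()) {
                best = s.substring(a, a + lcp);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRepeatedSubstring("fbrtfuifigfbrt")); // prints fbrt
        System.out.println(longestRepeatedSubstring("abcdabcd"));       // prints abcd
    }
}
```

This only reports the single longest repeat. Collecting every common prefix between neighbours instead of just the best one would give a list of repeated substrings.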

#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main(int argc, const char * argv[]) {
    string s = "sdfssdddfssss";
    int pairCount = 0;
    vector<string> rep;
    for (int i = 0; i < s.length(); i++)
    {
        vector<int> idx;
        //find every index after i that holds the same char as s[i]
        //Note: You can optimize this by creating a map of index of 26 letters.
        for (int j = i+1; j < s.length(); j++)
            if (s[i] == s[j]) idx.push_back(j);
        for (int j = 0; j < idx.size(); j++)
        {
            int offset = 0;
            //advance both pointers together; stop at the end of s or at the first mismatch
            while (idx[j]+offset < (int)s.length() && s[i+offset] == s[idx[j]+offset])
            {
                cout << "Pair found! " << s.substr(i, offset+1) << " " << i << " " << idx[j] << " " << offset + 1 << endl;
                pairCount++;
                offset++;
            }
        }
    }
    cout << "Pair count: " << pairCount;
    return 0;
}
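
Since the question is about Java, here is a rough Java rendering of the same two-pointer logic. It is a sketch, not a line-by-line port: the class name PairScan and the method findPairs are invented for the example, and it collects the pairs in a list instead of printing them. Like the C++ version, it lets the two occurrences overlap.

```java
import java.util.ArrayList;
import java.util.List;

class PairScan {
    // Returns one entry per repeated-substring pair, formatted as
    // "substring firstIndex secondIndex".
    static List<String> findPairs(String s) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {           // global iterator
            for (int j = i + 1; j < s.length(); j++) {   // local iterator
                if (s.charAt(i) != s.charAt(j)) {
                    continue; // only start comparing where the first chars match
                }
                int offset = 0;
                // Move both pointers at the same speed until they disagree
                // or the local one runs off the end of the string.
                while (j + offset < s.length()
                        && s.charAt(i + offset) == s.charAt(j + offset)) {
                    offset++;
                    pairs.add(s.substring(i, i + offset) + " " + i + " " + j);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> pairs = findPairs("sdfssdddfssss");
        for (String pair : pairs) {
            System.out.println("Pair found! " + pair);
        }
        System.out.println("Pair count: " + pairs.size());
    }
}
```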

Upvotes: 0

nhouser9

Reputation: 6780

There isn't a good optimization for this. You are going to end up with some kind of a brute force solution.

Something like:

String myString = "abcabcbbb";
//for each char
for (int i = 0; i < myString.length(); i++) {
    //for each substring starting with that char
    int maxSubStringLen = Math.floorDiv(myString.length() - i, 2);
    for (int j = 1; j <= maxSubStringLen; j++) {
        //get the substring
        String subString = myString.substring(i, i + j);
        int repetitionIndex = i + j;
        String repetition = myString.substring(repetitionIndex, repetitionIndex + subString.length());

        //does the substring repeat?
        if (subString.equals(repetition)) {
            System.out.println(subString);
        }
    }
}

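This simply prints all substrings that match. Note that it only reports a substring when the repeat follows it immediately, as "abc" does in "abcabcbbb". You can replace the print statement with whatever you actually want to do with them.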

Upvotes: 1
