AdityaReddy
AdityaReddy

Reputation: 3645

Algorithm for sub-string search in a large String

I am just checking for an efficient algorithm with best computational complexity to check if a child string - tobeVerified exists in a huge parent string

I was going through different algorithms but I am yet to find something which offers O(n)

I came up with the below implementation using HashSet which is gives me O(n+m) ~ O(n)

I wanted to check if this is the right way of doing it or if any other optimization is possible. But in this approach there is problem of consuming more space

String parent = "the value is very high";
    String tobeVerified = "is";
    Set wordSet = new HashSet<String>();    
    String[] words = parent.trim().toUpperCase().split("\\s+");
    //This is O(n) n - Parent Size  m - substring size
    for(String word: words){
        wordSet.add(word);      
    }
    //This is O(1)
    System.out.println(wordSet.contains(tobeVerified.toUpperCase()));
    }

Upvotes: 1

Views: 3200

Answers (2)

selbie
selbie

Reputation: 104474

One of the classic O(n+m) substring search algorithms is Boyer-Moore. It should have better performance than String.contains or String.indexOf for sufficiently large strings.

There's a Java implementation of the algorithm on that wikipedia page link above, but it's written to use a char[] array as input instead of on an instance of the String class. So either modify the code to work with a String parameter , or take into account the additional cost, O(n), of cloning a String into a char[].

One little issue I spotted on the wikipedia code. It assumes character values are only in the 8-bit range. You might need to modify this line:

final int ALPHABET_SIZE = 256;

To be this:

final int ALPHABET_SIZE = 65536;

Update: I updated the wikipedia page code appropriately to have the correct value for ALPHABET_SIZE. Confirmed the original bug existed and wrote a unit test to validate the fix.

Upvotes: 3

Andreas
Andreas

Reputation: 159086

You could go for a Boyer-Moore implementation, as suggested in answer by selbie, if profiling shows that you truly have a performance issue.

Until then, just do a simple regex search:

String textToSearch = "the value is very high";
String wordToFind = "is";
String regex = "(?i)\\b" + Pattern.quote(wordToFind) + "\\b";
boolean found = Pattern.compile(regex).matcher(textToSearch).find();

(?i) makes the search case-insensitive, and \\b matches word-boundary, e.g. ensuring that is will not match this. Since you're doing word-search, Pattern.quote() is likely unnecessary, but better be safe than sorry.

Upvotes: 1

Related Questions