Eurig Jones
Eurig Jones

Reputation: 8543

Efficient String text search

I'd like to create a method which searches a small String of text (usually no more than 256 characters) for the existence of any of about 20 different words. If it finds one in the text regardless of case it returns a true.

The method will be executed a quite a bit (not a crazy amount) so it has to be as efficient as possible. What do you suggest would be best here?

The 20 words do not change. They are static. But the text to scan does.

Upvotes: 2

Views: 202

Answers (8)

Eurig Jones
Eurig Jones

Reputation: 8543

Ok. Thanks for answering and commenting everybody. I realise that the question I asked can have broad and varied answers. But this is what I ended up using because the performance was very important so using standard Collections just won't cut the mustard.

I used a "Patricia Trie" structure which is a very powerful and elegant datastructure capable of offering low memory overheads and extremely fast search speeds.

If anyone is interested, there is a video here briefly explaining how a Patricia Trie works. You will realise why it's so performant after watching. Also there is a Java implementation of the data structure on github here.

Upvotes: 0

rossum
rossum

Reputation: 15685

If you want to search for a number of different targets simultaneously, then the Rabin-Karp algorithm is a possibility. If is especially efficient if there are only a few different word lengths in your list of 20 targets. One single pass through the string will find all the matches of a given length.

Upvotes: 1

Michael Cheremuhin
Michael Cheremuhin

Reputation: 1383

You can get all the words to a List, sort it and use Collections.binarySearch(...). You will loose on sorting, but the binarySearch is log(n).

Upvotes: 0

MaVVamaldo
MaVVamaldo

Reputation: 2535

I'd do the following:

String longStr //the string to search into
ArrayList<String> words; //the words to check

Iterator<String> iter = words.iterator();
while(iter.hasNext())
{
    if(longStr.contains(iter.next()))
        return true;    
}
return false;

Upvotes: 0

Joni
Joni

Reputation: 111249

Since the 20 words to search don't change, one of the fastest ways to look for them is compiling a regular expression that matches them and reuse it on different inputs. The complexity of matching a regular expression to a given string is linear to the string length for simple regular expressions that don't require backtracking. In your case the length is bounded, so it's O(1).

Upvotes: 3

fge
fge

Reputation: 121710

Assuming these 20 words are in a Set<String> and all are lowercase, then it is as easy as:

public final boolean containsWord(final String input)
{
    final String s = input.toLowerCase();
    for (final String word: wordSet)
        if (s.indexOf(word) != -1)
            return true;
    return false;
}

Upvotes: 2

devrobf
devrobf

Reputation: 7213

The String class already has lots of methods to do these sorts of things. For example, the indexOf method will solve your problem:

String str = "blahblahtestblah";
int result = str.indexOf("test");

result will contain -1 if the string does not contain the word "test". I'm not sure if this is efficient enough for you but I would start here as it's been implemented already!

Upvotes: 2

&#211;scar L&#243;pez
&#211;scar L&#243;pez

Reputation: 236004

I'd suggest: add all the words in the input text to a Set - it's only 256 characters after all, and adding them is an O(n) operation.

After that you can test each of the 20 or so words for membership using the contains() operation of the Set, which is O(1).

Upvotes: 5

Related Questions