myrocks2
myrocks2

Reputation: 305

Faster String Matching/Iteration Method?

In the program I'm currently working on, there's one part that's taking a bit long. Basically, I have a list of Strings and one target phrase. As an example, let's say the target phrase is "inventory of finished goods". Now, after filtering out the stop word (of), I want to extract all Strings from the list that contains one of the three words: "inventory", "finished", and "goods". Right now, I implemented the idea as follows:

String[] targetWords; // contains "inventory", "finished", and "goods"
ArrayList<String> extractedStrings = new ArrayList<String>();

for (int i = 0; i < listOfWords.size(); i++) {
    String[] words = listOfWords.get(i).split(" ");
    outerloop:
    for (int j = 0; j < words.length; j++) {
        for (int k = 0; k < targetWords.length; k++) {
            if (words[j].equalsIgnoreCase(targetWords[k])) {
                extractedStrings.add(listOfWords.get(i));
                break outerloop;
            }
        }
    }
}

The list contains over 100k words, and with this it takes rounghly .4 to .8 seconds to complete the task for each target phrase. The things is, I have a lot of these target phrases to process, and the seconds really add up. Thus, I was wondering if anyone knew of a more efficient way to complete this task? Thanks for the help in advance!

Upvotes: 6

Views: 214

Answers (5)

nd.
nd.

Reputation: 8942

You are passing trough each of the elements from targetWords, instead of checking for all words from targetWords simultaneously. In addition, you are splitting your list of words in each iteration without really needing it, creating overhead.

I would suggest that you combine your targetWords into one (compiled) regular expression:

(?xi)  # turn on comments, use case insensitive matching
\b     # word boundary, i.e. start/end of string, whitespace
(      # begin of group containing 'inventory' or 'finished' or 'goods'
 inventory|finished|goods  # bar separates alternatives
)      # end of group
\b     # word boundary

Don't forget to double-quote the backspaces in your regular expression string.

import java.util.regex.*;
...
Pattern targetPattern = Pattern.compile("(?xi)\\b(inventory|finished|goods)\\b");
for (String singleString : listOfWords) {
  if (targetPattern.matcher(singleString).find()) {
    extractedStrings.add(singleString);
  }
}

If you are not satisfied with the speed of regular expressions - although regular expression engines are usually optimized for performance - you need to roll your own high-speed multi-string search. The Aho–Corasick string matching algorithm is optimized for searching several fixed strings in text, but of course implementing this algorithm is quite some effort compared with simply creating a Pattern.

Upvotes: 1

denov
denov

Reputation: 12726

I'm a little confused to if you want the whole phrase or just single words from listOfWords. If you are trying to get the string from listOfWords if one of your target words is in the string this should work for you.

    String[] targetWords= new String[]{"inventory", "finished", "goods"};
    List<String> listOfWords = new ArrayList<String>();

    // build lookup map
    Map<String, ArrayList<String>> lookupMap = new HashMap<String, ArrayList<String>>();
    for(String words : listOfWords) {
        for(String word : words.split(" ")) {
            if(lookupMap.get(word) == null) lookupMap.put(word, new ArrayList<String>());
            lookupMap.get(word).add(words);
        }
    }

    // find phrases
    Set<String> extractedStrings = new HashSet<String>();
    for(String target : targetWords) {
        if(lookupMap.containsKey(target)) extractedStrings.addAll(lookupMap.get(target));
    }

Upvotes: 1

MattR
MattR

Reputation: 7048

Your list of 100k words could be added (once) to a HashSet. Rather than iterating through your list, use wordSet.contains() - a HashSet gives constant-time performance for this, so not affected by the size of the list.

Upvotes: 6

Ralph Caraveo
Ralph Caraveo

Reputation: 10225

You can take your giant list of words and add them to a hash map and then when your phrase comes in, just loop over the words in your phrase and check against the hash map. Currently you are doing a linear search and what I'm proposing would cut it down to a constant time search.

The key is minimizing lookups. Using this technique you would be effectively indexing your giant list of words for fast lookups.

Upvotes: 2

Alex
Alex

Reputation: 11579

I would try to implement it with ExecutorService to parallelize search for each word. http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html

For example with fixed thread pool size:

Executors.newFixedThreadPool(20);

Upvotes: 0

Related Questions