danikitty
danikitty

Reputation: 31

How can I efficiently search a string for occurrences of words?

Essentially, I have a Set of words, about 250,000 of them and want to be able to return a list of which ones are found in a given string.

eg. input string is 'APPLEASEDITION', I want to return

[APP,APPLE,PLEA, PLEAS,PLEASE,PLEASED,lEA,LEAS,LEASE,LEASED,EA,EAS,EASE,EASED,AS,SEDITION,EDITION,IT,TI,ON]

I came up with this code, which works faster than the method mentioned above for shorter input strings (up to 15 characters), but doubles in execution time with each added letter:

const findWords = (instring, solutions = null) => {
  if (!solutions) solutions = new Set();
  if (!instring) {
    return new Set();
  }
  if (words[instring]) {
    solutions.add(instring);
  }
  const suffix = instring.slice(1);
  const prefix = instring.slice(0, instring.length - 1);

  if (!solutions.has(prefix))
    solutions = new Set([...solutions, ...findWords(prefix, solutions)]);
  if (!solutions.has(suffix))
    solutions = new Set([...solutions, ...findWords(suffix, solutions)]);
  return solutions;
};

Wondering if anyone can help me out optimizing the code?

Edit:

I made a different solution, it works much better

const getAllSubstrings = (str) => {
  let result = [];

  for (let i = 0; i < str.length; i++) {
      for (let j = i + 1; j < str.length + 1; j++) {
          result.push(str.slice(i, j));
      }
  }
  return result;
}

const findWords = (instring) => {
  const solutions = []
  let subs = getAllSubstrings(instring)
  for (let sub of subs) {
    if (words[sub])
      solutions.push(sub)
  }
  return solutions
}

Still open to possible improvements, but this works well enough for my use case

Upvotes: 2

Views: 135

Answers (1)

Max Hudson
Max Hudson

Reputation: 10226

  1. As it stands your logic assumes your input starts or ends with the phrase, but doesn't consider words in the middle - you'll need to generate permutations

  2. Convert your dictionary to a hash where the words are keys - O(n) => O(1) - you can check if possible words are in the dictionary by checking dictionary[possibleWord]

  3. You could convert your array of dictionary words into a binary search tree or a trie - there might be a performance benefit to converting your source text to a collection of BSTs/Tries, where each one represents a possible word/permutation, and then comparing BSTs/Tries rather than strings, but I'm not sure how that'd be faster than string comparison at the moment.

You can limit the length to the max length of a given permutation to the words in your dictionary. You'll end up with a lot of permutations, but possibly less than you have currently.

As the comments state you may want to do this server side for more power/in a language more efficient than JS, or using WASM.

Some javascript libraries that have binary search tree tools:

  1. Alternatively, you might be able to create two hashes (one of permutations, one of dictionary words), or another data structure that's made for creating a "diff" or "overlap", and extract the keys that are in both sets.

Upvotes: 2

Related Questions