Forivin
Forivin

Reputation: 15488

Find position of substring that almost matches

Let's say I have the following text:

const input = `Hello world!

This is a random text.
I don't know...
nd ths lines got som spelling mitsakes`

And I want to find the position of I don't know.... In that case, I could just just do this:

const position = input.indexOf(`I don't know...`)

What I'm having issues with is finding a substring that doesn't perfectly match. This for example wouldn't yield a match:

const position = input.indexOf(`And this line's got some spelling mistakes.`)

Now what I'm looking for is a way to find a given substring and allow it to differ by a given percentage.

For example:

// Find position that matches by at least 50%:
const position = indexOfSimilar(input, `And this line's got some spelling mistakes.`, 0.5)

// Find position that matches by at least 99%:
const position = indexOfSimilar(input, `And this line's got some spelling mistakes.`, 0.99)

Any ideas how this could be implemented?

Upvotes: 0

Views: 323

Answers (1)

n--
n--

Reputation: 3856

As some commenters mentioned about the Levenshtein distance, it could probably be a starting point, by implementing an algorithm based on that. A simpler solution would be to use a sliding window and find the matching substring and its index by the weight of matching pattern in the input string, for example:

const input = `Hello world!

This is a random text.
I don't know...
nd ths lines got som spelling mitsakes`

const match = (input, pattern) => {
  const ids = new Map;
  for (let i = 0; i < input.length; i++) {
    const s = input.slice(i, i + pattern.length);
    const n = [...s].reduce((a, c, i) => (a += (c === pattern[i] ? c.charCodeAt(0) * (i + 1) : 0)), 0);
    ids.set(i, n);
  }
  const matches = [...ids].sort(([, n1], [, n2]) => n2 - n1);
  const [id, n] = matches[0];
  const max = [...pattern].reduce((a, c, i) => a += c.charCodeAt(0) * (i + 1), 0);
  const weight = Math.floor(n / max * 100);
  return {weight, id, slice: input.slice(id, id + pattern.length)};
};

console.log(match(input, `qwerty uiop`));
console.log(match(input, `This is a random text.`));
console.log(match(input, `And this line's got some spelling mistakes.`));

Sure, the results of this simple solution heavily depends on the length of pattern and matched substring. The Levenshtein distance solution returns more accurate results:

const input = `Hello world!

This is a random text.
I don't know...
nd ths lines got som spelling mitsakes`

const levenshtein = (a, b) => {
   const la = a.length, lb = b.length,
   aa = [...a, 0], ab = [...b, 0],
   map = ab.map(() => [...aa]);
   aa.forEach((v, i) => map[0][i] = i);
   ab.forEach((v, i) => map[i][0] = i);
   for (let j = 1; j <= lb; j++) {
      for (let i = 1; i <= la; i++) {
         const n = a[i - 1] === b[j - 1] ? 0 : 1;
         map[j][i] = Math.min(
            map[j][i - 1] + 1, map[j - 1][i] + 1, map[j - 1][i - 1] + n
         );
      }
   }
   return map[lb][la];
};

const match = (input, pattern) => {
  const ids = new Map, lp = pattern.length;
  for (let i = 0; i < input.length; i++) {
    const s = input.slice(i, i + lp);
    const n = levenshtein(s, pattern);
    ids.set(i, n);
  }
  const matches = [...ids].sort(([, n1], [, n2]) => n1 - n2);
  const [id, n] = matches[0];
  const weight = Math.floor((lp - n) / lp * 100);
  return {weight, id, slice: input.slice(id, id + lp)};
};

console.log(match(input, `qwerty uiop`));
console.log(match(input, `This is a random text.`));
console.log(match(input, `And this line's got some spelling mistakes.`));

Upvotes: 1

Related Questions