Reputation: 15488
Let's say I have the following text:
const input = `Hello world!
This is a random text.
I don't know...
nd ths lines got som spelling mitsakes`
And I want to find the position of I don't know...
. In that case, I could just just do this:
const position = input.indexOf(`I don't know...`)
What I'm having issues with is finding a substring that doesn't perfectly match. This for example wouldn't yield a match:
const position = input.indexOf(`And this line's got some spelling mistakes.`)
Now what I'm looking for is a way to find a given substring and allow it to differ by a given percentage.
For example:
// Find position that matches by at least 50%:
const position = indexOfSimilar(input, `And this line's got some spelling mistakes.`, 0.5)
// Find position that matches by at least 99%:
const position = indexOfSimilar(input, `And this line's got some spelling mistakes.`, 0.99)
Any ideas how this could be implemented?
Upvotes: 0
Views: 323
Reputation: 3856
As some commenters mentioned about the Levenshtein distance, it could probably be a starting point, by implementing an algorithm based on that. A simpler solution would be to use a sliding window and find the matching substring and its index by the weight of matching pattern in the input string, for example:
const input = `Hello world!
This is a random text.
I don't know...
nd ths lines got som spelling mitsakes`
const match = (input, pattern) => {
const ids = new Map;
for (let i = 0; i < input.length; i++) {
const s = input.slice(i, i + pattern.length);
const n = [...s].reduce((a, c, i) => (a += (c === pattern[i] ? c.charCodeAt(0) * (i + 1) : 0)), 0);
ids.set(i, n);
}
const matches = [...ids].sort(([, n1], [, n2]) => n2 - n1);
const [id, n] = matches[0];
const max = [...pattern].reduce((a, c, i) => a += c.charCodeAt(0) * (i + 1), 0);
const weight = Math.floor(n / max * 100);
return {weight, id, slice: input.slice(id, id + pattern.length)};
};
console.log(match(input, `qwerty uiop`));
console.log(match(input, `This is a random text.`));
console.log(match(input, `And this line's got some spelling mistakes.`));
Sure, the results of this simple solution heavily depends on the length of pattern and matched substring. The Levenshtein distance solution returns more accurate results:
const input = `Hello world!
This is a random text.
I don't know...
nd ths lines got som spelling mitsakes`
const levenshtein = (a, b) => {
const la = a.length, lb = b.length,
aa = [...a, 0], ab = [...b, 0],
map = ab.map(() => [...aa]);
aa.forEach((v, i) => map[0][i] = i);
ab.forEach((v, i) => map[i][0] = i);
for (let j = 1; j <= lb; j++) {
for (let i = 1; i <= la; i++) {
const n = a[i - 1] === b[j - 1] ? 0 : 1;
map[j][i] = Math.min(
map[j][i - 1] + 1, map[j - 1][i] + 1, map[j - 1][i - 1] + n
);
}
}
return map[lb][la];
};
const match = (input, pattern) => {
const ids = new Map, lp = pattern.length;
for (let i = 0; i < input.length; i++) {
const s = input.slice(i, i + lp);
const n = levenshtein(s, pattern);
ids.set(i, n);
}
const matches = [...ids].sort(([, n1], [, n2]) => n1 - n2);
const [id, n] = matches[0];
const weight = Math.floor((lp - n) / lp * 100);
return {weight, id, slice: input.slice(id, id + lp)};
};
console.log(match(input, `qwerty uiop`));
console.log(match(input, `This is a random text.`));
console.log(match(input, `And this line's got some spelling mistakes.`));
Upvotes: 1