Daniel
Daniel

Reputation: 6039

fuzzy .substring text-matching function

I am looking for a way to fuzzy substring function. What do I mean by that:

Example 1:

This is an exact match and should have score 1.0.

Example 2:

This is a fuzzy match, as "weed" and "destroyed" appear in the text, but without "will be". Still it should get a high-score (say 0.8).

Example 3:

If we set the "Short" to "destroyed will be weeds", although "destroyed" and "weeds" both appear in the original text, the score should be very low, since their order has changed.

Any suggested implementation on this?

A final point is that, there is no unique way of doing this scoring. But I am looking for AN algorithm. The parameters of this algorithm can be tuned based on the needs and requirements.

Upvotes: 5

Views: 362

Answers (2)

dveim
dveim

Reputation: 3482

I'd split both strings in dependency tree (something like this). Then, recursively traversed smaller tree from root and checked whether token is present in bigger tree. If yes, then add score similarity_of_dependency_kind. Optionally, that can be multiplied by similarity_of_destination_words (in terms of synonymicity, something like wordnet).

This approach is less efficient, but more accurate.

Also, don't forget preliminary data cleaning, like typos correction.

Upvotes: 2

Patrick Parker
Patrick Parker

Reputation: 4959

Here's one possible approach:

  1. for the first word short(0), store the first indexOf in long
  2. for each subsequent word short(n), store both of: a) the first indexOf in long, and b) (preferred) the first indexOf short(n) which occurs no later than the preferred indexOf short(n-1).
  3. score accordingly

Upvotes: 2

Related Questions