Python fuzzy string search with `regex`

Question

I'm trying to understand fuzzy pattern matching with the `regex` module. What I want: I have a string, and I want to find identical or similar strings in other, perhaps larger, strings. (Does one field in a database record occur, perhaps as a fuzzy substring, in any other field of that record?)

Here's a sample. Comments indicate character positions.

import regex
to_search = "1990 /"
            #123456
            # ^^ ^
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
            #12345678901234567890123456789012
            #                           ^^ ^
m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)

result:

>>> m

>>> m.fuzzy_changes
([], [], [28, 29, 31])

No substitutions, no insertions, 3 deletions at positions 28, 29 and 31. The order "substitutions, insertions, deletions" matters; it's taken from the `regex` docs.

Question: how to interpret this, in normal human language? What it says (I think):

"You have a match from substring 10/ in your search_in, if you delete positions 28, 29 and 31 in it."

I probably got that wrong. This is true tho':

"If you delete positions 5, 3 and 2, in that order, in to_search, you have an exact match at substring 10/ in search_in, yay!"

Fortunately, I found a guru! So I did

>>> import orc
>>> m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
>>> m

>>> near_match = orc.NearMatch.from_regex(m, to_search)
>>> print(near_match)
10/
 I
190/
  I
1990/
    I
1990 /

Hmm... so the order of fuzzy_counts is, in fact, something, something, insertions?

I'd appreciate it if anyone could shed some light on this.

nithinks · Accepted Answer

You are close, but according to the docs you mentioned in the post, this is what is going on here.

import regex
to_search = "1990 /"
            #123456
            # ^^ ^
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
            #12345678901234567890123456789012
            #                           ^^ ^
m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
m

output:

m.fuzzy_changes

output:

([], [], [28, 29, 31])

EXPLANATION

Let's break it down step by step:

The Context:

You're searching for the sequence "1990 /" within the longer text "V CAD-0000:0000[01] ISS 23/10/91", allowing fewer than 4 errors ({e<4}), so the match does not have to be exact.

The Findings:

Match found: the search found the similar sequence "10/" within the longer text.
Position: "10/" occupies 0-based indices 27 to 29 of the longer text, so the match span is (27, 30), with the end exclusive.
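
The span can be checked without the regex module, because "10/" occurs only once in the text. A minimal sketch using plain str.find; the names matched, start and end are only for illustration:

```python
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
matched = "10/"

# str.find returns the 0-based index of the first occurrence
start = search_in.find(matched)
end = start + len(matched)  # exclusive upper bound, like a slice

print((start, end))          # (27, 30)
print(search_in[start:end])  # 10/
```

Note that the position comments in the question count from 1, while these indices count from 0.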

The Analysis:

For an exact match, the longer string would have had to look like this:

V CAD-0000:0000[01] ISS 23/1990 /91

The actual string differs from this presumed original by a few changes.

Changes:

Deletions:
- Locations: Positions 28, 29, and 31 (counting from 0) in the presumed original sequence V CAD-0000:0000[01] ISS 23/1990 /91 were deleted. These are the characters "9", "9" and " " of the pattern.
- Resultant String: After these deletions, the presumed original sequence becomes the actual sequence V CAD-0000:0000[01] ISS 23/10/91.
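
This reading can be checked in plain Python, without the regex module. The sketch below hard-codes the deletion positions reported in m.fuzzy_changes; the names presumed, actual and rebuilt are only for illustration:

```python
presumed = "V CAD-0000:0000[01] ISS 23/1990 /91"
actual = "V CAD-0000:0000[01] ISS 23/10/91"
deletions = [28, 29, 31]  # third list of m.fuzzy_changes, 0-based

# Characters sitting at the deleted positions of the presumed original
deleted_chars = [presumed[i] for i in deletions]

# Drop those positions and compare with the string that was searched
drop = set(deletions)
rebuilt = "".join(ch for i, ch in enumerate(presumed) if i not in drop)

print(deleted_chars)      # ['9', '9', ' ']
print(rebuilt == actual)  # True
```

Read this way, a "deletion" is a pattern character that is missing from the searched text, which is why the positions only line up once the missing characters are put back.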