Reputation: 3862
Trying to understand fuzzy pattern matching with regex. What I want: I have a string, and I want to find identical or similar strings in other, perhaps larger strings. (Does one field in a database record occur, perhaps as a fuzzy substring, in any other field in that database record?)
Here's a sample. Comments indicate character positions.
import regex
to_search = "1990 /"
#            123456
#             ^^ ^
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
#            12345678901234567890123456789012
#                                       ^^ ^
m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
result:
>>> m
<regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
>>> m.fuzzy_changes
([], [], [28, 29, 31])
No substitutions, no insertions, 3 deletions at positions 28, 29 and 31. The order "substitutions, insertions, deletions" matters; it's taken from the regex module docs.
Question: how do I interpret this in plain human language? What it says (I think):
"You have a match at substring 10/ in your search_in, if you delete positions 28, 29 and 31 in it."
I probably got that wrong. This is true tho':
"If you delete positions 5, 3 and 2, in that order, in to_search, you have an exact match at substring 10/ in search_in, yay!"
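That second reading is easy to check with plain string slicing. A minimal sketch, assuming the positions are 1-based as in the comments above; `steps` is just an illustrative name and nothing here uses regex:

```python
to_search = "1990 /"

# Delete 1-based positions 5, 3 and 2 in that order (highest first,
# so the remaining positions do not shift before they are used).
steps = [to_search]
for pos in (5, 3, 2):
    current = steps[-1]
    steps.append(current[:pos - 1] + current[pos:])

for step in steps:
    print(repr(step))
```

The last entry printed is '10/', the substring reported in m.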
Fortunately, I found a guru! So I did
>>> import orc
>>> m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
>>> m
<regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
>>> near_match = orc.NearMatch.from_regex(m, to_search)
>>> print(near_match)
10/
I
190/
I
1990/
I
1990 /
Hmm... so the order of fuzzy_counts is, in fact, something, something, insertions?
I'd appreciate it if anyone could shed some light on this.
Upvotes: 0
Views: 120
Reputation: 370
You are close, but according to the docs you mentioned in the post, this is what is going on here.
import regex
to_search = "1990 /"
#            123456
#             ^^ ^
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
#            12345678901234567890123456789012
#                                       ^^ ^
m = regex.search(f'({to_search}){{e<4}}', search_in, regex.BESTMATCH)
m
output:
<regex.Match object; span=(27, 30), match='10/', fuzzy_counts=(0, 0, 3)>
m.fuzzy_changes
output:
([], [], [28, 29, 31])
Let's break it down step by step:
You're searching for the sequence "1990 /", allowing up to three errors ({e<4}), within the longer text "V CAD-0000:0000[01] ISS 23/10/91".
For an exact match, the longer string would have had to be:
V CAD-0000:0000[01] ISS 23/1990 /91
However, a few changes were made to that string to get the actual string: the characters at positions 28, 29 and 31 (counting from 0), namely 9, 9 and the space, were deleted, which leaves:
V CAD-0000:0000[01] ISS 23/10/91
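The same bookkeeping as a quick sketch in plain Python. It assumes the positions in fuzzy_changes are 0-based indexes into the "would have been exact" string, which is how the numbers in this example line up; `delete_positions` is a made-up helper, not part of regex:

```python
def delete_positions(text, positions):
    """Return text with the characters at the given 0-based indexes removed."""
    drop = set(positions)
    return "".join(ch for i, ch in enumerate(text) if i not in drop)


would_have_matched = "V CAD-0000:0000[01] ISS 23/1990 /91"
search_in = "V CAD-0000:0000[01] ISS 23/10/91"
deletions = [28, 29, 31]  # third element of m.fuzzy_changes from the post

# The characters that sit at those positions: '9', '9' and ' '.
dropped = "".join(would_have_matched[i] for i in deletions)
print(repr(dropped))

# Removing them turns the hypothetical string into the real one.
print(delete_positions(would_have_matched, deletions) == search_in)
```

This shortcut only covers deletions; insertions and substitutions would need their own handling.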
Upvotes: 1