Reputation: 83177
Goal:
How can I do so in Python?
Example:
long_string = """1. Bob likes classical music very much.
2. This is classic music!
3. This is a classic musical. It has a lot of classical musics.
"""
query_string = "classical music"
I'd like the Python code to find "classical music" and possibly "classic music", "classic musical" and "classical musics" depending on the string matching threshold I set.
Research: I found Checking fuzzy/approximate substring existing in a longer string, in Python? but the question focuses on the best match only (i.e., not all occurrences) and answers either also focuses on the best match or don't work on multi-word query strings (since the question only had a single-word query strings, or return some incorrect score (doesn't get a perfect score even for an exact match).
Upvotes: 2
Views: 235
Reputation: 5531
Here is a simple class that uses the regex
package. You provide a query with one or more space separated terms in the format: "term{limit}"
(exs: "home{5}"
, "class{4} music{4}"
).
{x}
:limit
non-whitespace characters may appear where the limit is placed"{1}ward{4}"
){x}
(ex: "~ward~~~~"
)import regex
from typing import Iterator
# alias long-winded types
Iter = list|tuple|set
Flags = None|regex.RegexFlag
Match = regex.Match
__all__ = ('FuzzBiz', )
"""
https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109
"""
class FuzzBiz:
CHAR = '~' # substitution and deletion character
# `.sub` replacement method
@staticmethod
def _repl(match:Match) -> str:
return FuzzBiz.CHAR * int(match.group(1))
# format the query for fuzzy matching
@staticmethod
def _term(query:str) -> str:
# where x is an integer, sub "{x}" with x amount of CHAR characters
term = regex.sub(r'\{(\d+)\}', FuzzBiz._repl, query)
limit = term.count(FuzzBiz.CHAR) # determine limit
# return formatted term
return f'({term}){{{limit+1}i+1d+1s<={limit}:\S}}'
""" find all instances that match a singular search
`text` : the text to be searched
`query`: the term to search for
`skip` : Iter of words and/or characters that trigger a skip when found in a result - DEFAULT: []
`flags`: regex flags to apply to this expression - DEFAULT: regex.VERSION1
"""
@staticmethod
def finditer(text:str, query:str, skip:None|Iter=None, flags:Flags=None) -> Iterator:
# default optional arguments
flags = flags or regex.VERSION1
skip = skip or []
# process all query terms and format to pattern
expr = r'\s'.join(map(FuzzBiz._term, query.split(' ')))
expr = fr'\m(?P<result>{expr})\M'
for m in regex.finditer(expr, text, flags=flags):
result = m['result']
# determine if result should be skipped
for item in skip:
if item in result: break
else:
yield m.span(), result
""" make multiple consecutive searches
`text` : the text to be searched
`queries` : Iter of search terms
`*args|**kwargs` : passed to `.finditer` as remaining args
"""
@staticmethod
def iterall(text:str, queries:Iter, *args, **kwargs) -> Iterator:
# query is managed internally
if kwargs.get('query'): del kwargs['query']
for query in queries:
q = query
for span, match in FuzzBiz.finditer(text, query, *args, **kwargs):
yield q, span, match
q = None # only yield `query` the first time it is encountered
.iterall
testdata = """
I headed homeward to meet with the Wardens.
When I arrived, I was greeted by a homely man that told me the homestead was awarded 5 million dollars.
We intend to use some of the homage to create a homeless ward.
The first piece of furniture will be my late-friend Homer's wardrobe.
"""
queries = 'home{5}', '~ward~~~', 'home{4} ward{4}', 'th~~', '{2}e'
for term, span, match in FuzzBiz.iterall(data, queries, flags=regex.I):
if term: print(f'\n{term.upper()}')
print(f' {match}')
HOME{5}
homeward
homely
homestead
homeless
Homer's
~WARD~~~
Wardens
awarded
ward
HOME{4} WARD{4}
homeless ward
Homer's wardrobe
TH~~
the
that
the
the
The
{2}E
the
me
the
We
use
the
The
be
.finditer
testdata = """
I would classify music as one of my favorite hobbies.
I love classical music played by classy musicians for a classic musical.
Beethoven can not be out-classed, music-wise - a man of class, musically gifted.
"""
query = 'class{4} music{5}'
print(f'\n{query.upper()}')
for span, match in FuzzBiz.finditer(data, query, skip=('classify', ',')):
print(f' {match}')
CLASS{4} MUSIC{5}
classical music
classy musicians
classic musical
Upvotes: 1
Reputation: 83177
Reddit user throwaway6560192 pointed me to a good solution: use fuzzysearch.find_near_matches()
.
import pprint
from fuzzysearch import find_near_matches
long_string = """1. Bob likes classical music very much.
2. This is classic music!
3. This is a classic musical. It has a lot of classical musics.
"""
query_string = "classical music"
pprint.pprint(find_near_matches(query_string, long_string, max_l_dist=5))
outputs:
[Match(start=13, end=28, dist=0, matched='classical music'),
Match(start=51, end=64, dist=2, matched='classic music'),
Match(start=79, end=92, dist=2, matched='classic music'),
Match(start=112, end=127, dist=0, matched='classical music')]
And here's is my weak solution with the regex
lib so far:
import regex
long_string = """1. Bob likes classical music very much.
2. This is classic music!
3. This is a classic musical. It has a lot of classical musics.
"""
query_string = "classical music"
threshold = 5
results = regex.finditer(r'(classical music){e<5}', long_string, flags=regex.IGNORECASE)
for result in results:
print(result)
Output:
<regex.Match object; span=(9, 28), match='kes classical music', fuzzy_counts=(0, 4, 0)>
<regex.Match object; span=(49, 64), match='s classic music', fuzzy_counts=(0, 2, 2)>
<regex.Match object; span=(77, 92), match='a classic music', fuzzy_counts=(0, 2, 2)>
<regex.Match object; span=(108, 127), match=' of classical music', fuzzy_counts=(0, 4, 0)>
2 weaknesses of that solution with the regex
lib:
query_string
and threshold
but instead hardcode them in the regex query.Upvotes: 3