Franck Dernoncourt
Franck Dernoncourt

Reputation: 83177

How can I find all exact occurrences of a string, or close matches of it, in a longer string in Python?

Goal:

How can I do so in Python?


Example:

long_string = """1. Bob likes classical music very much.
2. This is classic music!
3. This is a classic musical. It has a lot of classical musics.
"""

query_string = "classical music"

I'd like the Python code to find "classical music" and possibly "classic music", "classic musical" and "classical musics" depending on the string matching threshold I set.


Research: I found Checking fuzzy/approximate substring existing in a longer string, in Python? but the question focuses on the best match only (i.e., not all occurrences) and answers either also focuses on the best match or don't work on multi-word query strings (since the question only had a single-word query strings, or return some incorrect score (doesn't get a perfect score even for an exact match).

Upvotes: 2

Views: 235

Answers (2)

OysterShucker
OysterShucker

Reputation: 5531

Here is a simple class that uses the regex package. You provide a query with one or more space separated terms in the format: "term{limit}" (exs: "home{5}", "class{4} music{4}").


limits {x}:
  1. indicate that 0 to limit non-whitespace characters may appear where the limit is placed
  2. can be placed anywhere in the term (ex: "{1}ward{4}")
  3. can be indicated with "~" instead of {x} (ex: "~ward~~~~")

import regex
from   typing import Iterator

# alias long-winded types
Iter  = list|tuple|set
Flags = None|regex.RegexFlag
Match = regex.Match

__all__ = ('FuzzBiz', )

"""
https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109
"""
class FuzzBiz:
    CHAR = '~' # substitution and deletion character
        
    # `.sub` replacement method    
    @staticmethod
    def _repl(match:Match) -> str:
        return FuzzBiz.CHAR * int(match.group(1))
            
    # format the query for fuzzy matching        
    @staticmethod
    def _term(query:str) -> str:
        # where x is an integer, sub "{x}" with x amount of CHAR characters
        term  = regex.sub(r'\{(\d+)\}', FuzzBiz._repl, query)
        limit = term.count(FuzzBiz.CHAR) # determine limit
        
        # return formatted term
        return f'({term}){{{limit+1}i+1d+1s<={limit}:\S}}'
    
    """ find all instances that match a singular search
        `text` : the text to be searched
        `query`: the term to search for
        `skip` : Iter of words and/or characters that trigger a skip when found in a result - DEFAULT: []
        `flags`: regex flags to apply to this expression                                    - DEFAULT: regex.VERSION1
    """        
    @staticmethod
    def finditer(text:str, query:str, skip:None|Iter=None, flags:Flags=None) -> Iterator:
        # default optional arguments
        flags = flags or regex.VERSION1
        skip  = skip  or []
        
        # process all query terms and format to pattern
        expr = r'\s'.join(map(FuzzBiz._term, query.split(' ')))
        expr = fr'\m(?P<result>{expr})\M'
        
        for m in regex.finditer(expr, text, flags=flags):
            result = m['result']
            
            # determine if result should be skipped
            for item in skip:
                if item in result: break
            else:
                yield m.span(), result
     
    """ make multiple consecutive searches
        `text`           : the text to be searched
        `queries`        : Iter of search terms
        `*args|**kwargs` : passed to `.finditer` as remaining args
    """    
    @staticmethod
    def iterall(text:str, queries:Iter, *args, **kwargs) -> Iterator:
        # query is managed internally
        if kwargs.get('query'): del kwargs['query']
        
        for query in queries:
            q = query
                
            for span, match in FuzzBiz.finditer(text, query, *args, **kwargs):
                yield q, span, match
                q = None # only yield `query` the first time it is encountered

            

.iterall test
data = """ 
I headed homeward to meet with the Wardens. 
When I arrived, I was greeted by a homely man that told me the homestead was awarded 5 million dollars.
We intend to use some of the homage to create a homeless ward. 
The first piece of furniture will be my late-friend Homer's wardrobe.
"""
queries = 'home{5}', '~ward~~~', 'home{4} ward{4}', 'th~~', '{2}e'

for term, span, match in FuzzBiz.iterall(data, queries, flags=regex.I):
    if term: print(f'\n{term.upper()}')
    print(f'  {match}') 
output
HOME{5}
  homeward
  homely
  homestead
  homeless
  Homer's

~WARD~~~
  Wardens
  awarded
  ward

HOME{4} WARD{4}
  homeless ward
  Homer's wardrobe

TH~~
  the
  that
  the
  the
  The

{2}E
  the
  me
  the
  We
  use
  the
  The
  be

.finditer test
data = """ 
I would classify music as one of my favorite hobbies. 
I love classical music played by classy musicians for a classic musical. 
Beethoven can not be out-classed, music-wise - a man of class, musically gifted.
"""
query = 'class{4} music{5}'

print(f'\n{query.upper()}')
for span, match in FuzzBiz.finditer(data, query, skip=('classify', ',')):  
    print(f'  {match}')
output
CLASS{4} MUSIC{5}
  classical music
  classy musicians
  classic musical

Upvotes: 1

Franck Dernoncourt
Franck Dernoncourt

Reputation: 83177

Reddit user throwaway6560192 pointed me to a good solution: use fuzzysearch.find_near_matches().

import pprint
from fuzzysearch import find_near_matches

long_string = """1. Bob likes classical music very much.
2. This is classic music!
3. This is a classic musical. It has a lot of classical musics.
"""

query_string = "classical music"
pprint.pprint(find_near_matches(query_string, long_string, max_l_dist=5))

outputs:

[Match(start=13, end=28, dist=0, matched='classical music'),
 Match(start=51, end=64, dist=2, matched='classic music'),
 Match(start=79, end=92, dist=2, matched='classic music'),
 Match(start=112, end=127, dist=0, matched='classical music')]

And here's is my weak solution with the regex lib so far:

import regex
long_string = """1. Bob likes classical music very much.
2. This is classic music!
3. This is a classic musical. It has a lot of classical musics.
"""

query_string = "classical music"
threshold = 5

results = regex.finditer(r'(classical music){e<5}', long_string, flags=regex.IGNORECASE)

for result in results:
    print(result)

Output:

<regex.Match object; span=(9, 28), match='kes classical music', fuzzy_counts=(0, 4, 0)>
<regex.Match object; span=(49, 64), match='s classic music', fuzzy_counts=(0, 2, 2)>
<regex.Match object; span=(77, 92), match='a classic music', fuzzy_counts=(0, 2, 2)>
<regex.Match object; span=(108, 127), match=' of classical music', fuzzy_counts=(0, 4, 0)>

2 weaknesses of that solution with the regex lib:

  1. Doesn't use query_string and threshold but instead hardcode them in the regex query.
  2. Doesn't get best matches but instead get matches with 4 errors.

Upvotes: 3

Related Questions