SantoshGupta7
SantoshGupta7

Reputation: 6197

How to get all fuzzy matching substrings between two strings in python?

Say I have three example strings

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

If I got all the matching substrings of text2 and text3 with text1, I would get

text1_text2_common = [
    '3 days ago.',
]

text2_text3_common = [
    'of',
]

text1_text3_common = [
    'was',
    'idx'
    'every'
    'hours'
]

What I am looking for is a fuzzy matching, using something like the Levenshtein distance. So even if the substrings are not exact, if they are similar enough for a criteria, it would get selected as a substring.

So ideally I am looking for something like this:

text1_text3_common_fuzzy = [
    'prescription of idx, 20mg to be given every four hours'
]

Upvotes: 5

Views: 2413

Answers (4)

Viacheslav Ivannikov
Viacheslav Ivannikov

Reputation: 732

short answer:

def match(a,b):
    a,b = a.lower(), b.lower()
    error = 0
    for i in string.ascii_lowercase:
            error += abs(a.count(i) - b.count(i))
    total = len(a) + len(b)
    return (total-error)/total
def what_you_need(s1, s2, fuziness=0.8):
    if match(s1, s2) < fuziness:
        syms = []
        for s in s1:
            if s in s2:
                syms.append(s)
        return "".join(syms)

Upvotes: 1

plunker
plunker

Reputation: 1387

The most performant algorithm for this would be constructing a finite state transducer and utilizing it for fuzzy matching.

However there are other approaches that are simpler such as prefix tries.

Check out the links below to read much more knowledgeable people regarding this write extensively on the topic.

Peter Norvig's How to Write a Spelling Corrector contains examples of implementing this for something like Google's search bar that automatically corrects mispellings and links to examples in other languages.

Kay Schlühr's articles on fuzzy string matching cover this topic in much more depth and with much higher quality than I can put together.

If you are more interested in how Lucene, the backing library of ElasticSearch, does this, Michael McCandless wrote a blog post on FSTs and an academic paper on the topic.

If you are looking for the (current) bleeding edge, checkout pisa which has state of the art information retrieval algorithms.

Upvotes: 1

ferdy
ferdy

Reputation: 5034

Here is a code to calculate the similarity by fuzzy ratio between the sub-string of string1 and full-string of string2. The code can also handle sub-string of string2 and full-string of string1 and also sub-string of string1 and sub-string of string2.

This one uses nltk to generate ngrams.

Typical algorithm:

  1. Generate ngrams from the given first string.
    Example:
    text2 = "The time of discomfort was 3 days ago."
    total_length = 8

In the code the param has values 5, 6, 7, 8.
param = 5
ngrams = ['The time of discomfort was', 'time of discomfort was 3', 'of discomfort was 3 days', 'discomfort was 3 days ago.']

  1. Compare it with second string.
    Example:
    text1 = Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

@param=5

  • compare The time of discomfort was vs text1 and get the fuzzy score
  • compare time of discomfort was 3 vs text1 and get the fuzzy score
  • and so on until all elements in ngrams_5 are finished
    Save sub-string if fuzzy score is greater than or equal to given threshold.

@param=6

  • compare The time of discomfort was 3 vs text1 and get the fuzzy score
  • and so on

until @param=8

You can revise the code changing n_start to 5 or so, so that the ngrams of string1 will be compared to the ngrams of string2, in this case this is a comparison of sub-string of string1 and sub-string of string2.

# Generate ngrams for string2
n_start = 5  # st2_length
for n in range(n_start, st2_length + 1):
    ...

For comparison I use:

fratio = fuzz.token_set_ratio(fs1, fs2)

Have a look at this also. You can try different ratios as well.

Your sample 'prescription of idx, 20mg to be given every four hours' has a fuzzy score of 52.

See sample console output.

7                    prescription of idx, 20mg to be given every four hours           52

Code

"""
fuzzy_match.py

https://stackoverflow.com/questions/72017146/how-to-get-all-fuzzy-matching-substrings-between-two-strings-in-python

Dependent modules:
    pip install pandas
    pip install nltk
    pip install fuzzywuzzy
    pip install python-Levenshtein

"""


from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz


# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 and compare with st2.
    The sub-strings, full string and fuzzy ratio will be saved in csv file.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)
            
            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save sub string if ratio is within threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: is a string where we extract the sub-string.
    full: is a string as the base/reference.
    threshold: is the minimum fuzzy ratio where we will save the sub string. Max fuzz ratio is 100.
    """   
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. High ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)

    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combo.


if __name__ == '__main__':
    main()

Console Output

                                  string2  fuzzy_ratio
0              discomfort was 3 days ago.           72
1           of discomfort was 3 days ago.           67
2      time of discomfort was 3 days ago.           60
3                of discomfort was 3 days           59
4  The time of discomfort was 3 days ago.           55
5           time of discomfort was 3 days           51

sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                                                    string3  fuzzy_ratio
0                                                 be given every four hours           61
1                                    idx, 20mg to be given every four hours           58
2        was given a prescription of idx, 20mg to be given every four hours           56
3                                              to be given every four hours           56
4   John was given a prescription of idx, 20mg to be given every four hours           56
5                                 of idx, 20mg to be given every four hours           55
6              was given a prescription of idx, 20mg to be given every four           52
7                    prescription of idx, 20mg to be given every four hours           52
8            given a prescription of idx, 20mg to be given every four hours           52
9                  a prescription of idx, 20mg to be given every four hours           52
10        John was given a prescription of idx, 20mg to be given every four           52
11                                              idx, 20mg to be given every           51
12                                        20mg to be given every four hours           50

sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                  string2  fuzzy_ratio
0      time of discomfort was 3 days ago.           41
1           time of discomfort was 3 days           41
2                time of discomfort was 3           40
3                of discomfort was 3 days           40
4  The time of discomfort was 3 days ago.           40
5           of discomfort was 3 days ago.           39
6       The time of discomfort was 3 days           39
7              The time of discomfort was           38
8            The time of discomfort was 3           35
9              discomfort was 3 days ago.           34

sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours

Sample CSV output

string2_string1.csv

enter image description here

Using Spacy similarity

Here is the result of the comparison between sub-string of text3 and full text of text1 using spacy.

The result below is intended to be compared with the 2nd table above to see which method presents a better ranking of similarity.

I use the large model to get the result below.

Code

import spacy
import pandas as pd


nlp = spacy.load("en_core_web_lg")

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

text3_sub = [
    'be given every four hours', 'idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four hours',
    'to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four hours',
    'of idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four',
    'prescription of idx, 20mg to be given every four hours',
    'given a prescription of idx, 20mg to be given every four hours',
    'a prescription of idx, 20mg to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four',
    'idx, 20mg to be given every',
    '20mg to be given every four hours'
]


data = []
for s in text3_sub:
    doc1 = nlp(s)
    doc2 = nlp(text1)
    sim = round(doc1.similarity(doc2), 3)
    data.append([s, text1, sim])

df = pd.DataFrame(data)
df.columns = ['from text3', 'text1', 'similarity']
df = df.sort_values(by=['similarity'], ascending=[False])
df = df.reset_index(drop=True)

df1 = df[['from text3', 'similarity']]
print(df1.to_string())

print()
print(f'text3: {text3}')
print(f'text1: {text1}')

Output

                                                                 from text3  similarity
0        was given a prescription of idx, 20mg to be given every four hours       0.904
1   John was given a prescription of idx, 20mg to be given every four hours       0.902
2                  a prescription of idx, 20mg to be given every four hours       0.895
3                    prescription of idx, 20mg to be given every four hours       0.893
4            given a prescription of idx, 20mg to be given every four hours       0.892
5                                 of idx, 20mg to be given every four hours       0.889
6                                    idx, 20mg to be given every four hours       0.883
7              was given a prescription of idx, 20mg to be given every four       0.879
8         John was given a prescription of idx, 20mg to be given every four       0.877
9                                         20mg to be given every four hours       0.877
10                                              idx, 20mg to be given every       0.835
11                                             to be given every four hours       0.834
12                                                be given every four hours       0.832

text3: John was given a prescription of idx, 20mg to be given every four hours
text1: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

It looks like the spacy method produces a nice ranking of similarity.

Upvotes: 7

D-E-N
D-E-N

Reputation: 1272

Maybe you can use this as an entry to your solution. You can split your texts into words an group them and compare them to the other substring (with same size) and return them above a specific ratio. I removed commas and dots, because they are not relevant for me. You can use any other comparison tool instead of SequenceMatcher (in some you do not have to split both sides into equal size substrings. You may have a look at fuzzywuzzy).

You have to play with the ratio, to get the result you want. Additionally you have to look through the results to remove results, that are a substring of a other result. This depends on your needs:

from difflib import SequenceMatcher

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def fuzzy_match(t1: str, t2: str, min_len: int = 3, max_len: int = 10, ratio_to_get: int = 0.6):
    t1 = t1.replace(".", "").replace(",", "")
    t2 = t2.replace(".", "").replace(",", "")

    result = set()
    t2_splitted = t2.split(" ")
    t1_splitted = t1.split(" ")
    for count in range(min_len, max_len, 1):
        for pos_2 in range(len(t2_splitted) - count + 1):
            substr_2 = " ".join(t2_splitted[pos_2: pos_2 + count])
            for pos_1 in range(len(t1_splitted) - count + 1):
                substr_1 = " ".join(t1_splitted[pos_1: pos_1 + count])
                ratio = SequenceMatcher(None, substr_1, substr_2).ratio()
                if ratio >= ratio_to_get:
                    result.add(substr_1)

    return result


if __name__ == '__main__':
    print(fuzzy_match(text1, text2))
    print(fuzzy_match(text2, text1))
    print(fuzzy_match(text1, text3))
    print(fuzzy_match(text3, text1))
    print(fuzzy_match(text2, text3))
    print(fuzzy_match(text3, text2))

outputs:

{'days ago Patient', '3 days ago', 'started 3 days ago', 'which started 3 days ago', '3 days ago Patient', 'started 3 days'}
{'was 3 days', '3 days ago', 'discomfort was 3 days ago', 'was 3 days ago'}
{'prescribed idx 20 mg', 'mg every 4 hours', 'idx 20 mg every 4', '20 mg every 4 hours', 'Patient was prescribed idx 20 mg every 4', 'ago Patient was prescribed idx 20 mg every 4', 'ago Patient was prescribed idx 20 mg', 'was prescribed idx 20', 'ago Patient was prescribed idx 20', 'ago Patient was prescribed idx 20 mg every', 'prescribed idx 20', 'was prescribed idx 20 mg every', 'Patient was prescribed idx 20 mg', 'ago Patient was prescribed idx', 'was prescribed idx 20 mg', 'Patient was prescribed', 'prescribed idx 20 mg every', 'idx 20 mg', 'Patient was prescribed idx 20 mg every', 'was prescribed idx 20 mg every 4', 'idx 20 mg every', 'mg every 4', '20 mg every', 'Patient was prescribed idx 20', 'prescribed idx 20 mg every 4', 'every 4 hours'}
{'of idx 20mg to', 'given a prescription of idx 20mg to be', 'idx 20mg to be given', 'given a prescription of idx', 'be given every', 'idx 20mg to', 'every four hours', 'given every four', 'given a prescription of idx 20mg to', 'prescription of idx 20mg to be', 'a prescription of idx 20mg', 'prescription of idx 20mg to', 'prescription of idx 20mg', 'given a prescription of idx 20mg to be given', 'a prescription of idx 20mg to be', 'idx 20mg to be', 'given a prescription', 'to be given every four hours', 'be given every four', 'given every four hours', 'a prescription of idx 20mg to', 'be given every four hours', 'given a prescription of idx 20mg', 'a prescription of idx', 'prescription of idx', 'of idx 20mg'}
set()
set()

Upvotes: 1

Related Questions