Python regular expression for strings similarity comparison

Question

I found that SequenceMatcher from the difflib library can return a similarity score between two strings. However, one of its arguments, isjunk, is a bit tricky to deal with, especially in combination with regular expressions.

Take two strings for example:

a = 'Carrot 500g'
b = 'Cabbage 500g'

from difflib import SequenceMatcher
import re

def similar_0(a, b):
    return SequenceMatcher(None, a, b).ratio()

similar_0(a, b)

def similar_1(a, b):
    return SequenceMatcher(lambda x: bool(re.search(r'\b(\d)+([a-zA-Z])+\b', x)), a, b).ratio()

similar_1(a, b)

When comparing these two strings, I want to ignore all unit information such as "500g" above. But similar_0 and similar_1 return the same result. I'm confused about how the isjunk argument of SequenceMatcher works. What is the correct way to achieve this, or are there any alternatives?
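
For context, here is a minimal sketch of what is probably going on and one possible workaround. According to the difflib documentation, isjunk is a one-argument function that is called with a single sequence element. For strings, that means one character at a time, so a pattern that needs a digit followed by a letter can never match and nothing is marked as junk. The names spy, UNIT_RE, strip_units and similar_without_units are hypothetical helpers introduced only for illustration, and removing the units with re.sub before the comparison is an assumed alternative, not the only way to do it.

```python
import re
from difflib import SequenceMatcher

a = 'Carrot 500g'
b = 'Cabbage 500g'

# Record what isjunk actually receives: one element (character) of b per call.
seen = []

def spy(x):
    seen.append(x)
    return False  # treat nothing as junk

SequenceMatcher(spy, a, b)
print(seen)  # single characters only, never a whole token like '500g'

# Hypothetical pattern for a unit token: digits directly followed by letters.
UNIT_RE = re.compile(r'\b\d+[a-zA-Z]+\b')

def strip_units(text):
    """Remove tokens such as '500g' and collapse the leftover whitespace."""
    return ' '.join(UNIT_RE.sub(' ', text).split())

def similar_without_units(a, b):
    """Compare the two strings after the unit tokens have been removed."""
    return SequenceMatcher(None, strip_units(a), strip_units(b)).ratio()

print(strip_units(a), '|', strip_units(b))
print(similar_without_units(a, b))
```

With the units removed up front, the shared " 500g" suffix no longer contributes to the score, so the ratio reflects only "Carrot" versus "Cabbage".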
