James Wong

Reputation: 1137

Python regular expression for strings similarity comparison

I found that SequenceMatcher from the difflib library can return a similarity score between two strings. However, one of its arguments, isjunk, is a little tricky to deal with, especially with regular expressions.

Take two strings for example:

a = 'Carrot 500g'
b = 'Cabbage 500g'

from difflib import SequenceMatcher
import re

def similar_0(a, b):
    return SequenceMatcher(None, a, b).ratio()

similar_0(a, b)

def similar_1(a, b):
    return SequenceMatcher(lambda x: bool(re.search(r'\b(\d)+([a-zA-Z])+\b', x)), a, b).ratio()

similar_1(a, b)

When comparing these two strings, I want to ignore unit information such as "500g". However, similar_0 and similar_1 return the same result. I'm confused about how the isjunk argument of SequenceMatcher works. What is the correct way to achieve this, or are there any alternatives?

Upvotes: 4

Views: 2625

Answers (1)

Aran-Fey

Reputation: 43196

Your regex doesn't work because SequenceMatcher passes individual elements of the sequence (here, single characters) to the isjunk function, not whole words:

>>> SequenceMatcher(print, 'Carrot 500g', 'Cabbage 500g')
b
0
5
a
e

g
C
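The same check can be made without reading printed output by recording every argument that isjunk receives. This is only a minimal sketch: `seen` and `record` are hypothetical names introduced for illustration, and `record` returns False so that nothing is treated as junk.

```python
from difflib import SequenceMatcher

seen = []  # hypothetical helper list: collects every argument passed to isjunk


def record(element):
    # Remember the argument; returning False means "not junk".
    seen.append(element)
    return False


SequenceMatcher(record, 'Carrot 500g', 'Cabbage 500g')

# Every recorded argument is a single character, never a word like "500g".
print(all(len(element) == 1 for element in seen))
print(sorted(set(seen)))
```

Because isjunk only ever sees one character at a time, a pattern that needs several characters in a row (digits followed by letters) can never match.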

You should just remove the junk from both strings before passing them to SequenceMatcher:

a = re.sub(r'\b(\d)+([a-zA-Z])+\b', '', a)
b = re.sub(r'\b(\d)+([a-zA-Z])+\b', '', b)
print(similar_0(a, b))
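Put together as a self-contained sketch, the approach might look like the following. The names `UNIT_PATTERN`, `strip_units` and `similar_without_units` are hypothetical, and the extra `.strip()` call (to drop the whitespace left behind once the unit is removed) is an assumption that goes slightly beyond the snippet above.

```python
import re
from difflib import SequenceMatcher

# Matches the same tokens as the pattern above: digits immediately
# followed by letters, such as "500g" or "1kg".
UNIT_PATTERN = re.compile(r'\b\d+[a-zA-Z]+\b')


def strip_units(text):
    # Remove unit tokens, then trim the whitespace they leave behind.
    return UNIT_PATTERN.sub('', text).strip()


def similar_without_units(a, b):
    # Compare only what remains after the units are stripped.
    return SequenceMatcher(None, strip_units(a), strip_units(b)).ratio()


print(similar_without_units('Carrot 500g', 'Cabbage 500g'))
print(similar_without_units('Carrot 500g', 'Carrot 1kg'))
```

With the units removed, the shared "500g" suffix no longer inflates the score for 'Carrot' versus 'Cabbage', and two strings that differ only in their unit compare as identical.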

Upvotes: 4
