Reputation: 85
I have two long lists, one with English words and the other with their Spanish translations from Google Translate. The order corresponds exactly, e.g.
english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']
I need to get all the word pairs from the two lists that are more or less similar in spelling.
I first thought about checking for matching letters at the beginning of the two words, but then realized that in some cases similar words have slightly different beginnings (such as "prejudicial" and "perjudicial").
The desired output is a table with two columns under the headings "English" and "Spanish" that keeps the similar words and excludes the pairs that look different:
English      Spanish
prejudicial  perjudicial
malignant    maligno
ratify       ratificar
Upvotes: 1
Views: 126
Reputation: 23538
First, install: pip install -U python-Levenshtein
Then:
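import Levenshtein

# english_list and spanish_list are the two lists from the question
for a, b in zip(english_list, spanish_list):
    if Levenshtein.distance(a, b) < 3:  # close enough; tune the threshold for your data
        print('similar words:', a, b)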
Here's an explanation of how Levenshtein distance works: https://en.wikipedia.org/wiki/Levenshtein_distance
If you prefer a different similarity metric, you can use that instead, but this one is quite good and has worked well for me in the past.
Levenshtein can calculate the ratio(...) as well:
ratio(string1, string2)
The similarity is a number between 0 and 1, it's usually equal or
somewhat higher than difflib.SequenceMatcher.ratio(), because it's
based on real minimal edit distance.
Upvotes: 1
Reputation: 6590
You could use difflib and check each pair's similarity ratio, like this:
$ cat similar.py
from difflib import SequenceMatcher

english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

print('English', 'Spanish')
for eng, span in zip(english_list, spanish_list):
    if similarity(eng, span) >= 0.5:
        print(eng, span)
Output:
$ python3 similar.py
English Spanish
prejudicial perjudicial
malignant maligno
ratify ratificar
As a side note, depending on your use case, you should compare difflib and Levenshtein before settling on one.
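If you are unsure where to put the cutoff, it can help to print the score for every pair before filtering. The following is only a sketch built on the same similarity() helper; the THRESHOLD name, the "(dropped)" marker and the column widths are my own choices for illustration.

```python
from difflib import SequenceMatcher

english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']


def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


# Score every pair once so the cutoff can be adjusted without recomputing.
scores = {(eng, span): similarity(eng, span)
          for eng, span in zip(english_list, spanish_list)}

THRESHOLD = 0.5  # assumed cutoff, same value as in the script above
kept = [pair for pair, score in scores.items() if score >= THRESHOLD]

# Aligned three-column table: both words plus the ratio that decided their fate.
print(f"{'English':<12} {'Spanish':<12} {'ratio':>5}")
for (eng, span), score in scores.items():
    marker = '' if score >= THRESHOLD else '  (dropped)'
    print(f"{eng:<12} {span:<12} {score:>5.2f}{marker}")
```

Looking at the raw ratios makes it easier to see how much room there is between the pairs you want to keep and the ones you want to drop on your real data.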
Upvotes: 0