John Aiton

Reputation: 85

How to iterate through two Python lists to get words similar in Spanish and English

I have two long lists, one with English words, the other with the Spanish translations from Google Translate. The order corresponds exactly, e.g.:

english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']

I need to get all the word pairs from the two lists that are more or less similar in terms of their letters.

I first thought about checking for similar letters at the beginning of the two words, but then realized that in some cases similar words have slightly different beginnings (such as "prejudicial" and "perjudicial").

The desired output is a table with two columns under the headings "English" and "Spanish" that contains the similar words but excludes those that look different:

English        Spanish
prejudicial    perjudicial
malignant      maligno
ratify         ratificar

Upvotes: 1

Views: 126

Answers (2)

lenik

Reputation: 23538

First, install: pip install -U python-Levenshtein

Then:

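import Levenshtein

# english_list and spanish_list are the two lists from the question.
for a, b in zip(english_list, spanish_list):
    if Levenshtein.distance(a, b) < 3:    # close enough; tune the threshold for your data
        print('similar words:', a, b)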

Here's an explanation of how Levenshtein distance works: https://en.wikipedia.org/wiki/Levenshtein_distance. If you prefer a different similarity metric, you can use that instead, but this one is quite good and has worked well for me in the past.

Levenshtein can calculate the ratio(...) as well:

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), because it's
    based on real minimal edit distance.
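To make the idea concrete without installing anything, here is a minimal pure-Python sketch of the edit distance described on the Wikipedia page. The names edit_distance and normalized_similarity are hypothetical helpers for illustration, not part of the Levenshtein package, and the length-scaled score is only an assumption about what "more or less similar" could mean. It is not the formula the package's ratio() uses.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching char
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Hypothetical score: 1 minus the distance scaled by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']

for eng, spa in zip(english_list, spanish_list):
    print(eng, spa, edit_distance(eng, spa), round(normalized_similarity(eng, spa), 2))
```

One thing this makes visible: a fixed distance cutoff treats short and long words the same, so a score that takes word length into account may separate pairs like "ratify"/"ratificar" from "dire"/"grave" more reliably. Treat any threshold as something to tune on your own data.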

Upvotes: 1

han solo

Reputation: 6590

You could use difflib and check the similarity ratio of each pair:

$ cat similar.py

from difflib import SequenceMatcher

english_list = ['prejudicial','dire','malignant','appalling', 'ratify']
spanish_list =['perjudicial', 'grave', 'maligno', 'atroz','ratificar']

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


print('English', 'Spanish')
for eng, span in zip(english_list, spanish_list):
    if similarity(eng, span) >= 0.5:
        print(eng, span)

Output:

$ python3 similar.py
English Spanish
prejudicial perjudicial
malignant maligno
ratify ratificar
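Since the question asks for a two-column table, here is a small sketch of the same difflib approach with padded columns. The helper names similar_pairs and format_table and the two-space column gap are assumptions made for illustration; only SequenceMatcher and the 0.5 threshold come from the script above.

```python
from difflib import SequenceMatcher


def similar_pairs(english_words, spanish_words, threshold=0.5):
    """Return (english, spanish) pairs whose difflib ratio meets the threshold."""
    return [
        (eng, spa)
        for eng, spa in zip(english_words, spanish_words)
        if SequenceMatcher(None, eng, spa).ratio() >= threshold
    ]


def format_table(pairs, headers=("English", "Spanish")):
    """Left-align the first column, padded to its widest entry plus two spaces."""
    rows = [headers, *pairs]
    width = max(len(left) for left, _ in rows) + 2
    return "\n".join(f"{left:<{width}}{right}" for left, right in rows)


english_list = ['prejudicial', 'dire', 'malignant', 'appalling', 'ratify']
spanish_list = ['perjudicial', 'grave', 'maligno', 'atroz', 'ratificar']

print(format_table(similar_pairs(english_list, spanish_list)))
```

Keeping the filtering separate from the formatting means the pairs can also be written to a CSV file or loaded into a DataFrame if a plain-text table is not enough.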

As a side note, depending on your use case, you should compare difflib vs. Levenshtein.

Upvotes: 0
