Youness Drissi Slimani

Reputation: 149

Calculate similarity between list of words

I want to calculate the similarity between two lists of words. For example:

['email','user','this','email','address','customer']

is similar to this list:

['email','mail','address','netmail']

I want this pair to get a higher similarity percentage than a pair involving another list, for example ['address','ip','network'], even though 'address' appears in that list too.

Upvotes: 5

Views: 15131

Answers (3)

Greg Nelson

Reputation: 87

I'm suggesting this answer because the title of the question may bring people here looking to solve a related but different problem. If you only care about whether the words are present or absent, then one approach is Jaccard similarity. Although this can be found in a number of toolkits, it's also very easy to compute directly in Python:

def jaccard(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    u = s1 | s2
    if u:
        return float(len(s1 & s2))/float(len(u))
    else:
        return 0.0

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
list_C = ['address','ip','network']
print(jaccard(list_A, list_B))
print(jaccard(list_A, list_C))

Output

0.2857142857142857
0.14285714285714285

Jaccard similarity is undefined when both sets are empty (it would be 0/0), so the function checks for that case and returns 0.0, meaning "not similar". You could equally decide that two empty sets are perfectly similar and return 1.0. The values range from 0 to 1; convert them to percentages on printout if you prefer.

Neither of them returns anything close to 80%, but that's because this approach only looks for words that match exactly, not for "near matches" like "email", "mail", and "netmail". For that you need something like nltk, for example nltk.corpus.reader.wordnet. The Jaccard approach is also insensitive to the fact that 'email' appears twice in list_A, but I don't think the question makes clear how that should be treated: when a word appears twice in A and only once in B, does that increase the similarity (because there are multiple matching pairs) or decrease it (because you want the word frequencies to be similar between the lists)?
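If you decide that differing word frequencies should lower the score, one option is a multiset (weighted) variant of Jaccard built on collections.Counter. This is only a sketch under that assumption; the name multiset_jaccard is made up for illustration.

```python
from collections import Counter

def multiset_jaccard(list1, list2):
    """Jaccard on word counts instead of word presence."""
    c1, c2 = Counter(list1), Counter(list2)
    inter = sum((c1 & c2).values())  # per-word minimum of the two counts
    union = sum((c1 | c2).values())  # per-word maximum of the two counts
    # same convention as above: two empty lists count as "not similar"
    return inter / union if union else 0.0

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

print(multiset_jaccard(list_A, list_B))  # 2 / 8
print(multiset_jaccard(list_A, list_C))  # 1 / 8
```

The second 'email' in list_A now counts as an unmatched word, so both scores drop slightly compared with the set-based version.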

Upvotes: 0

KRKirov

Reputation: 4004

You can leverage the power of Scikit-Learn (or other NLP) libraries to accomplish this. The example below uses CountVectorizer, but for more sophisticated analysis of documents it might be preferable to use the TFIDF vectorizer instead.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vect_cos(vect, test_list):
    """ Vectorise text and compute the cosine similarity """
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    query_0 = vect.transform([' '.join(vect.get_feature_names_out())])
    query_1 = vect.transform(test_list)
    # compare the full-vocabulary vector with the vector(s) for test_list
    cos_sim = cosine_similarity(query_0.toarray(), query_1.toarray())
    return query_1, np.round(cos_sim.squeeze(), 3)

# Train the vectorizer
vocab=['email','user','this','email','address','customer']
vectoriser = CountVectorizer().fit(vocab)
vectoriser.vocabulary_ # show the word-matrix position pairs

# Analyse  list_1
list_1 = ['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])

# Analyse list_2
list_2 = ['address','ip','network']
list_2_vect, list_2_cos = vect_cos(vectoriser, [' '.join(list_2)])

print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
print('\nThe cosine similarity for the second list is {}.'.format(list_2_cos))

Output

The cosine similarity for the first list is 0.632.

The cosine similarity for the second list is 0.447.
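To see where these two numbers come from: the fitted vocabulary has five distinct words, the reference document is that vocabulary joined together (a vector of five ones), and each test list becomes a count vector over the same five words. Below is a plain-Python sketch of that arithmetic without scikit-learn. The helper names count_vector and cosine are mine, and the sketch ignores CountVectorizer's tokenisation details such as lowercasing and dropping one-character tokens.

```python
from math import sqrt

def count_vector(words, vocabulary):
    """Count how often each vocabulary term occurs in words."""
    return [words.count(term) for term in vocabulary]

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ['email', 'user', 'this', 'email', 'address', 'customer']
vocabulary = sorted(set(vocab))      # five distinct terms
reference = [1] * len(vocabulary)    # the vocabulary joined into one document

list_1 = ['email', 'mail', 'address', 'netmail']
list_2 = ['address', 'ip', 'network']

# list_1 hits two vocabulary words: 2 / (sqrt(5) * sqrt(2))
print(round(cosine(reference, count_vector(list_1, vocabulary)), 3))
# list_2 hits one vocabulary word: 1 / (sqrt(5) * sqrt(1))
print(round(cosine(reference, count_vector(list_2, vocabulary)), 3))
```

Words outside the fitted vocabulary ('mail', 'netmail', 'ip', 'network') are simply ignored, which is why the score depends only on how many vocabulary words each list contains.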

Edit

If you want to calculate the cosine similarity between "email" and any other list of strings, train the vectoriser on "email" alone and then analyse the other documents.

# Train the vectorizer
vocab=['email']
vectoriser = CountVectorizer().fit(vocab)

# Analyse  list_1
list_1 =['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])
print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))

Output

The cosine similarity for the first list is 1.0.

Upvotes: 5

DirtyBit

Reputation: 16772

Since you haven't really shown a clear expected output, here is my best shot:

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']

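For the two lists above, we will find the cosine similarity between each element of one list and every element of the other, e.g. email from list_B with every element in list_A. Each word is represented by its character counts (the word2vec helper below is unrelated to the Word2Vec embedding model):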

from collections import Counter
from math import sqrt

def word2vec(word):
    # count the characters in word
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c*c for c in cw.values()))

    # return a tuple: (character counts, character set, norm)
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # cosine similarity: dot product divided by the product of the norms
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]


list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']

threshold = 0.80     # if needed
for key in list_A:
    for word in list_B:
        res = cosdis(word2vec(word), word2vec(key))
        print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
        # if res > threshold:
        #     print("Found a word with cosine similarity > 80% : {} with original word: {}".format(word, key))

OUTPUT:

The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365

Note: I have also commented out the threshold part in the code, in case you only want the words whose similarity exceeds a certain threshold, e.g. 80%.
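The word-by-word scores can also be rolled up into a single list-level number that rewards near matches such as "mail" and "netmail". The sketch below is one possible aggregation, not part of the approach above: for every word in the candidate list it takes the best character-count cosine against the reference list and averages those best scores. The names char_cosine and best_match_score are hypothetical, and note that the result depends on which list is treated as the candidate.

```python
from collections import Counter
from math import sqrt

def char_cosine(w1, w2):
    """Cosine similarity between the character-count vectors of two words."""
    c1, c2 = Counter(w1), Counter(w2)
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm = sqrt(sum(n * n for n in c1.values())) * sqrt(sum(n * n for n in c2.values()))
    return dot / norm if norm else 0.0

def best_match_score(candidate, reference):
    """Average, over candidate words, of the best match found in reference."""
    if not candidate or not reference:
        return 0.0
    best = [max(char_cosine(w, r) for r in reference) for w in candidate]
    return sum(best) / len(best)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

score_B = best_match_score(list_B, list_A)
score_C = best_match_score(list_C, list_A)
print("list_B vs list_A: {:.1f}%".format(score_B * 100))
print("list_C vs list_A: {:.1f}%".format(score_C * 100))
```

With these lists, list_B scores roughly 93% and list_C roughly 63%. Keep in mind that character counts only capture spelling overlap, so anagrams score 100% and unrelated words that share letters (such as "network" and "customer") still get a fairly high score.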

EDIT:

OP: but what I want to do is not a word-by-word comparison but a list-by-list one

Using Counter and math:

from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)


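def counter_cosine_similarity(c1, c2):
    # treat each Counter as a sparse word-count vector
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    if magA == 0 or magB == 0:
        return 0.0  # an empty list has no direction; treat it as not similar
    return dotprod / (magA * magB)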

print(counter_cosine_similarity(counterA, counterB) * 100)

OUTPUT:

53.03300858899106
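For comparison, the same word-count cosine can be applied to the other list from the question, ['address','ip','network']. The sketch below repeats the calculation in compact form so that it runs on its own; the name cosine_of_counts is mine.

```python
from collections import Counter
import math

def cosine_of_counts(words1, words2):
    """Cosine similarity between the word-count vectors of two lists."""
    c1, c2 = Counter(words1), Counter(words2)
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(n * n for n in c1.values())) * math.sqrt(sum(n * n for n in c2.values()))
    return dot / norm if norm else 0.0

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

print(round(cosine_of_counts(list_A, list_B) * 100, 2))  # 3 / sqrt(8 * 4)
print(round(cosine_of_counts(list_A, list_C) * 100, 2))  # 1 / sqrt(8 * 3)
```

list_B comes out at about 53% and list_C at about 20%, so the list sharing both 'email' and 'address' ranks higher, although exact-match counting still gives no credit for 'mail' or 'netmail'.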

Upvotes: 13
