Derrick Tay

Reputation: 73

Document similarities using spacy (python)

sent1 = nlp("I am happy")

sent2 = nlp("I am sad")

sent3 = nlp("I am joyous")

Here there are three sentences whose similarities I want to compare, and obviously sent1 should be more similar to sent3 than to sent2.

sent1.similarity(sent2) = 0.9492464724721577

sent1.similarity(sent3) = 0.9239675481730458

As you can see from the output, sent1 is more similar to sent2. What's wrong with my code?

Upvotes: 3

Views: 3798

Answers (2)

Ricardo Madela

Reputation: 95

This code calculates the similarity of two or more text files:

import glob
import itertools

import spacy

spacy.prefer_gpu()

nlp = spacy.load('pt_core_news_lg')  # or nlp = spacy.load('en_core_web_lg')

def get_file_contents(filename):
    with open(filename, 'r', encoding='utf-8') as filehandle:
        return filehandle.read()

# Compare each unordered pair of files exactly once.
files = glob.glob("F:\\summary\\RESUMO\\*.txt")
for arquivo1, arquivo2 in itertools.combinations(files, 2):
    print(arquivo1 + " vs " + arquivo2)
    doc1 = nlp(get_file_contents(arquivo1))
    doc2 = nlp(get_file_contents(arquivo2))
    print("similarity = %.2f%%\n" % (doc1.similarity(doc2) * 100))

Upvotes: 1

yvespeirsman

Reputation: 3099

There's nothing wrong with your code. Sentence similarity in spaCy is based on word embeddings, and it's a well-known weakness of word embeddings that they have a hard time distinguishing between synonyms (happy-joyous) and antonyms (happy-sad).

Based on your numbers, you might already be doing this, but make sure you're using spaCy's large English model, en_core_web_lg, to get the best word embeddings.

For more accurate embeddings of full sentences, it might be worthwhile checking out alternatives such as Google's universal sentence encoder. See: https://tfhub.dev/google/universal-sentence-encoder/4
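For example, a minimal sketch of scoring the same three sentences with the universal sentence encoder (this assumes the tensorflow and tensorflow-hub packages are installed; the model is downloaded from TF Hub on first use):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    import tensorflow_hub as hub

    # Downloads the model on first use.
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    sentences = ["I am happy", "I am sad", "I am joyous"]
    vectors = embed(sentences).numpy()  # one embedding vector per sentence

    print("happy vs sad:   ", cosine_similarity(vectors[0], vectors[1]))
    print("happy vs joyous:", cosine_similarity(vectors[0], vectors[2]))
```

Sentence-level encoders like this are trained on full sentences rather than built up from word vectors, so they tend to separate cases like happy/sad vs. happy/joyous better, though exact scores will vary.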

Upvotes: 11
