Reputation: 73
import spacy

nlp = spacy.load("en_core_web_lg")  # model with word vectors

sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")
Here are three sentences whose similarity I want to compare, and obviously sent1 should be more similar to sent3 than to sent2.
sent1.similarity(sent2) = 0.9492464724721577
sent1.similarity(sent3) = 0.9239675481730458
As you can see from the output, sent1 is more similar to sent2. What's wrong with my code?
Upvotes: 3
Views: 3798
Reputation: 95
This code calculates the similarity of two or more text files:
import glob
import spacy

spacy.prefer_gpu()
nlp = spacy.load('pt_core_news_lg')  # or nlp = spacy.load('en_core_web_lg')

def get_file_contents(filename):
    # Read a whole text file; print the error and return None on failure.
    try:
        with open(filename, 'r') as filehandle:
            return filehandle.read()
    except Exception as e:
        print(e)

try:
    used = []  # files already taken as arquivo1, so each pair is compared only once
    for arquivo1 in glob.glob("F:\\summary\\RESUMO\\*.txt"):
        used.append(arquivo1)
        for arquivo2 in glob.glob("F:\\summary\\RESUMO\\*.txt"):
            if str(arquivo2) not in used:
                print(arquivo1 + " vs " + arquivo2)
                doc1 = nlp(get_file_contents(arquivo1))
                doc2 = nlp(get_file_contents(arquivo2))
                print("similarity = " + str("%.2f" % (float(doc1.similarity(doc2)) * 100)) + "%\n")
except Exception as e:
    print(e)
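As an aside, the double loop plus the used list can be written more compactly with itertools.combinations, which yields each unordered pair of files exactly once. A minimal sketch, assuming the same folder of .txt files:

import glob
import itertools
import spacy

nlp = spacy.load('en_core_web_lg')

files = glob.glob("F:\\summary\\RESUMO\\*.txt")
for arquivo1, arquivo2 in itertools.combinations(files, 2):
    # each pair appears once, so no bookkeeping list is needed
    with open(arquivo1, 'r') as f1, open(arquivo2, 'r') as f2:
        doc1, doc2 = nlp(f1.read()), nlp(f2.read())
    print("%s vs %s: %.2f%%" % (arquivo1, arquivo2, doc1.similarity(doc2) * 100))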
Upvotes: 1
Reputation: 3099
There's nothing wrong with your code. Sentence similarity in spaCy is based on word embeddings, and it's a well-known weakness of word embeddings that they have a hard time distinguishing between synonyms (happy-joyous) and antonyms (happy-sad).
Based on your numbers, you might already be doing this, but make sure you're using spaCy's large English model, en_core_web_lg, to get the best word embeddings.
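For example, here is a quick way to confirm that real word vectors are available. This is a minimal sketch and assumes en_core_web_lg has already been downloaded (python -m spacy download en_core_web_lg):

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("I am happy")
# has_vector is False when the loaded model ships without word vectors,
# in which case similarity() falls back to much weaker signals
print(doc.has_vector, len(doc.vector))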
For more accurate embeddings of full sentences, it might be worth checking out alternatives such as Google's Universal Sentence Encoder. See: https://tfhub.dev/google/universal-sentence-encoder/4
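For illustration, a minimal sketch of scoring the same three sentences with the Universal Sentence Encoder via TensorFlow Hub. It assumes tensorflow and tensorflow_hub are installed; the model is downloaded from the URL above on first use:

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = embed(["I am happy", "I am sad", "I am joyous"]).numpy()

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("happy vs sad:   ", cosine(vectors[0], vectors[1]))
print("happy vs joyous:", cosine(vectors[0], vectors[2]))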
Upvotes: 11