Reputation: 23
I'm trying to compare two different texts—one coming from a Curriculum Vitae (CV) and the other from a job announcement.
After cleaning the texts, I'm trying to compare them to determine how closely a job announcement matches a specific CV.
I am trying to do this using similarity matching in spaCy via the following code:
similarity = pdf_text.similarity(final_text_from_annonce)
This works, but I'm getting strange results for two different CVs with the same job announcement. Specifically, both get the same similarity score (~0.6), even though one should clearly score higher than the other.
I checked the spaCy website and found this very important sentence:
Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.
So, what do I need to use or code to make spaCy compare my two texts based on their meaning instead of the occurrence of words?
I am expecting a parameter for the similarity function of spaCy, or another function, that will compare both of my texts and calculate a similarity score based on the meaning of the texts rather than on whether the same words are used.
Upvotes: 2
Views: 1570
Reputation: 3710
By default, the spaCy library uses the average of the word embeddings of the words in a sentence to determine semantic similarity. This can be thought of as a naive sentence-embedding approach. Such an approach can work, but if you use it, it is recommended that you first filter out non-meaningful words (e.g. stop words) to prevent them from undesirably influencing the final sentence embedding.
The alternative (and more reliable) solution is to use a different pipeline within spaCy that has been designed to use sentence embeddings created specifically with a dedicated sentence encoder (e.g. the Universal Sentence Encoder (USE) [1] by Cer et al.). Martino Mensio created a package called spacy-universal-sentence-encoder that makes use of this model. Install it via the following command in your command prompt:
pip install spacy-universal-sentence-encoder
Then you can compute the semantic similarity between sentences as follows:
import spacy_universal_sentence_encoder
# Load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')
# Create two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# Use the similarity method to compare the full documents (i.e. sentences)
print(doc_1.similarity(doc_2)) # Output: 0.9356049733134972
# Or make the comparison using a predefined span of the second document
print(doc_1.similarity(doc_2[0:7])) # Output: 0.9739387861159459
As a side note: the first time you run nlp = spacy_universal_sentence_encoder.load_model('en_use_lg'), you may need to do so with administrator rights so that TensorFlow can create the models folder in C:\Program Files\Python310\Lib\site-packages\spacy_universal_sentence_encoder and download the appropriate model. Otherwise you may get a PermissionDeniedError and the code will not run.
[1] Cer, D., Yang, Y., Kong, S.Y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C. and Sung, Y.H., 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Upvotes: 1