Reputation: 1
I am comparing the content of CVs (.txt files with stop-words already removed) with really compact job descriptions (JDs), like this:
project management, leadership, sales, SAP, marketing
The CVs have around 600 words and the JDs only the words highlighted above.
The problem I am currently experiencing, and I am sure it is due to my lack of knowledge, is that when I apply similarity measures to these texts, I get confusing results. For example, CV 1 contains all the words from the JD, some of them more than once. CV 2 shares only the word "project" with the JD. Even so, when I apply cosine similarity, diff, Jaccard distance and edit distance, all of these measures report a higher degree of similarity between CV 2 and the JD, which seems strange to me, because they have only one word in common, while CV 1 contains every word from the JD.
Am I applying the wrong measures to assess similarity? I am sorry if this is a naive question; I am a beginner at programming.
The code follows.
Diff
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
similar('job.txt','LucasQuadros.txt')
0.43478260869565216
similar('job.txt','BrunaA.Fernandes.txt')
0.2962962962962963
Cosine
from sklearn.feature_extraction.text import TfidfVectorizer
document= ('job.txt','LucasQuadros.txt','BrunaA.Fernandes')
tfidf = TfidfVectorizer().fit_transform(document)
matrix= tfidf * tfidf.T
matrix.todense()
matrix([[1. , 0.36644682, 0. ],
[0.36644682, 1. , 0. ],
[0. , 0. , 1. ]])
Edit distance
import nltk
w1= ('job.txt')
w2= ('LucasQuadros.txt')
w3= ('BrunaA.Fernandes.txt')
nltk.edit_distance(w1,w2)
11
nltk.edit_distance(w1,w3)
16
Jaccard distance
import nltk
a1= set('job.txt')
a2= set('LucasQuadros.txt')
a3= set('BrunaA.Fernandes.txt')
nltk.jaccard_distance(a1,a2)
0.7142857142857143
nltk.jaccard_distance(a1,a3)
0.8125
As you can see, 'LucasQuadros.txt' (CV1) has a higher similarity with 'job.txt' (the job description), even though it contains only one word from the job description.
Upvotes: 0
Views: 405
Reputation: 1
I have realized what I did wrong. When I write a line of code like the one below, I am comparing the strings 'job.txt' and 'LucasQuadros.txt', not the documents themselves.
similar('job.txt','LucasQuadros.txt')
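A minimal sketch of why this happens: SequenceMatcher only ever sees the two strings it is given, so with file names as arguments it matches the characters of the names. The variable name_score is only an illustrative name.

```python
from difflib import SequenceMatcher

def similar(a, b):
    # Ratio of matching characters between the two strings passed in
    return SequenceMatcher(None, a, b).ratio()

# Both arguments are plain strings, so only the file names are compared.
# The matching characters are "o" and ".txt": 5 characters, while the
# two names together have 7 + 16 = 23 characters.
name_score = similar('job.txt', 'LucasQuadros.txt')
print(name_score)  # 2 * 5 / 23, the 0.4347... value reported in the question
```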
To fix that, I simply added a call to .read() in my code:
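from difflib import SequenceMatcher

# Read the file contents so that the texts are compared, not the file names
with open('job.txt') as f:
    jd = f.read()
with open('LucasQuadros.txt') as f:
    cv1 = f.read()
# cv2 is assumed to be the second CV from the question
with open('BrunaA.Fernandes.txt') as f:
    cv2 = f.read()

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()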
similar(jd, cv1)
0.0
similar(jd,cv2)
0.007104795737122558
Now the similarity is correct. As I said above, it was very much a beginner's mistake.
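The same mistake affects the Jaccard code in the question: set('job.txt') builds a set of the characters in the file name. Below is a rough sketch of a word-level version, assuming the text has already been read into strings. The tokens helper and both CV snippets are invented for illustration; only the JD line comes from the question.

```python
import re

def tokens(text):
    # Lowercased set of words; punctuation such as commas is dropped
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_distance(a, b):
    # 1 - |intersection| / |union|, the usual Jaccard distance on sets
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

# The JD is the example from the question; the CV texts are made-up stand-ins
jd = tokens("project management, leadership, sales, SAP, marketing")
cv_a = tokens("sales and marketing leadership, SAP project management")
cv_b = tokens("software project for a hospital")

print(jaccard_distance(jd, cv_a))  # small distance: every JD word appears
print(jaccard_distance(jd, cv_b))  # large distance: only "project" is shared
```

With real files, the strings passed to tokens would come from open(...).read() as shown above. A 600-word CV will still have many words that are not in the JD, so the distance stays fairly high even for a good match.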
Upvotes: 0