Lucas

Reputation: 1

Text similarity approaches do not reflect "real" similarity between texts

I am comparing the content of CVs (.txt files with stop-words already removed) with really compact job descriptions (JDs), like this:

project management, leadership, sales, SAP, marketing

The CVs have around 600 words and the JDs only the words highlighted above.

The problem I am experiencing, and I am sure it is due to my lack of knowledge, is that when I apply similarity measures I get confusing results. For example, CV 1 contains all the words from the JD, some of them more than once, while CV 2 shares only the word "project" with the JD. Even so, when I apply cosine similarity, diff, Jaccard distance and edit distance, all of these measures report a higher degree of similarity between CV 2 and the JD. That seems strange to me, because only one word is shared between them, while CV 1 contains every word from the JD.

Am I applying the wrong measures to assess similarity? I am sorry if this is a naive question; I am a beginner at programming.

The code follows.

Diff

    from difflib import SequenceMatcher
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()
    similar('job.txt','LucasQuadros.txt')
    0.43478260869565216
    similar('job.txt','BrunaA.Fernandes.txt')
    0.2962962962962963

Cosine

    from sklearn.feature_extraction.text import TfidfVectorizer
    document= ('job.txt','LucasQuadros.txt','BrunaA.Fernandes')
    tfidf = TfidfVectorizer().fit_transform(document)
    matrix= tfidf * tfidf.T
    matrix.todense()
    matrix([[1.        , 0.36644682, 0.        ],
    [0.36644682, 1.        , 0.        ],
    [0.        , 0.        , 1.        ]])

Edit distance

    import nltk
    w1= ('job.txt')
    w2= ('LucasQuadros.txt')
    w3= ('BrunaA.Fernandes.txt')
    nltk.edit_distance(w1,w2)
    11
    nltk.edit_distance(w1,w3)
    16

Jaccard distance

    import nltk
    a1= set('job.txt')
    a2= set('LucasQuadros.txt')
    a3= set('BrunaA.Fernandes.txt')
    nltk.jaccard_distance(a1,a2)
    0.7142857142857143
    nltk.jaccard_distance(a1,a3)
    0.8125

As you can see, 'LucasQuadros.txt' (CV1) has a higher similarity with 'job.txt' (the job description), even though it contains only one word from the job description.

Upvotes: 0

Views: 405

Answers (1)

Lucas

Reputation: 1

I have realized what I did wrong. When I write a line of code like the one below, I am comparing the strings 'job.txt' and 'LucasQuadros.txt' (the file names), not the contents of the documents.

    similar('job.txt','LucasQuadros.txt')
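
The numbers in the question can be reproduced without any files on disk, which supports this diagnosis. Below is a minimal sketch using only the standard library. The `jaccard_distance` helper is a hypothetical stand-in written for illustration, assumed to compute the same set-based distance as `nltk.jaccard_distance`:

```python
from difflib import SequenceMatcher

def jaccard_distance(a, b):
    # Hypothetical stand-in for nltk.jaccard_distance: 1 - |A & B| / |A | B|
    union = a | b
    return (len(union) - len(a & b)) / len(union)

name_jd = 'job.txt'
name_cv = 'LucasQuadros.txt'

# set() on a string yields its characters, not the words stored in the file
chars_jd = set(name_jd)
chars_cv = set(name_cv)

# Both measures only ever see the two file names
name_ratio = SequenceMatcher(None, name_jd, name_cv).ratio()
name_jaccard = jaccard_distance(chars_jd, chars_cv)

print(sorted(chars_jd))
print(name_ratio, name_jaccard)
```

Both values agree with the outputs shown in the question (about 0.4348 for the diff ratio and 0.7143 for the Jaccard distance), so those scores only described how alike the two file names are.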

To fix that, I simply read the files with .read() in my code:

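    # Open each file and read its contents instead of passing the file name
    jd = open('job.txt')
    jd = jd.read()
    cv1 = open('LucasQuadros.txt')
    cv1 = cv1.read()
    # Assumption: cv2 is the second CV from the question
    cv2 = open('BrunaA.Fernandes.txt')
    cv2 = cv2.read()

    from difflib import SequenceMatcher
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

    similar(jd, cv1)
    0.0
    similar(jd, cv2)
    0.007104795737122558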

Now the similarity is computed on the contents of the documents rather than on their file names. As I said above, it was a real beginner's mistake.
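
Since the JD is just a short list of keywords, one option is to check directly how many of them each CV contains, instead of comparing the texts character by character. The sketch below is only an illustration under assumptions: `keyword_coverage` is a hypothetical helper, and the two CV strings are made-up examples, not the real files.

```python
import re

def keyword_coverage(jd_text, cv_text):
    """Fraction of comma-separated JD keywords that occur in the CV text."""
    keywords = [k.strip().lower() for k in jd_text.split(",") if k.strip()]
    cv_lower = cv_text.lower()
    # Match whole words/phrases only, so "sap" does not match "sapphire"
    found = [k for k in keywords
             if re.search(r"\b" + re.escape(k) + r"\b", cv_lower)]
    return len(found) / len(keywords) if keywords else 0.0

jd = "project management, leadership, sales, SAP, marketing"

# Made-up CV snippets for illustration only
cv_a = "Experienced in project management and leadership; SAP sales and marketing background"
cv_b = "Worked on a research project in chemistry"

coverage_a = keyword_coverage(jd, cv_a)
coverage_b = keyword_coverage(jd, cv_b)
print(coverage_a, coverage_b)
```

Here "project management" is treated as one phrase, so a CV that only mentions "project" does not count as a match for it. Whether phrases or single words should be matched is a design choice that depends on how the JDs are written.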

Upvotes: 0
