Mobeus Zoom

Reputation: 608

Gensim for similarities

I have a pandas dataframe of organisation descriptions and project titles.

The columns are df['org_name'], df['org_description'] and df['proj_title']. I want to add a column with the similarity score between the organisation description and the project title for each project (each row).

I'm trying to use gensim: https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html. However, I'm not sure how to adapt the tutorial to my use case, because the tutorial takes a new query, doc = "Human computer interaction", and compares it against each document in the corpus individually. I'm not sure where that choice is made (sims? vec_lsi?).

I want the similarity score for just the two items in a given row of df, not for one of them against the whole corpus. I want this computed for each row and appended to df as a column. How can I do this?

Upvotes: 1

Views: 1154

Answers (1)

thorntonc

Reputation: 2126

Here is an adaptation of the Gensim LSI tutorial, where the description represents a corpus of sentences and the title is the query made against it.

from gensim.models import LsiModel
from collections import defaultdict
from gensim import corpora

def desc_title_sim(desc, title):
    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())  # add a longer stoplist here
    sents = desc.split('.')  # crude sentence tokenizer
    texts = [
        [word for word in sent.lower().split() if word not in stoplist]
        for sent in sents
    ]

    # remove words that appear only once
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    texts = [
        [token for token in text if frequency[token] > 1]
        for text in texts
    ]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)

    vec_bow = dictionary.doc2bow(title.lower().split())
    vec_lsi = lsi[vec_bow]  # convert the query to LSI space
    return vec_lsi

Apply the function row-wise to each description/title pair:

df['sim'] = df.apply(lambda row: desc_title_sim(row['org_description'], row['proj_title']), axis=1)

The newly created sim column will be populated with the title's representation in the LSI space, a list of (topic id, weight) pairs, with values like

[(0, 0.4618210045327158), (1, 0.07002766527900064)]
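
If a single number per row is wanted instead, one simple option is the cosine similarity between bag-of-words counts of the description and the title. The following is only a minimal sketch using the standard library, not part of the gensim tutorial: the names STOPLIST, tokenize and bow_cosine are hypothetical, the stoplist is the same short one used above, and every cell is assumed to be a string.

```python
import math
from collections import Counter

# Hypothetical stoplist, copied from the function above; extend as needed.
STOPLIST = frozenset("for a of the and to in".split())


def tokenize(text):
    """Lowercase, split on whitespace, and drop stoplist words."""
    return [word for word in text.lower().split() if word not in STOPLIST]


def bow_cosine(desc, title):
    """Cosine similarity between the word counts of two strings (0.0 to 1.0)."""
    desc_counts = Counter(tokenize(desc))
    title_counts = Counter(tokenize(title))
    # Dot product: only words present in both texts contribute.
    dot = sum(count * title_counts[word] for word, count in desc_counts.items())
    # Product of the Euclidean norms of the two count vectors.
    norm = math.sqrt(sum(c * c for c in desc_counts.values())) * math.sqrt(
        sum(c * c for c in title_counts.values())
    )
    # A text with no tokens has no direction, so report 0.0 for it.
    return dot / norm if norm else 0.0
```

With the dataframe from the question, this could be applied in the same way as before:

df['sim'] = df.apply(lambda row: bow_cosine(row['org_description'], row['proj_title']), axis=1)

Word counts only capture exact word overlap. The same cosine could in principle be taken between LSI vectors instead; gensim provides gensim.matutils.cossim for sparse (id, value) vectors of that kind.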

Upvotes: 1
