Reputation: 608
I have a dataframe in pandas of organisation descriptions and project titles, shown below:
The columns are df['org_name'], df['org_description'] and df['proj_title']. I want to add a column with the similarity score between the organisation description and the project title for each project (each row).
I'm trying to use gensim: https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html. However, I'm not sure how to adapt the tutorial for my use case, because the tutorial takes a new query, doc = "Human computer interaction", and compares it against each document in the corpus individually. I'm not sure where this choice is made (sims? vec_lsi?).
What I want is the similarity score between just the two items in a given row of the dataframe df, not one of them against the whole corpus. I want this computed for each row and then appended to df as a column. How can I do this?
Upvotes: 1
Views: 1154
Reputation: 2126
Here is an adaptation of the Gensim LSI tutorial, where the description represents a corpus of sentences and the title is the query made against it.
from collections import defaultdict

from gensim import corpora
from gensim.models import LsiModel


def desc_title_sim(desc, title):
    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())  # add a longer stoplist here
    sents = desc.split('.')  # crude sentence tokenizer
    texts = [
        [word for word in sent.lower().split() if word not in stoplist]
        for sent in sents
    ]

    # remove words that appear only once
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    texts = [
        [token for token in text if frequency[token] > 1]
        for text in texts
    ]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)

    vec_bow = dictionary.doc2bow(title.lower().split())
    vec_lsi = lsi[vec_bow]  # convert the query to LSI space
    return vec_lsi
Apply the function row-wise to get similarity:
df['sim'] = df.apply(lambda row: desc_title_sim(row['org_description'], row['proj_title']), axis=1)
The newly created sim column will be populated with values like
[(0, 0.4618210045327158), (1, 0.07002766527900064)]
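Note that these values are the title's coordinates in the 2-topic LSI space, not a single similarity score. If one number per row is wanted, one option is the cosine between two such sparse vectors: the title's and the description's. Below is a minimal sketch using only the standard library. The helper name sparse_cosine and the desc_vec values are made up for illustration and are not part of the answer above or of gensim.

```python
import math


def sparse_cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as [(id, weight), ...]."""
    a = dict(vec_a)
    b = dict(vec_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty or all-zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Title vector in the same format as the sim column above.
title_vec = [(0, 0.4618210045327158), (1, 0.07002766527900064)]
# Hypothetical description vector, invented for this example.
desc_vec = [(0, 0.9), (1, 0.1)]

score = sparse_cosine(title_vec, desc_vec)
print(score)
```

To use this with the function above, you would presumably also project the description's tokens through the same dictionary and lsi objects, then return the cosine of the two LSI vectors instead of vec_lsi. gensim's matutils.cossim computes the same quantity for vectors in this format, so the hand-written helper is only needed if you want to avoid the extra import.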
Upvotes: 1