OverflowingTheGlass

Reputation: 2434

Make Python Gensim Search Functions Efficient

I have a DataFrame that has a text column. I am splitting the DataFrame into two parts based on the value in another column. One part is indexed into a gensim similarity model; the other is then fed into the model to find the indexed text that is most similar. This involves a couple of search functions that enumerate over each item in the indexed part. With the toy data it is fast, but with my real data it is much too slow using apply. Here is the code example:

import pandas as pd
import gensim
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

d = {'number': [1,2,3,4,5], 'text': ['do you like python', 'do you hate python','do you like apples','who is nelson mandela','i am not interested'], 'answer':['no','yes','no','no','yes']}
df = pd.DataFrame(data=d)

df_yes = df[df['answer']=='yes']

df_no = df[df['answer']=='no']
df_no = df_no.reset_index()

docs = df_no['text'].tolist()
genDocs = [[w.lower() for w in word_tokenize(text)] for text in docs]
dictionary = gensim.corpora.Dictionary(genDocs)
corpus = [dictionary.doc2bow(genDoc) for genDoc in genDocs]
tfidf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

def search(row):
    query = [w.lower() for w in word_tokenize(row)]
    query_bag_of_words = dictionary.doc2bow(query)
    query_tfidf = tfidf[query_bag_of_words]
    return query_tfidf

def searchAll(row):
    max_similarity = max(sims[search(row)])
    index = [i for i, j in enumerate(sims[search(row)]) if j == max_similarity]
    return max_similarity, index

df_yes = df_yes.copy()

df_yes['max_similarity'], df_yes['index'] = zip(*df_yes['text'].apply(searchAll))

I have tried converting the operations to dask dataframes to no avail, as well as python multiprocessing. How would I make these functions more efficient? Is it possible to vectorize some/all of the functions?

Upvotes: 0

Views: 247

Answers (1)

gojomo

Reputation: 54233

Your code's intent and operation are very unclear. Assuming it works, explaining the ultimate goal, and showing more example data, example queries, and the desired results in your question could help.

Perhaps it could be improved by not repeating certain operations over and over. Some ideas:

  • only tokenize each row once, and cache the tokenization
  • only doc2bow() each row once, and cache the BOW representation
  • don't call sims[search(row)] twice inside searchAll()
  • don't iterate twice (once to find the max, then again to find the index) when a single pass will do

(More generally, though, efficient text keyword search often uses specialized reverse-indexes for efficiency, to avoid a costly iteration over every document.)

Upvotes: 1
