OverflowingTheGlass

Reputation: 2434

Make Python Gensim Search Functions Efficient

I have a DataFrame that has a text column. I am splitting the DataFrame into two parts based on the value in another column. One part is indexed into a gensim similarity model; the other is then fed into the model to find the indexed text that is most similar. This involves a couple of search functions that enumerate over each item in the indexed part. With the toy data it is fast, but with my real data it is much too slow using apply. Here is the code example:

import pandas as pd
import gensim
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

d = {'number': [1,2,3,4,5], 'text': ['do you like python', 'do you hate python','do you like apples','who is nelson mandela','i am not interested'], 'answer':['no','yes','no','no','yes']}
df = pd.DataFrame(data=d)

df_yes = df[df['answer']=='yes']

df_no = df[df['answer']=='no']
df_no = df_no.reset_index()

docs = df_no['text'].tolist()
genDocs = [[w.lower() for w in word_tokenize(text)] for text in docs]
dictionary = gensim.corpora.Dictionary(genDocs)
corpus = [dictionary.doc2bow(genDoc) for genDoc in genDocs]
tfidf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

def search(row):
    query = [w.lower() for w in word_tokenize(row)]
    query_bag_of_words = dictionary.doc2bow(query)
    query_tfidf = tfidf[query_bag_of_words]
    return query_tfidf

def searchAll(row):
    max_similarity = max(sims[search(row)])
    index = [i for i, j in enumerate(sims[search(row)]) if j == max_similarity]
    return max_similarity, index

df_yes = df_yes.copy()

df_yes['max_similarity'], df_yes['index'] = zip(*df_yes['text'].apply(searchAll))

I have tried converting the operations to dask dataframes to no avail, as well as python multiprocessing. How would I make these functions more efficient? Is it possible to vectorize some/all of the functions?

Upvotes: 0

Views: 247

Answers (1)

gojomo

Reputation: 54233

Your code's intent and operation are very unclear. Assuming it works, explaining the ultimate goal, and showing more example data, example queries, and the desired results in your question could help.

Perhaps it could be improved by not repeating certain operations over and over. Some ideas:

  • only tokenize each row once, and cache the tokenization
  • only doc2bow() each row once, and cache the BOW representation
  • don't call sims[search(row)] twice inside searchAll()
  • don't iterate twice (once to find the max, then again to find the index) when a single pass will do

(More generally, though, efficient text keyword search often uses specialized reverse-indexes for efficiency, to avoid a costly iteration over every document.)

Upvotes: 1
