Reputation: 2434
I have a DataFrame with a text column. I split the DataFrame into two parts based on the value in another column. One part is indexed into a gensim similarity model; the other part is then fed into the model to find the indexed text that is most similar. This involves a couple of search functions that enumerate over each item in the indexed part. With the toy data below it is fast, but with my real data it is much too slow using apply. Here is the code example:
import pandas as pd
import gensim
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

d = {'number': [1, 2, 3, 4, 5],
     'text': ['do you like python', 'do you hate python', 'do you like apples',
              'who is nelson mandela', 'i am not interested'],
     'answer': ['no', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(data=d)

# Split on the 'answer' column: the 'no' rows get indexed, the 'yes' rows become queries.
df_yes = df[df['answer'] == 'yes']
df_no = df[df['answer'] == 'no']
df_no = df_no.reset_index()

# Build the tf-idf similarity index over the 'no' rows.
docs = df_no['text'].tolist()
genDocs = [[w.lower() for w in word_tokenize(text)] for text in docs]
dictionary = gensim.corpora.Dictionary(genDocs)
corpus = [dictionary.doc2bow(genDoc) for genDoc in genDocs]
tfidf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

def search(row):
    # Tokenize the query text and convert it to a tf-idf vector.
    query = [w.lower() for w in word_tokenize(row)]
    query_bag_of_words = dictionary.doc2bow(query)
    query_tfidf = tfidf[query_bag_of_words]
    return query_tfidf

def searchAll(row):
    # Best similarity score and the position(s) of the indexed text(s) achieving it.
    max_similarity = max(sims[search(row)])
    index = [i for i, j in enumerate(sims[search(row)]) if j == max_similarity]
    return max_similarity, index

df_yes = df_yes.copy()
df_yes['max_similarity'], df_yes['index'] = zip(*df_yes['text'].apply(searchAll))
I have tried converting the operations to Dask DataFrames, as well as Python multiprocessing, to no avail. How would I make these functions more efficient? Is it possible to vectorize some or all of them?
Upvotes: 0
Views: 247
Reputation: 54233
Your code's intent and operation are very unclear. Assuming it works, explaining the ultimate goal, and showing more example data, more example queries, and the desired results in your question would help.
Perhaps it could be improved to not repeat certain operations over and over. Some ideas include:
- tokenize and doc2bow() each row only once, and cache the BOW representation
- don't evaluate sims[search(row)] twice inside searchAll(); compute it once and reuse the result (see the sketch below)
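For example, a minimal sketch of a revised searchAll() along those lines (reusing the question's search() and sims objects, plus the numpy that gensim already depends on; not tested against the real data):

import numpy as np

def searchAll(row):
    # Run the tokenize/doc2bow/tf-idf pipeline once per row...
    query_tfidf = search(row)
    # ...and query the similarity index only once, keeping the score array.
    similarities = sims[query_tfidf]  # 1-D numpy array, one score per indexed row
    max_similarity = similarities.max()
    # Positions of all indexed documents tied for the best score.
    index = np.flatnonzero(similarities == max_similarity).tolist()
    return max_similarity, index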
(More generally, though, efficient text keyword search often uses specialized reverse-indexes, which avoid a costly iteration over every document.)
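Purely as an illustration (this is not part of the question's code), a toy reverse-index over the already-tokenized genDocs could map each word to the rows that contain it, so a query only needs to score the candidate rows that share at least one token:

from collections import defaultdict

# Map each token to the set of df_no row positions that contain it.
inverted_index = defaultdict(set)
for doc_id, tokens in enumerate(genDocs):
    for token in tokens:
        inverted_index[token].add(doc_id)

def candidate_docs(query_text):
    # Only rows sharing at least one token with the query are candidates.
    query_tokens = [w.lower() for w in word_tokenize(query_text)]
    candidates = set()
    for token in query_tokens:
        candidates |= inverted_index.get(token, set())
    return candidates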
Upvotes: 1