Reputation: 65
I have a dataset consisting of large strings (text extracted from ~300 pptx files). Using pandas apply, I execute an "average" function on each string: for every word it looks up the corresponding word vector, compares it against a search vector, and returns the average cosine similarity.
However, iterating over the strings and applying the function takes a lot of time, and I was wondering what approaches I could take to speed up the following code:
from scipy import spatial

# retrieve the word vector for a word from the words DataFrame
def vec(w):
    return words.at[w]

# cosine similarity between two vectors (1 - cosine distance)
def cosine_dist(a, b):
    codi = 1 - spatial.distance.cosine(a, b)
    return codi

# average cosine similarity between every word of a string and a given word vector
v_search = vec("test")

def Average(v_search, tobe_parsed):
    word_total = 0
    mean = 0
    for word in tobe_parsed.split():
        try:  # word exists in the vocabulary
            cd = cosine_dist(vec(word), v_search)
            mean += cd
            word_total += 1
        except KeyError:  # word does not exist
            pass
    average = mean / word_total
    return average

df['average'] = df['text'].apply(lambda x: Average(v_search, x))
I've been looking into alternative ways of writing the code (e.g. df.loc -> df.at), Cython, and multithreading, but my time is limited, so I don't want to waste too much of it on a less effective approach.
Thanks in advance
Upvotes: 1
Views: 141
Reputation: 65
Thanks a lot vumaasha! That was indeed the way to go (speed increase from ~15 min to ~7 sec! :o)
Basically, the code has been rewritten to:
import numpy as np

def Average(v_search, text):
    # look up all word vectors at once and average the similarities in one go
    wordvec_matrix = words.loc[text.split()]
    return np.sum(cos_cdist(wordvec_matrix, v_search)) / wordvec_matrix.shape[0]

df['average'] = df['text'].apply(lambda x: Average(v_search, x))
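The cos_cdist helper is not shown above; a minimal sketch, assuming it simply wraps scipy.spatial.distance.cdist with the cosine metric to compare every row of the matrix against the search vector in one vectorized call, could look like this:

from scipy import spatial

# hypothetical helper: cosine similarity between every row of a matrix
# and a single query vector, computed with one vectorized cdist call
def cos_cdist(matrix, vector):
    v = np.asarray(vector).reshape(1, -1)
    return 1 - spatial.distance.cdist(matrix, v, metric='cosine').reshape(-1)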
Upvotes: 1
Reputation: 2845
You need to leverage vectorization and NumPy broadcasting. Make pandas return the list of word indices, use them to index the vocabulary array and create a matrix of word vectors (one row per word), then use broadcasting to compute the cosine distances and take their mean.
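A minimal sketch of that idea, reusing the words DataFrame and v_search vector from the question (average_similarity is a hypothetical name, and unknown words are simply dropped):

import numpy as np

def average_similarity(text, words, v_search):
    # keep only words that exist in the vocabulary index
    tokens = [w for w in text.split() if w in words.index]
    matrix = words.loc[tokens].to_numpy()   # one row per word
    v = np.asarray(v_search)
    # cosine similarity of every row against v_search, computed via broadcasting
    sims = matrix @ v / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(v))
    return sims.mean()

df['average'] = df['text'].apply(lambda x: average_similarity(x, words, v_search))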
Upvotes: 2