Patty

Reputation: 65

Speeding up Python NLP text parsing

I have a dataset consisting of large strings (text extracted from ~300 pptx files). Using pandas `apply`, I execute an "average" function on each string: it looks up the corresponding word vector for every word, compares it with a search vector, and returns the average similarity.

However, iterating over and applying the function to these large strings takes a lot of time, and I was wondering what approaches I could take to speed up the following code:

from scipy import spatial

#retrieve word vector from words df
def vec(w):
    return words.at[w]

#calculates the cosine similarity (1 - cosine distance) between two vectors
def cosine_dist(a, b):
    codi = 1 - spatial.distance.cosine(a, b)
    return codi

#calculate the average cosine similarity of the whole string and a given word vector
v_search = vec("test")

def Average(v_search, tobe_parsed):
    word_total = 0
    mean = 0
    for word in tobe_parsed.split():
        try:  # word exists
            cd = cosine_dist(vec(word), v_search)
            mean += cd
            word_total += 1
        except KeyError:  # word does not exist
            pass

    average = mean / word_total
    return average

df['average'] = df['text'].apply(lambda x: Average(v_search, x))

I've been looking into alternative ways of writing the code (e.g. `df.loc` -> `df.at`), Cython, and multithreading, but my time is limited, so I don't want to waste too much of it on a less effective approach.

Thanks in advance

Upvotes: 1

Views: 141

Answers (2)

Patty

Reputation: 65

Thanks a lot, vumaasha! That was indeed the way to go (speed increase from ~15 min to ~7 sec! :o)

Basically, the code has been rewritten to:

# cos_cdist computes the cosine distance of every row of the matrix against v_search
def Average(v_search, text):
    wordvec_matrix = words.loc[text.split()]
    return np.sum(cos_cdist(wordvec_matrix, v_search)) / wordvec_matrix.shape[0]

df['average'] = df['text'].apply(lambda x: Average(v_search, x))

Upvotes: 1

vumaasha

Reputation: 2845

You need to leverage vectorization and NumPy broadcasting. Have pandas return the list of word indices, use them to index the vocabulary array and build a matrix of word vectors (one row per word), then use broadcasting to compute the cosine distances and take their mean.
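A minimal sketch of this idea with a toy vocabulary (the `vocab`/`vectors` names, the 3-d vectors, and `average_similarity` are made up for illustration; in the question, `words.loc[...]` plays the role of the index lookup):

```python
import numpy as np

# Hypothetical toy vocabulary: 4 words, each with a 3-dimensional vector.
vocab = {"cat": 0, "dog": 1, "fish": 2, "test": 3}
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

def average_similarity(text, v_search):
    # Map known words to row indices, skipping out-of-vocabulary words.
    idx = [vocab[w] for w in text.split() if w in vocab]
    matrix = vectors[idx]                      # one row per word in the text
    # Broadcasting: dot every row with v_search at once, then normalize
    # by the row norms and the search-vector norm.
    sims = matrix @ v_search / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(v_search)
    )
    return sims.mean()

v_search = vectors[vocab["test"]]
average_similarity("cat dog unknown", v_search)  # mean of 1/sqrt(2) and 0
```

One matrix-vector product replaces the per-word Python loop, which is where the speedup comes from.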

Upvotes: 2
