Patty

Reputation: 65

Speeding up Python NLP text parsing

I have a dataset consisting of large strings (text extracted from ~300 pptx files). Using pandas `apply`, I execute an "average" function on each string: it looks up the corresponding word vector for every word, compares it with a search vector, and returns the average similarity.

However, iterating over and applying the function to these large strings takes a lot of time, and I was wondering what approaches I could take to speed up the following code:

from scipy import spatial

#retrieve word vector from words df
def vec(w):
    return words.at[w]

#calculates the cosine similarity (1 - cosine distance) between two vectors
def cosine_dist(a, b):
    codi = 1 - spatial.distance.cosine(a, b)
    return codi

#calculate the average cosine similarity of the whole string and a given word vector
v_search = vec("test")

def Average(v_search, tobe_parsed):
    word_total = 0
    mean = 0
    for word in tobe_parsed.split():
        try:  # word exists
            cd = cosine_dist(vec(word), v_search)
            mean += cd
            word_total += 1
        except KeyError:  # word does not exist
            pass

    average = mean / word_total
    return average

df['average'] = df['text'].apply(lambda x: Average(v_search, x))

I've been looking into alternative ways of writing the code (e.g. `df.loc` -> `df.at`), Cython, and multithreading, but my time is limited, so I don't want to waste too much of it on a less effective approach.

Thanks in advance

Upvotes: 1

Views: 141

Answers (2)

Patty

Reputation: 65

Thanks a lot, vumaasha! That was indeed the way to go (speed increase from ~15 min to ~7 sec! :o)

Basically, the code has been rewritten to:

# cos_cdist computes the cosine distance of every row of the matrix against v_search
def Average(v_search, text):
    wordvec_matrix = words.loc[text.split()]
    return np.sum(cos_cdist(wordvec_matrix, v_search)) / wordvec_matrix.shape[0]

df['average'] = df['text'].apply(lambda x: Average(v_search, x))

Upvotes: 1

vumaasha

Reputation: 2845

You need to leverage vectorization and NumPy broadcasting. Have pandas return the list of word indices, use them to index the vocabulary array and build a matrix of word vectors (one row per word), then use broadcasting to compute the cosine distances and take their mean.
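A minimal sketch of this idea with a toy vocabulary (the `vocab`/`vectors` names, the 3-d vectors, and `average_similarity` are made up for illustration; in the question, `words.loc[...]` plays the role of the index lookup):

```python
import numpy as np

# Hypothetical toy vocabulary: 4 words, each with a 3-dimensional vector.
vocab = {"cat": 0, "dog": 1, "fish": 2, "test": 3}
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

def average_similarity(text, v_search):
    # Map known words to row indices, skipping out-of-vocabulary words.
    idx = [vocab[w] for w in text.split() if w in vocab]
    matrix = vectors[idx]                      # one row per word in the text
    # Broadcasting: dot every row with v_search at once, then normalize
    # by the row norms and the search-vector norm.
    sims = matrix @ v_search / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(v_search)
    )
    return sims.mean()

v_search = vectors[vocab["test"]]
average_similarity("cat dog unknown", v_search)  # mean of 1/sqrt(2) and 0
```

One matrix-vector product replaces the per-word Python loop, which is where the speedup comes from.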

Upvotes: 2
