Strayhorn

Reputation: 729

Document similarity runtime is excessive using spaCy

I have written a function in Python to compute the similarity between PDF pages and return the mapping of most similar pages.

The function takes as input the file and a list of dictionaries of the form: thumbnail=[{'page': 1, 'text': 'strin1'}, {'page': 2, 'text': 'strin2'}, ...]

The function:

import PyPDF2
import spacy

filename2 = "file.pdf"
nlp = spacy.load('en_core_web_lg')


def checker(filename2, thumbnail):
    reader = PyPDF2.PdfFileReader(filename2)
    NumPages = reader.getNumPages()

    # translation table that replaces punctuation with spaces
    specialCharacters = {ord(c): " " for c in "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"}

    # extract the text of each page and compare it against every thumbnail entry
    output = []
    for i in range(NumPages):
        Text = reader.getPage(i).extractText().translate(specialCharacters)
        Text = Text.replace('\n', '')

        for item in thumbnail:
            sim = nlp(Text).similarity(nlp(item['text']))
            if sim > 0.98:
                # build a fresh dict per match so earlier matches are not overwritten
                output.append({'page_thumbnail': item['page'],
                               'page_file': i + 1,
                               'sim': sim})
    return output

This is taking a really long time for a PDF with 38 pages matched against a list of 38 entries using spaCy. Any suggestions on how to make this scalable? Also, the primary goal is to return, for each page of the document (i), the thumbnail page (item['page']) with the highest similarity score; a rough sketch of that part is below.
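For that second point, what I have in mind looks roughly like the following (reusing Text, nlp, thumbnail, i and output from the function above; the exact structure is only illustrative):

page_doc = nlp(Text)
# score the current page against every thumbnail and keep only the best one
scores = [(item['page'], page_doc.similarity(nlp(item['text']))) for item in thumbnail]
best_page, best_sim = max(scores, key=lambda pair: pair[1])
output.append({'page_thumbnail': best_page, 'page_file': i + 1, 'sim': best_sim})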

Upvotes: 1

Views: 151

Answers (1)

polm23

Reputation: 15633

You are calling nlp too much, specifically NumPages * len(thumbnail) times. Every call is expensive, so compute things once up front instead of repeating them inside the loop.

Do this:

# do this once, right at the start of your function
tdocs = [nlp(ii['text']) for ii in thumbnail]

# ... later on, inside the page loop ...

Text = Text.replace('\n', '')
doc = nlp(Text)

for item, tdoc in zip(thumbnail, tdocs):
    sim = doc.similarity(tdoc)

That should make it much faster. If that's still not fast enough, you should pre-compute the vectors and stash them in something like Annoy so you can do approximate lookups.
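In case it's useful, here is a minimal sketch of that last idea with Annoy; it assumes the same nlp model and thumbnail list as above, and the variable names are only illustrative:

from annoy import AnnoyIndex

dim = nlp.vocab.vectors_length              # 300 for en_core_web_lg
index = AnnoyIndex(dim, 'angular')          # angular distance relates to cosine similarity

# index each thumbnail's document vector once, up front
for idx, item in enumerate(thumbnail):
    index.add_item(idx, nlp(item['text']).vector)
index.build(10)                             # 10 trees; more trees improve recall, slow the build

# later, for each PDF page, look up the approximately nearest thumbnail
doc = nlp(Text)
nearest, dists = index.get_nns_by_vector(doc.vector, 1, include_distances=True)
best_item = thumbnail[nearest[0]]
# Annoy's angular distance d maps back to cosine similarity as sim = 1 - d**2 / 2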

Upvotes: 2
