Reputation: 729
I have written a function in Python to compute the similarity between PDF pages and return the most similar page mapping.
The function takes as input the file and a list of dictionary entries of the form: thumbnail=[{'page': 1, 'text': 'strin1'}, {'page': 2, 'text': 'strin2'}, ...]
The function:
import PyPDF2
import spacy

filename2 = "file.pdf"
nlp = spacy.load('en_core_web_lg')

def checker(filename2, thumbnail):
    object = PyPDF2.PdfFileReader(filename2)
    NumPages = object.getNumPages()
    specialCharacters = {ord(c): " " for c in "!@#$%^&*()[]{};:,./<>?\|`~-=_+"}
    # extract text and do the search
    output = []
    for i in range(0, NumPages):
        temp_dict = {}
        Text = object.getPage(i).extractText().translate(specialCharacters)
        Text = Text.replace('\n', '')
        for item in thumbnail:
            sim = nlp(Text).similarity(nlp(item['text']))
            if sim > 0.98:
                temp_dict['page_thumbnail'] = item['page']
                temp_dict['page_file'] = i + 1
                temp_dict['sim'] = sim
                output.append(temp_dict)
    return output
This is taking a really long time for a PDF with 38 pages matched against a list of 38 entries using spaCy. Any suggestion on how to make it scalable? The primary goal is to return, for each page of the document (i), the thumbnail page (item['page']) with the highest similarity score.
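To be concrete, I want one entry per document page containing the best-matching thumbnail page, for example (the numbers here are made up):

[{'page_file': 1, 'page_thumbnail': 4, 'sim': 0.99},
 {'page_file': 2, 'page_thumbnail': 1, 'sim': 0.98},
 ...]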
Upvotes: 1
Views: 151
Reputation: 15633
You are calling nlp too much, specifically NumPages * len(thumbnail) times. Every call is expensive. You need to call stuff up front so you don't call it repeatedly.
Do this:
# do this right at the start of your function
tdocs = [nlp(ii['text']) for ii in thumbnail]
# ... later on ...
Text=Text.replace('\n','')
doc = nlp(Text)
for item, tdoc in zip(thumbnail, tdocs):
    sim = doc.similarity(tdoc)
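Pulled together, the whole function might look something like this. This is just a sketch: it keeps the PyPDF2 1.x/2.x calls from the question and, instead of the fixed 0.98 threshold, keeps the single best-matching thumbnail per page, which is what the question says it ultimately wants.

import PyPDF2
import spacy

nlp = spacy.load('en_core_web_lg')

def checker(filename2, thumbnail):
    reader = PyPDF2.PdfFileReader(filename2)   # PyPDF2 1.x/2.x API, as in the question
    special = {ord(c): " " for c in "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"}

    # parse every thumbnail text exactly once
    tdocs = [nlp(item['text']) for item in thumbnail]

    output = []
    for i in range(reader.getNumPages()):
        text = reader.getPage(i).extractText().translate(special).replace('\n', '')
        doc = nlp(text)                        # parse each page exactly once

        # keep the thumbnail with the highest similarity for this page
        best_item, best_sim = None, -1.0
        for item, tdoc in zip(thumbnail, tdocs):
            sim = doc.similarity(tdoc)
            if sim > best_sim:
                best_item, best_sim = item, sim

        output.append({'page_file': i + 1,
                       'page_thumbnail': best_item['page'],
                       'sim': best_sim})
    return output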
That should make it much faster. If that's still not fast enough, you should pre-compute the vectors and stash them in something like Annoy so you can do approximate lookups.
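If you go that route, here is a rough sketch of the Annoy idea. It assumes the annoy package is installed (pip install annoy); the number of trees and the angular metric are arbitrary choices, not anything from the question.

from annoy import AnnoyIndex

# build the index once from the pre-computed thumbnail vectors
tdocs = [nlp(item['text']) for item in thumbnail]
dim = len(tdocs[0].vector)              # 300 for en_core_web_lg
index = AnnoyIndex(dim, 'angular')      # angular distance ~ cosine similarity
for j, tdoc in enumerate(tdocs):
    index.add_item(j, tdoc.vector)
index.build(10)                         # 10 trees, chosen arbitrarily

# then, for each page, ask the index for the nearest thumbnail
page_vec = nlp(Text).vector
nearest_j, = index.get_nns_by_vector(page_vec, 1)
best_match = thumbnail[nearest_j]       # {'page': ..., 'text': ...}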
Upvotes: 2