Reputation: 729
I have written a function in Python to compute the similarity between PDF pages and return the most similar page mapping.
The function takes as input the file and a list of dictionary entries of the form: thumbnail=[{'page': 1, 'text': 'strin1'}, {'page': 2, 'text': 'strin2'}, ...]
The function:
import PyPDF2
import spacy

filename2 = "file.pdf"
nlp = spacy.load('en_core_web_lg')

def checker(filename2, thumbnail):
    object = PyPDF2.PdfFileReader(filename2)
    NumPages = object.getNumPages()
    specialCharacters = {ord(c): " " for c in "!@#$%^&*()[]{};:,./<>?\|`~-=_+"}
    # extract text and do the search
    output = []
    for i in range(0, NumPages):
        temp_dict = {}
        Text = object.getPage(i).extractText().translate(specialCharacters)
        Text = Text.replace('\n', '')
        for item in thumbnail:
            sim = nlp(Text).similarity(nlp(item['text']))
            if sim > 0.98:
                temp_dict['page_thumbnail'] = item['page']
                temp_dict['page_file'] = i + 1
                temp_dict['sim'] = sim
                output.append(temp_dict)
    return output
This is taking a really long time for a PDF with 38 pages matched against a list of 38 entries using spaCy. Any suggestion on how to make it scalable? The primary goal is to return, for each page of the document (i), the thumbnail page (item['page']) with the highest similarity score.
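To be concrete, I want one entry per document page containing the best-matching thumbnail page, for example (the numbers here are made up):

[{'page_file': 1, 'page_thumbnail': 4, 'sim': 0.99},
 {'page_file': 2, 'page_thumbnail': 1, 'sim': 0.98},
 ...]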
Upvotes: 1
Views: 151
Reputation: 15633
You are calling nlp too much, specifically NumPages * len(thumbnail) times. Every call is expensive. You need to call stuff up front so you don't call it repeatedly.
Do this:
# do this right at the start of your function
tdocs = [nlp(ii['text']) for ii in thumbnail]
# ... later on ...
Text=Text.replace('\n','')
doc = nlp(Text)
for item, tdoc in zip(thumbnail, tdocs):
    sim = doc.similarity(tdoc)
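Pulled together, the whole function might look something like this. This is just a sketch: it keeps the PyPDF2 1.x/2.x calls from the question and, instead of the fixed 0.98 threshold, keeps the single best-matching thumbnail per page, which is what the question says it ultimately wants.

import PyPDF2
import spacy

nlp = spacy.load('en_core_web_lg')

def checker(filename2, thumbnail):
    reader = PyPDF2.PdfFileReader(filename2)   # PyPDF2 1.x/2.x API, as in the question
    special = {ord(c): " " for c in "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"}

    # parse every thumbnail text exactly once
    tdocs = [nlp(item['text']) for item in thumbnail]

    output = []
    for i in range(reader.getNumPages()):
        text = reader.getPage(i).extractText().translate(special).replace('\n', '')
        doc = nlp(text)                        # parse each page exactly once

        # keep the thumbnail with the highest similarity for this page
        best_item, best_sim = None, -1.0
        for item, tdoc in zip(thumbnail, tdocs):
            sim = doc.similarity(tdoc)
            if sim > best_sim:
                best_item, best_sim = item, sim

        output.append({'page_file': i + 1,
                       'page_thumbnail': best_item['page'],
                       'sim': best_sim})
    return output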
That should make it much faster. If that's still not fast enough, you should pre-compute the vectors and stash them in something like Annoy so you can do approximate lookups.
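If you go that route, here is a rough sketch of the Annoy idea. It assumes the annoy package is installed (pip install annoy); the number of trees and the angular metric are arbitrary choices, not anything from the question.

from annoy import AnnoyIndex

# build the index once from the pre-computed thumbnail vectors
tdocs = [nlp(item['text']) for item in thumbnail]
dim = len(tdocs[0].vector)              # 300 for en_core_web_lg
index = AnnoyIndex(dim, 'angular')      # angular distance ~ cosine similarity
for j, tdoc in enumerate(tdocs):
    index.add_item(j, tdoc.vector)
index.build(10)                         # 10 trees, chosen arbitrarily

# then, for each page, ask the index for the nearest thumbnail
page_vec = nlp(Text).vector
nearest_j, = index.get_nns_by_vector(page_vec, 1)
best_match = thumbnail[nearest_j]       # {'page': ..., 'text': ...}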
Upvotes: 2