Reputation: 351
I wrote the following Python code to count keywords in a PDF file, but the count differs from what web browsers report. Below, the code counts instances of the word "Windows" in Microsoft's 10-Q report of April 2020 (retrieved from: https://www.microsoft.com/en-us/Investor/sec-filings.aspx):
import PyPDF2
filepath = "Microsoft - 10-Q.pdf"
pdf_file = open(filepath, 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
total_number_of_keywords = 0
for page in range(number_of_pages):
    read_page = read_pdf.getPage(page)
    page_content = read_page.extractText()
    counted_keywords_per_page = page_content.count('windows')
    total_number_of_keywords += counted_keywords_per_page
print(total_number_of_keywords)
The code outputs 0 as the number of times "windows" was mentioned. Yet both Microsoft Edge and Google Chrome find 60 instances of the word "windows".
Why?
Upvotes: 0
Views: 608
Reputation: 36
The 'page_content' is empty. This is an open issue in PyPDF2.
You can use another PDF processing package, such as PyMuPDF (imported as the fitz module):
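import re
import fitz  # PyMuPDF
filepath = "Microsoft - 10-Q.pdf"
pdf_file = fitz.open(filepath)
# page_count and get_text() are the current names; older PyMuPDF releases used pageCount and getText()
pdf_pages = pdf_file.page_count
full_text = []
for page in pdf_file.pages(0, pdf_pages, 1):
    # Collect the plain text of each page
    text = str(page.get_text())
    full_text.append(text)
full_text = "".join(full_text)
print(len(re.findall('Windows', full_text)))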
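Separately from the empty-text issue, note that str.count and re.findall are case-sensitive by default, so searching for 'windows' and searching for 'Windows' can give different totals, whereas a browser's find-in-page is typically case-insensitive. The snippet below is a minimal sketch of a case-insensitive count on already extracted text; count_keyword and the sample string are hypothetical names made up for illustration, not part of PyPDF2 or PyMuPDF.

```python
import re


def count_keyword(text: str, keyword: str, ignore_case: bool = True) -> int:
    """Count non-overlapping occurrences of keyword in text.

    Hypothetical helper for illustration; re.escape makes the keyword
    match literally rather than as a regular expression.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(re.escape(keyword), text, flags))


# Made-up sample text standing in for the text extracted from the PDF
sample = "Windows 10 and windows server; WINDOWS everywhere."

print(count_keyword(sample, "windows"))                     # 3
print(count_keyword(sample, "windows", ignore_case=False))  # 1
```

With the real document, full_text from the loop above could be passed in place of sample. The total may still differ from a browser's count if the extracted text splits words across lines or pages.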
Upvotes: 1