Wrong word count in page_content.count (PyPDF2)

Question

I wrote the following Python code to count keywords in a PDF file, but the count differs from what web browsers report. Below, the code counts instances of the word "Windows" in Microsoft's 10-Q report of April 2020 (retrieved from: https://www.microsoft.com/en-us/Investor/sec-filings.aspx).

import PyPDF2

filepath = "Microsoft - 10-Q.pdf"
total_number_of_keywords = 0

# Open the PDF in binary mode; the file is closed when the block exits
with open(filepath, 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    # Extract each page's text and count the keyword on that page
    for page in range(number_of_pages):
        read_page = read_pdf.getPage(page)
        page_content = read_page.extractText()
        counted_keywords_per_page = page_content.count('windows')
        total_number_of_keywords += counted_keywords_per_page

print(total_number_of_keywords)

The code prints 0 as the number of times "windows" appears, yet the search in both Microsoft Edge and Google Chrome finds 60 instances of the word "windows".

Why do the two counts differ?
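
One possible explanation, assuming the extracted text spells the word with a capital letter ("Windows"): `str.count` is case-sensitive, whereas the find feature in browsers ignores case by default. The sketch below illustrates the difference with a hypothetical helper, `count_keyword`, and a made-up sample string rather than the actual 10-Q text. Text extraction can also split or merge words, so the totals may still not match a browser exactly.

```python
def count_keyword(text: str, keyword: str) -> int:
    """Count non-overlapping occurrences of keyword in text, ignoring case.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    return text.lower().count(keyword.lower())


# Made-up sample text standing in for one extracted page
sample = "Windows revenue grew. Windows OEM and windows Commercial."

print(sample.count("windows"))           # case-sensitive count: 1
print(count_keyword(sample, "windows"))  # case-insensitive count: 3
```

In the loop from the question, this would mean calling `count_keyword(page_content, 'windows')` in place of `page_content.count('windows')`.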
