Incognito

Reputation: 351

Wrong word count in page_content.count (PyPDF2)

I wrote the following Python code to count keywords in a PDF file, but the count differs from what web browsers report. Below, the code counts instances of the word "Windows" in Microsoft's 10-Q report of April 2020 (retrieved from: https://www.microsoft.com/en-us/Investor/sec-filings.aspx).

import PyPDF2
filepath = "Microsoft - 10-Q.pdf"
pdf_file = open(filepath, 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
total_number_of_keywords = 0
for page in range(number_of_pages):
    read_page = read_pdf.getPage(page)
    page_content = read_page.extractText()
    counted_keywords_per_page = page_content.count('windows')
    total_number_of_keywords += counted_keywords_per_page
print(total_number_of_keywords)

The code outputs 0 as the number of times "windows" is mentioned. Yet both Microsoft Edge and Google Chrome find 60 instances of the word "windows" in the same file.

Why?

Upvotes: 0

Views: 608

Answers (1)

pitter-patter

Reputation: 36

The 'page_content' string is empty for this file, so there is nothing to count. This is an open issue in PyPDF2.

You can use another PDF processing package, such as PyMuPDF (which is imported as the fitz module):

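import re
import fitz  # PyMuPDF

filepath = "Microsoft - 10-Q.pdf"
pdf_file = fitz.open(filepath)
# Note: newer PyMuPDF releases use snake_case names (page_count, get_text()).
pdf_pages = pdf_file.pageCount

# Collect the text of every page, then join it into one string
full_text = []
for page in pdf_file.pages(0, pdf_pages, 1):
    text = str(page.getText())
    full_text.append(text)
full_text = "".join(full_text)

# Case-sensitive count of "Windows" across the whole document
print(len(re.findall('Windows', full_text)))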
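Separately from the extraction problem, note that both str.count and re.findall are case-sensitive, whereas a browser's find-in-page is usually case-insensitive by default (an assumption about the browser settings used in the question). The question's code searches for 'windows' in lowercase, which would miss "Windows" even with working text extraction. A minimal sketch of a case-insensitive count follows; count_keyword and the sample string are hypothetical, for illustration only, and are not taken from the report:

```python
import re


def count_keyword(text: str, keyword: str) -> int:
    """Count non-overlapping, case-insensitive occurrences of keyword in text."""
    # re.escape treats the keyword literally, so characters like "." are not wildcards
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return len(pattern.findall(text))


# Hypothetical sample text, standing in for the extracted PDF text
sample = "Windows 10 and windows Server; WINDOWS revenue grew."
print(count_keyword(sample, "windows"))  # 3 (matches any capitalization)
print(sample.count("windows"))           # 1 (exact lowercase match only)
```

Applied to the code above, this would be count_keyword(full_text, "windows"). The result can still differ from the browser's number if the extracted text splits a word across lines or pages.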

Upvotes: 1
