Extract PDF Pages based on Header text in Python

Question

I have the annual report PDF of 'Asian Paints Ltd' and want to extract the 'Consolidated Balance Sheet' page (the 216th page of the PDF). Using PyPDF2, I wrote a function that extracts the text of each page, searches for the key term 'Consolidated Balance Sheet', and returns the number of the first page where it is found.

The function returns the first match, but I want it to recognize the one page where 'Consolidated Balance Sheet' appears as a header above the required table (the 216th page in this PDF).

Here is my code:

import re

import PyPDF2


def extract_page_num(keyTerm):
    # Open the PDF (raw string so the backslash in the Windows path is kept literally)
    reader = PyPDF2.PdfFileReader(r"D:\AR_18126_ASIANPAINT_2020_2021_07062021194954.pdf")

    # Get the number of pages
    NumPages = reader.getNumPages()

    # Return the 0-based index of the first page whose text contains keyTerm
    for i in range(NumPages):
        PageObj = reader.getPage(i)
        Text = PageObj.extractText()
        # Replace '˜', which stands in for 'fi' in the extracted text
        Text = Text.replace('˜', 'fi')
        if re.search(keyTerm, Text):
            return i

    # keyTerm was not found on any page
    return None


bs_no = extract_page_num('Consolidated Balance Sheet')

Here is the link to the annual report: https://www.bseindia.com/bseplus/AnnualReport/500820/68521500820.pdf

Thank you in advance for taking out time to solve my query!

Answers (1)
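
One possible approach, sketched under assumptions rather than tested against this report: treat the key term as a header only when it occurs near the start of a page's extracted text, and additionally require words that should only appear in the table itself. The helper name `find_header_page`, the `head_chars` cutoff of 300 characters, and the marker 'Total Assets' are all hypothetical choices for illustration. The sample page strings are made up, not taken from the PDF.

```python
import re


def find_header_page(page_texts, key_term, head_chars=300, required_terms=()):
    """Return the 0-based index of the first page that looks like a header page.

    A page qualifies when key_term occurs within its first head_chars
    characters and every string in required_terms occurs somewhere on it.
    Returns None when no page qualifies.
    """
    # Allow any whitespace (including line breaks) between the words of the term.
    words = [re.escape(word) for word in key_term.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)

    for index, text in enumerate(page_texts):
        # Header heuristic (an assumption): the term sits near the top of the page.
        if not pattern.search(text[:head_chars]):
            continue
        # Table heuristic (an assumption): all marker strings occur on the page.
        lowered = text.lower()
        if all(term.lower() in lowered for term in required_terms):
            return index
    return None


# Made-up page texts standing in for the per-page extracted text.
pages = [
    "Contents\nConsolidated Balance Sheet ... 214\nNotes ... 220",
    "Notes to the accounts\n" + "x " * 200 + "see the Consolidated Balance Sheet",
    "Consolidated Balance Sheet\nas at 31st March, 2021\n"
    "Total Assets 100\nTotal Equity and Liabilities 100",
]

print(find_header_page(pages, "Consolidated Balance Sheet",
                       required_terms=("Total Assets",)))
```

To use this with the code in the question, collect the `Text` value of every page into a list inside the existing loop and pass that list as `page_texts`. Note that text extraction does not always follow the visual reading order, so the header may not come first in the extracted text; `head_chars` and the marker strings would likely need tuning for this particular report.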
