Vansh

Reputation: 53

Extract PDF Pages based on Header text in Python

I have an annual report PDF for 'Asian Paints Ltd'. I want to extract the 'Consolidated Balance Sheet' page (the 216th page in the PDF). I've used PyPDF2 to write a function that extracts the text of each page, searches it for the key term 'Consolidated Balance Sheet', and returns the number of the page where the term is found.

However, I want the function to return only the one page where 'Consolidated Balance Sheet' appears as a header above the required table (the 216th page in this PDF).

Here is my code:

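import PyPDF2
import re

def extract_page_num(keyTerm):
    # Open the PDF (raw string so the backslash in the Windows path stays literal).
    # Note: PdfFileReader, getNumPages, getPage and extractText are the legacy
    # PyPDF2 names; PyPDF2 3.x renamed them to PdfReader, len(reader.pages),
    # reader.pages[i] and extract_text().
    reader = PyPDF2.PdfFileReader(r"D:\AR_18126_ASIANPAINT_2020_2021_07062021194954.pdf")

    # Get the number of pages
    num_pages = reader.getNumPages()

    # Return the 0-based index of the first page whose text matches keyTerm
    for i in range(num_pages):
        page = reader.getPage(i)
        text = page.extractText()
        # Replace '˜' with 'fi' (the 'fi' ligature appears to be extracted as '˜')
        text = text.replace('˜', 'fi')
        if re.search(keyTerm, text):
            return i

    # No page matched
    return None

bs_no = extract_page_num('Consolidated Balance Sheet')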

Here is the link to the annual report: https://www.bseindia.com/bseplus/AnnualReport/500820/68521500820.pdf

Thank you in advance for taking out time to solve my query!

Upvotes: 3

Views: 7995

Answers (1)

moken

Reputation: 6639

Option 1: Without going as far as extracting formatting information, simply making your search pattern more specific may help. For example, if you look at the extracted text for the page, you can see that the term appears near the start, preceded by a page number and followed by '\nas at' and a date. I found the correct page just by changing the search to:

bs_no = extract_page_num('Consolidated Balance Sheet\nas at')

However, that may not be exact enough for all cases.
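
If the line break or spacing between the heading and 'as at' varies between reports, a slightly more tolerant version of the same idea might look like the sketch below. It works on a list of already extracted page texts, so it does not depend on any particular PDF library. The function name find_header_pages and the parameters follow and max_offset are hypothetical, and the 200-character cutoff is only an assumption standing in for "near the start" of the page text.

```python
import re

def find_header_pages(page_texts, key_term, follow="as at", max_offset=200):
    """Return 0-based indexes of pages where key_term, followed by `follow`,
    appears within the first `max_offset` characters of the page text."""
    # Allow any run of whitespace (spaces or newlines) between the words.
    words = key_term.split() + follow.split()
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words))

    hits = []
    for index, text in enumerate(page_texts):
        match = pattern.search(text)
        # Accept the page only if the match starts near the top of the text.
        if match and match.start() <= max_offset:
            hits.append(index)
    return hits
```

Pages that only mention the term in running text, or that mention it far down the page, are skipped. Whether the heading really comes first in the extracted text depends on the extractor, so the cutoff may need adjusting.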

Option 2: If you want to use formatting such as the font size or name, you can try pdfplumber. This type of search takes longer.

I tried both font name and font size; neither the font nor the font size is consistent across the two PDFs. However, the font size should always be larger for the header, so that may be all that is needed.

The following code extracts words together with their format data (I've included font size and name in the extracted information), keeps the text whose font size is between 20 and 24 points (in the example PDFs, one header was 22 and the other 20), and then searches that text for keyTerm.

This works for both of the example PDFs you gave; however, again, it may need tweaking or additional criteria to be completely effective.

import re
import pdfplumber


pdf1 = '68521500820.pdf'
pdf2 = '68366500425.pdf'  # second example PDF; swap it in below to test it
keyTerm = 'Consolidated Balance Sheet'

with pdfplumber.open(pdf1) as pdf:
    for i, page in enumerate(pdf.pages):
        # Extract each word along with its font name and size
        word_data = page.extract_words(extra_attrs=['fontname', 'size'])

        # Keep only words set in the header font-size range (20 to 24 pt)
        head_text = ' '.join(
            word['text'] for word in word_data if 20 <= word['size'] <= 24
        )

        # Report pages (1-based) whose header text contains the key term
        if re.search(keyTerm, head_text):
            print("Page " + str(i + 1))
            print(head_text)
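
If the fixed 20 to 24 point range does not carry over to other reports, one possible variation is to compare each word's size with the most common size on the page, on the assumption that body text dominates the page and the header is noticeably larger. The sketch below takes a list of word dicts shaped like the output of extract_words with the 'size' attribute included. The function name header_text and the ratio of 1.5 are hypothetical choices, not values measured on the example PDFs.

```python
from collections import Counter

def header_text(words, ratio=1.5):
    """Join the words whose font size is at least `ratio` times the
    most common font size among `words`."""
    if not words:
        return ""
    # Treat the most frequent (rounded) size as the body text size.
    sizes = Counter(round(w["size"], 1) for w in words)
    body_size = sizes.most_common(1)[0][0]
    # Keep words that are clearly larger than the body text.
    return " ".join(w["text"] for w in words if w["size"] >= ratio * body_size)
```

The returned string can then be searched for keyTerm in the same way as head_text. Pages made up mostly of large text, such as section dividers, would need separate handling.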

Upvotes: 1
