Reputation: 944

Excluding the Header and Footer Contents of a page of a PDF file while extracting text?

Is it possible to exclude the contents of a page's headers and footers when extracting text from a PDF file? These contents are the least important and are almost entirely redundant.

Note: To extract the text from the .pdf file, I am using the PyPDF2 package on Python 3.7.

How can I exclude the contents of the headers and footers in PyPDF2? Any help is appreciated.

The code snippet is as follows:

import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('C:\\Users\\Rocky\\Desktop\\req\\req\\0000 - gamma j.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while (startPage <= endPage):
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(1, 1)

Upvotes: 5

Answers (3)

Bill Bradbury

Reputation: 21

Similarly, I had a PDF document with footers, and I wanted to manipulate its text content. Annoyingly, the footers would appear at the top of the text returned by pypdf's extract_text method. Cropping the footer from the page does not change the result of extract_text, since cropping only changes the visible area of the page and leaves the underlying content in place. Even if you crop the pages of the original and write them to a new file, extract_text applied to the new file will still return the footers along with the text content.

I did find a hack that worked for me. Crop the footers from each page of the original and write them to a new PDF, let's call it TEMP.PDF. Open TEMP.PDF with Adobe Reader. Visually the footers are missing (but if you attempt extract_text on TEMP.PDF, you will find them still in the content returned).

In Adobe Reader, “Select All” in TEMP.PDF (Cmd+A on macOS) and copy to the clipboard (Cmd+C on macOS). Paste the clipboard into a new MS Word document and voila, you get all the text content of the original sans footers. You can then artificially paginate the MS Word document (manually add page breaks) to match the original PDF pagination, then create a new PDF document from the MS Word print menu; call it PRINT.PDF.

Now use extract_text on PRINT.PDF. Initially, I had a problem with missing new-line characters, but this was solved by adding parameters to the extract_text call as follows:

plainText = page.extract_text(extraction_mode="layout", layout_mode_space_vertically=False)

Upvotes: 1

Martin Thoma

Reputation: 136795

At the moment, pypdf (and the deprecated PyPDF2) does not offer this. It's also unclear how to do it well, as headers and footers are not semantically represented within the PDF.

As a heuristic, you could search for duplicates at the top and bottom of the extracted text of each page. That would likely work well for long documents and not work at all for 1-page documents.

You need to consider that the first few pages might have no header or a different header than the rest. Also, there can be differences between chapters and between even and odd pages.
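The duplicate-search heuristic could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of pypdf: the function name strip_repeated_edges and the parameters edge_lines and min_share are made up for this example, and their default values are guesses that would need tuning per document. Digits are replaced before comparing so that "Page 3" and "Page 4" count as the same line.

```python
import re
from collections import Counter


def strip_repeated_edges(pages, edge_lines=2, min_share=0.6):
    """Drop lines near the top/bottom of each page that recur on most pages.

    pages: list of extracted page texts, one string per page.
    edge_lines: how many lines at each edge to inspect (assumed value).
    min_share: share of pages a line must appear on to be treated as a
        header or footer (assumed value).
    """
    split_pages = [text.split("\n") for text in pages]
    if len(split_pages) < 2:
        # A single page gives nothing to compare against.
        return list(pages)

    def normalize(line):
        # Replace digit runs so that changing page numbers still match.
        return re.sub(r"\d+", "#", line.strip().lower())

    counts = Counter()
    for lines in split_pages:
        edge = lines[:edge_lines] + lines[-edge_lines:]
        # Use a set so each distinct line counts once per page.
        counts.update({normalize(line) for line in edge if line.strip()})

    threshold = min_share * len(split_pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        kept = [
            line
            for i, line in enumerate(lines)
            if not (
                (i < edge_lines or i >= len(lines) - edge_lines)
                and normalize(line) in repeated
            )
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Because only lines that recur on at least min_share of the pages are removed, a title page with a different header is left untouched, but so is any header that changes per chapter; that limitation is inherent to the heuristic.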

Side note: I'm the maintainer of pypdf and PyPDF2, and I think this will never be part of pypdf. The reason is that it cannot be done reliably: you need some contextual knowledge. That makes it a good fit for machine learning, but not such a good fit for a library. People would not be happy if it worked just 80% of the time, and we would constantly have to extend it.

Ideas on how to identify the footer

Go by the position: define a vertical threshold below which you assume the footer lies. You can then use visitor functions: https://pypdf2.readthedocs.io/en/3.0.0/user/extract-text.html#using-a-visitor
Try to find text patterns that appear at the bottom of every page.
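A minimal sketch of the position-based idea, using the visitor_text parameter of extract_text. The helper names make_body_visitor and extract_body_text, and the thresholds 50 and 720, are hypothetical and chosen only for illustration. It also assumes the vertical position of each text chunk can be read from tm[5]; depending on the PDF and the library version, the offset may instead be in cm[5], so check this against your own document.

```python
def make_body_visitor(parts, footer_y=50.0, header_y=720.0):
    """Return a visitor that keeps only text between two vertical thresholds.

    footer_y and header_y are hypothetical thresholds in PDF user-space
    units (origin at the bottom-left of the page); tune them per document.
    """

    def visitor(text, cm, tm, font_dict, font_size):
        # Assumption: tm[5] holds the vertical position of this text chunk.
        y = tm[5]
        if footer_y < y < header_y:
            parts.append(text)

    return visitor


def extract_body_text(pdf_path, footer_y=50.0, header_y=720.0):
    """Extract text page by page, skipping the assumed header/footer bands."""
    from pypdf import PdfReader  # imported lazily; requires pypdf installed

    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        parts = []
        visitor = make_body_visitor(parts, footer_y, header_y)
        # The visitor is called for each text chunk found on the page.
        page.extract_text(visitor_text=visitor)
        pages.append("".join(parts))
    return pages
```

Calling extract_body_text("document.pdf") would return one string per page. Fixed thresholds only work when every page shares the same layout, so it may help to print the y values for a sample page first.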

Upvotes: 3

Neha Duggirala

Reputation: 51

As PyPDF2 offers no official feature for this, I've written a function of my own to exclude the headers and footers of a PDF page, which works fine for my use case. You can add your own regex patterns to the page_format_pattern variable. Here I'm checking only the first and last elements of my list of lines. You can run this function for each page.

import re

def remove_header_footer(self, pdf_extracted_text):
    # Written as a method (hence `self`); call it once per page.
    # Matches page markers such as "Page 3" or "page12"; add your own patterns here.
    page_format_pattern = r'page\s*\d+'
    lines = pdf_extracted_text.split("\n")
    header = lines[0].strip()
    footer = lines[-1].strip()
    # Case-insensitive match, so the page text itself keeps its original case.
    # Drop the first line if it looks like a page marker or a bare page number.
    if re.search(page_format_pattern, header, re.IGNORECASE) or header.isnumeric():
        lines = lines[1:]
    # Likewise for the last line (skipped if nothing is left).
    if lines and (re.search(page_format_pattern, footer, re.IGNORECASE) or footer.isnumeric()):
        lines = lines[:-1]
    return "\n".join(lines)

Hope you find this helpful.

Upvotes: 5
