Reputation: 944
Is it possible to exclude the contents of a page's headers and footers when extracting text from a PDF file? This content is of little importance and is largely redundant.
Note: To extract the text from the .pdf file, I am using the PyPDF2 package on Python 3.7.
How can I exclude the contents of the headers and footers in PyPDF2? Any help is appreciated.
The code snippet is as follows:
import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('C:\\Users\\Rocky\\Desktop\\req\\req\\0000 - gamma j.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while (startPage <= endPage):
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(1, 1)
Upvotes: 5
Views: 10261
Reputation: 21
Similarly, I had a PDF document with footers whose text content I wanted to manipulate. Annoyingly, the footers would appear at the top of the text returned by pypdf's extract_text method. Cropping the footer from the page does not change the result of extract_text. Even if you crop the pages of the original and write them to a new file, extract_text applied to the new file still returns the footers along with the text content.
I did find a hack that worked for me. Crop the footers from each page of the original and write the cropped pages to a new PDF, let's call it TEMP.PDF. Open TEMP.PDF with Adobe Reader. Visually the footers are missing (but if you attempt extract_text on TEMP.PDF, you will find them still in the content returned).
In Adobe Reader, select all of TEMP.PDF (Cmd+A on macOS) and copy it to the clipboard (Cmd+C on macOS). Paste the clipboard into a new MS Word document and voilà, you get all the text content of the original without the footers. You can then artificially paginate the MS Word document (manually add page breaks) so that it corresponds to the original PDF pagination, then use the MS Word print menu to create a new PDF document, call it PRINT.PDF.
Now use extract_text on PRINT.PDF. Initially, I had a problem with missing new-line characters, but this was solved by adding parameters to the extract_text call as follows:
plainText = page.extract_text(extraction_mode="layout", layout_mode_space_vertically=False)
Upvotes: 1
Reputation: 136339
At the moment, pypdf (and the deprecated PyPDF2) does not offer this. It's also unclear how to do it well, as headers and footers are not semantically represented within the PDF.
As a heuristic, you could search for duplicates at the top and bottom of the extracted text of each page. That would likely work well for long documents and not work at all for 1-page documents.
You need to consider that the first few pages might have no header or a different header than the rest. Also, there can be differences between chapters and between even and odd pages.
Side note: I'm the maintainer of pypdf and PyPDF2 and I think this will never be inside pypdf. The reason is that it cannot be done reliably. You need some context knowledge. That makes it a good fit for machine learning, but not such a good fit for a library. People would not be happy if it worked just 80% of the time, and we would constantly have to extend it.
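The duplicate-search heuristic could look roughly like the sketch below. It is not part of pypdf; the function names, the edge_lines window, and the min_fraction threshold are all made up for illustration, and it assumes you already have one extracted text string per page. Digits are collapsed before comparing so that "Page 1" and "Page 2" count as the same line, which is an extra assumption on top of plain duplicate matching.

```python
import re
from collections import Counter


def _normalize(line):
    # Collapse digit runs so "Page 1" and "Page 2" compare equal (assumption).
    return re.sub(r"\d+", "#", line.strip())


def find_repeated_edge_lines(page_texts, edge_lines=2, min_fraction=0.6):
    """Return (headers, footers): normalized lines that recur at page edges."""
    top_counts = Counter()
    bottom_counts = Counter()
    for text in page_texts:
        lines = [_normalize(ln) for ln in text.split("\n") if ln.strip()]
        # Use sets so a line is counted at most once per page.
        top_counts.update(set(lines[:edge_lines]))
        bottom_counts.update(set(lines[-edge_lines:]))
    # Require at least two pages, so single-page documents are left untouched.
    threshold = max(2, min_fraction * len(page_texts))
    headers = {ln for ln, n in top_counts.items() if n >= threshold}
    footers = {ln for ln, n in bottom_counts.items() if n >= threshold}
    return headers, footers


def strip_repeated_edge_lines(page_texts, edge_lines=2, min_fraction=0.6):
    """Drop recurring top/bottom lines from each page (blank lines are dropped too)."""
    headers, footers = find_repeated_edge_lines(page_texts, edge_lines, min_fraction)
    cleaned = []
    for text in page_texts:
        lines = [ln for ln in text.split("\n") if ln.strip()]
        last = len(lines) - 1
        kept = []
        for i, ln in enumerate(lines):
            in_top = i < edge_lines
            in_bottom = i > last - edge_lines
            if in_top and _normalize(ln) in headers:
                continue
            if in_bottom and _normalize(ln) in footers:
                continue
            kept.append(ln)
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Report\nIntro text here.\nMore body.\nPage 1",
    "ACME Report\nSecond page body.\nPage 2",
    "ACME Report\nThird page body.\nEven more.\nPage 3",
]
cleaned = strip_repeated_edge_lines(pages)
```

A single threshold over the whole document will miss headers that change between chapters or between even and odd pages; running it per chapter or per page parity is one possible refinement.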
Upvotes: 3
Reputation: 51
Since PyPDF2 does not officially provide such a feature, I've written my own function to exclude the headers and footers of a PDF page, which works fine for my use case. You can add your own regex patterns to the page_format_pattern variable. Here I only check the first and last elements of my list of text lines.
You can run this function on each page.
import re

def remove_header_footer(pdf_extracted_text):
    # Matches page markers such as "page 3" or "Page12"; add your own patterns here.
    page_format_pattern = re.compile(r'page\s*\d+', re.IGNORECASE)
    lines = pdf_extracted_text.split("\n")
    header = lines[0].strip()
    footer = lines[-1].strip()
    # Drop the first line if it looks like a page marker or a bare page number.
    if page_format_pattern.search(header) or header.isnumeric():
        lines = lines[1:]
    # Apply the same check to the last line, if any lines are left.
    if lines and (page_format_pattern.search(footer) or footer.isnumeric()):
        lines = lines[:-1]
    return "\n".join(lines)
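To make the "add your own patterns" idea concrete, here is a minimal sketch that keeps the patterns in a list and applies the same first-line/last-line check to every page. The names HEADER_FOOTER_PATTERNS, looks_like_header_or_footer and strip_page_edges, as well as the example patterns themselves, are hypothetical and would need adjusting to your documents.

```python
import re

# Hypothetical example patterns; replace them with ones that fit your PDFs.
HEADER_FOOTER_PATTERNS = [
    re.compile(r"^page\s*\d+(\s*of\s*\d+)?$", re.IGNORECASE),  # "Page 3", "page 3 of 10"
    re.compile(r"^\d+$"),                                      # bare page number
    re.compile(r"^confidential$", re.IGNORECASE),              # fixed banner text
]


def looks_like_header_or_footer(line, patterns=HEADER_FOOTER_PATTERNS):
    """Return True if the stripped line matches any of the patterns."""
    stripped = line.strip()
    return any(p.search(stripped) for p in patterns)


def strip_page_edges(page_text, patterns=HEADER_FOOTER_PATTERNS):
    """Remove the first and/or last line of a page if it matches a pattern."""
    lines = page_text.split("\n")
    if lines and looks_like_header_or_footer(lines[0], patterns):
        lines = lines[1:]
    if lines and looks_like_header_or_footer(lines[-1], patterns):
        lines = lines[:-1]
    return "\n".join(lines)


# Stand-in strings for the text extracted from each page.
pages = ["Page 1\nFirst body line.\nConfidential", "Second body.\n2"]
cleaned_pages = [strip_page_edges(p) for p in pages]
```

Anchoring the patterns with ^ and $ keeps ordinary sentences that merely mention a page number from being removed.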
Hope you find this helpful.
Upvotes: 5