Jayashree Sridhar

Reputation: 65

python - read pdf ignoring header and footer

I have a PDF file that I am reading with PyMuPDF, using the code below.

import fitz  # this is pymupdf

with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        text += page.get_text()

Is there a way to ignore the header and footer while reading it?

I tried converting the PDF to docx, since headers are easier to remove there, but the file I am working on gets reformatted during the conversion.

Can PyMuPDF skip the header and footer while reading?

Upvotes: 1

Views: 7912

Answers (2)

dzejms

Reputation: 11

The documentation has a page dedicated to this problem.

  1. Define a rectangle that leaves out the header (and the footer).
  2. Use the page.get_textbox(rect) method to read only the text inside it.

Source: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction#2-pageget_textboxrect
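A minimal sketch of the two steps above, applied to every page. The helper names body_box and read_body_text and the 50-point default heights are assumptions made for illustration; they do not come from the linked page, so measure the header and footer in your own file.

```python
def body_box(page_width, page_height, header_height, footer_height):
    """Return (x0, y0, x1, y1) of the area between header and footer.

    PyMuPDF puts y = 0 at the top of the page, so the header occupies
    the smallest y values and the footer the largest.
    """
    if header_height + footer_height >= page_height:
        raise ValueError("header and footer leave no room for body text")
    return (0, header_height, page_width, page_height - footer_height)


def read_body_text(path, header_height=50, footer_height=50):
    """Join the text found inside body_box() on every page of the PDF."""
    import fitz  # PyMuPDF; imported here so body_box() works without it

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            box = body_box(page.rect.width, page.rect.height,
                           header_height, footer_height)
            # Step 2: extract only the text inside the rectangle.
            parts.append(page.get_textbox(fitz.Rect(*box)))
    return "\n".join(parts)
```

Calling read_body_text('file.pdf') would then replace the loop in the question. Separate header and footer heights are used because the two are often not the same size.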

A generic approach that works with most PDF libraries is to

  1. measure the height of the header/footer area in your PDF files
  2. loop over each piece of text in the document and check its vertical position
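With PyMuPDF, one way to sketch this generic approach is to ask for text blocks, which come with their bounding boxes, and drop the ones that reach into the header or footer area. The function names and the 50-point defaults below are hypothetical; the tuple layout (x0, y0, x1, y1, text, block_no, block_type) is the one returned by page.get_text("blocks").

```python
def keep_body_blocks(blocks, page_height, header_height, footer_height):
    """Return the text of blocks lying entirely between header and footer."""
    top = header_height                    # y grows downward from the page top
    bottom = page_height - footer_height
    kept = []
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        if block_type != 0:                # 0 = text block, 1 = image block
            continue
        if y0 >= top and y1 <= bottom:     # block sits fully inside the body
            kept.append(text)
    return kept


def read_body_text_by_blocks(path, header_height=50, footer_height=50):
    """Collect body text from every page by filtering blocks on position."""
    import fitz  # PyMuPDF; imported here so keep_body_blocks() works alone

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            blocks = page.get_text("blocks")
            parts.extend(keep_body_blocks(blocks, page.rect.height,
                                          header_height, footer_height))
    return "\n".join(parts)
```

Compared with clipping to a rectangle, this keeps or drops whole blocks, so a paragraph that starts just inside the header area is dropped in full rather than cut in the middle.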

Upvotes: 1

Waket Zheng

Reputation: 6381

As the official documentation says, you can use the clip argument of get_text to do it:

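import fitz  # PyMuPDF

doc = fitz.open(fname)
page = doc[0]
rect = page.rect
height = 50  # header/footer height in points; adjust for your file
# keep only the area between the header and the footer
clip = fitz.Rect(0, height, rect.width, rect.height - height)
text = page.get_text(clip=clip)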

Upvotes: 0
