Reputation: 65
I have a PDF file that I am reading with PyMuPDF, using the code below.
import fitz  # this is PyMuPDF

with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        text += page.get_text()  # named getText() in older PyMuPDF releases
Is there a way to ignore the header and footer while reading it?
I tried converting the PDF to DOCX, since headers are easier to remove there, but the conversion changes the formatting of the file I am working on.
Can PyMuPDF skip the header and footer while reading the file?
Upvotes: 1
Views: 7912
Reputation: 11
The documentation has a page dedicated to this problem; it describes the page.get_textbox(rect) method.
Source: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction#2-pageget_textboxrect
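A minimal sketch of how page.get_textbox(rect) might be used for this, assuming the header and footer each fit inside a fixed margin at the top and bottom of every page. The helper names body_rect and read_body and the 50-point default margin are illustrative assumptions, not part of PyMuPDF.

```python
def body_rect(page_rect, margin=50):
    """Return (x0, y0, x1, y1) for the page area between header and footer.

    page_rect is any 4-item sequence (x0, y0, x1, y1); margin is an assumed
    header/footer height in points.
    """
    x0, y0, x1, y1 = page_rect
    return (x0, y0 + margin, x1, y1 - margin)


def read_body(path, margin=50):
    """Concatenate the text inside body_rect for every page of the PDF."""
    # Imported here so body_rect can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            rect = fitz.Rect(body_rect(tuple(page.rect), margin))
            # get_textbox returns the text contained in the rectangle
            parts.append(page.get_textbox(rect))
    return "\n".join(parts)
```

Calling read_body('file.pdf') would then return the text of all pages without the top and bottom 50 points; the margin usually needs tuning per document.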
The generic solution that works for most pdf libraries is to
Upvotes: 1
Reputation: 6381
As the official documentation says, you can use the clip argument of page.get_text() to do this:
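import fitz  # PyMuPDF

doc = fitz.open(fname)  # fname: path to the PDF file
page = doc[0]
rect = page.rect
height = 50  # header/footer height in points; adjust for your document
# Keep only the area between the header and the footer
clip = fitz.Rect(0, height, rect.width, rect.height - height)
text = page.get_text(clip=clip)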
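The snippet above handles only the first page. A hedged sketch of how the same clip idea could be applied to every page, with separate header and footer heights, might look like this. The names clip_bounds and extract_text and the default heights are assumptions made for illustration.

```python
def clip_bounds(width, height, header=50, footer=50):
    """Return (x0, y0, x1, y1) of the region between header and footer.

    header and footer are assumed heights in points, measured from the
    top and bottom edge of the page.
    """
    if header < 0 or footer < 0:
        raise ValueError("header and footer must be non-negative")
    if header + footer >= height:
        raise ValueError("header and footer leave no room for body text")
    return (0, header, width, height - footer)


def extract_text(path, header=50, footer=50):
    """Extract text from all pages, skipping the header and footer bands."""
    # Imported here so clip_bounds can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    chunks = []
    with fitz.open(path) as doc:
        for page in doc:
            rect = page.rect
            clip = fitz.Rect(clip_bounds(rect.width, rect.height, header, footer))
            chunks.append(page.get_text(clip=clip))
    return "".join(chunks)
```

Computing the rectangle per page, rather than once, keeps the sketch usable for documents whose pages differ in size.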
Upvotes: 0