Reputation: 65
I'm trying to read a specific region of a PDF file. How can I do it?
I've tried:
Upvotes: 3
Views: 4361
Reputation: 11178
PyMuPDF can probably do this. I say "probably" because I haven't actually tried it on your PDF, so I cannot say for certain that the text is amenable to this process.
I just answered another question about getting the "highlighted text" from a page, and that solution uses the same parts of the PyMuPDF API that you need:
import os.path

import fitz
from fitz import Document, Page, Rect

# For visualizing the rects that PyMuPDF uses compared to what you see in the PDF
VISUALIZE = True

input_path = "test.pdf"
doc: Document = fitz.open(input_path)

for i in range(len(doc)):
    page: Page = doc[i]
    page.clean_contents()  # https://pymupdf.readthedocs.io/en/latest/faq.html#misplaced-item-insertions-on-pdf-pages

    # Hard-code the rect you need
    rect = Rect(0, 0, 100, 100)

    if VISUALIZE:
        # Draw a red box to visualize the rect's area (text)
        page.draw_rect(rect, width=1.5, color=(1, 0, 0))

    text = page.get_textbox(rect)
    print(text)

if VISUALIZE:
    head, tail = os.path.split(input_path)
    viz_name = os.path.join(head, "viz_" + tail)
    doc.save(viz_name)
For context, here's the project I just finished, where this approach worked for highlighted text: https://github.com/zacharysyoung/extract_highlighted_text.
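If you don't know the coordinates to hard-code yet, one way to find them is to list every word with its bounding box and merge the boxes of the words you care about. This is only a sketch: it assumes the tuples returned by PyMuPDF's page.get_text("words"), which start with (x0, y0, x1, y1, word). The names list_words and bounding_rect are hypothetical helpers of mine, not part of PyMuPDF, and I haven't run this against your PDF.

```python
def list_words(pdf_path, page_number=0):
    """Return PyMuPDF's word tuples (x0, y0, x1, y1, word, ...) for one page."""
    import fitz  # PyMuPDF; imported lazily so bounding_rect works without it

    doc = fitz.open(pdf_path)
    words = doc[page_number].get_text("words")
    doc.close()
    return words


def bounding_rect(words, margin=1.0):
    """Merge word boxes into one (x0, y0, x1, y1) tuple, padded by `margin`."""
    if not words:
        raise ValueError("no words to enclose")
    x0 = min(w[0] for w in words) - margin
    y0 = min(w[1] for w in words) - margin
    x1 = max(w[2] for w in words) + margin
    y1 = max(w[3] for w in words) + margin
    return (x0, y0, x1, y1)


# Made-up sample tuples in the same shape as get_text("words") output
sample_words = [
    (506.0, 115.0, 520.0, 128.0, "SP", 0, 0, 0),
    (522.0, 116.0, 549.0, 127.5, "01", 0, 0, 1),
]
print(bounding_rect(sample_words, margin=0))
```

You would then pass the resulting four numbers to Rect(...) and check the result with the red box from VISUALIZE.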
Upvotes: 3
Reputation: 65
Using Zach Young's answer, this is the final code:
import fitz  # PyMuPDF
import xlsxwriter
from fitz import Document, Page, Rect


def get_data():
    # INPUT
    pdf_in = '_SPTs.pdf'

    # Rectangles defining data to be extracted
    furo_rect = Rect(506, 115, 549, 128)
    spt_rect = Rect(388, 201, 422, 677)
    na_rect = Rect(464, 760, 501, 767)

    # fitz Document
    doc: Document = fitz.open(pdf_in)

    # Pages loop
    spt_data = []
    for i in range(len(doc)):
        page: Page = doc[i]
        furo = page.get_textbox(furo_rect)
        spt = page.get_textbox(spt_rect).splitlines()
        na = page.get_textbox(na_rect)
        spt_data.append([furo, spt, na])
        print(f'Furo: {furo} | SPT: {spt} | NA: {na}')

    # Export values to Excel with some data handling
    workbook = xlsxwriter.Workbook('_SPTs_pymu.xlsx')
    worksheet = workbook.add_worksheet()
    for i, data in enumerate(spt_data):
        worksheet.write(i, 0, data[0])
        for j in range(len(data[1])):
            # Numeric strings become floats, '-' becomes 0, anything else stays text
            try:
                spt_value = float(data[1][j])
            except ValueError:
                if data[1][j] == '-':
                    spt_value = 0
                else:
                    spt_value = data[1][j]
            worksheet.write(i, j + 1, spt_value)
        try:
            na_value = float(data[2])
        except ValueError:
            na_value = data[2]
        worksheet.write(i, 19, na_value)  # NA always goes in column index 19
    workbook.close()
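The number handling in the export loop above could also be pulled into a small helper, which makes it easier to check against a few sample strings before writing the workbook. This is just a sketch: to_cell_value is a hypothetical name, and stripping whitespace before the comparison is my own assumption, since the code above compares the raw string.

```python
def to_cell_value(raw):
    """Convert extracted text to a number where possible.

    '-' becomes 0, numeric strings become floats, and anything else
    is returned as (stripped) text.
    """
    text = raw.strip()  # assumption: surrounding whitespace is not meaningful
    if text == "-":
        return 0
    try:
        return float(text)
    except ValueError:
        return text


# Made-up sample of what one SPT column might look like after get_textbox()
sample = "12\n-\n30/15"
print([to_cell_value(v) for v in sample.splitlines()])  # [12.0, 0, '30/15']
```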
Upvotes: 1
Reputation: 11728
By far the easiest method, when you have accurate text data at known coordinates, is to extract the text through a number of viewports. It could be done as one viewport, filtering out the lines that are not required, but in this case it's easier to extract three text windows and combine them into one text output.
Here I am using xpdf 4.04, since I find it easiest to define the coordinates visually, but you can do something very similar in Python by shelling out to the included Poppler version of pdftotext, which uses an -x -y -W -H syntax. Note that you may need to tweak my values to suit your actual layouts, so check a few edge cases and over-compensate slightly without capturing surrounding entries.
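For the Python route, a rough sketch of shelling out to pdftotext might look like the following. It assumes a Poppler pdftotext on your PATH, and that the crop values are whole numbers measured from the top-left corner at the default 72 DPI, so that an (x0, y0, x1, y1) box in points maps to -x x0 -y y0 -W (x1 - x0) -H (y1 - y0). The function names are hypothetical, the example box is the "furo" rectangle from the other answer, and I have not run this on your file.

```python
import subprocess


def pdftotext_args(pdf_path, rect, page=1):
    """Build a pdftotext command that crops one page to rect = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = (round(v) for v in rect)  # the crop options take integers
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),   # first and last page to convert
        "-x", str(x0), "-y", str(y0),       # top-left corner of the crop area
        "-W", str(x1 - x0), "-H", str(y1 - y0),
        pdf_path,
        "-",                                # write the text to stdout
    ]


def extract_region(pdf_path, rect, page=1):
    """Run pdftotext and return the text found inside rect on the given page."""
    result = subprocess.run(
        pdftotext_args(pdf_path, rect, page),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(" ".join(pdftotext_args("_SPTs.pdf", (506, 115, 549, 128))))
```

Calling extract_region once per window and joining the three results gives the combined text output described above.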
Upvotes: 1