GCBrgt

Reputation: 65

Read specific region from PDF

I'm trying to read the text in a specific region of a PDF page. How can I do it?

I've tried:

  1. Using PyPDF2 to crop the PDF page and read only the cropped part. This doesn't work because PyPDF2's cropbox only shrinks the "view" and keeps all the items outside the specified cropbox. So calling extract_text() on the cropped PDF returns all the "invisible" content, not only the cropped part.
  2. Converting the PDF page to PNG, cropping it, and using Pytesseract to read the PNG. Pytesseract doesn't work properly, and I don't know why.

Upvotes: 3

Views: 4361

Answers (3)

Zach Young

Reputation: 11178

PyMuPDF can probably do this.

I just answered another question regarding getting the "highlighted text" from a page, but the solution uses the same relevant parts of the PyMuPDF API you want:

  • figure out a rectangle that defines the area of interest
  • extract text based on that rectangle

and I say "probably" because I haven't actually tried it on your PDF, so I cannot say for certain that the text is amenable to this process.

import os.path

import fitz
from fitz import Document, Page, Rect


# For visualizing the rects that PyMuPDF uses compared to what you see in the PDF
VISUALIZE = True

input_path = "test.pdf"
doc: Document = fitz.open(input_path)

for i in range(len(doc)):
    page: Page = doc[i]
    page.clean_contents()  # https://pymupdf.readthedocs.io/en/latest/faq.html#misplaced-item-insertions-on-pdf-pages

    # Hard-code the rect you need
    rect = Rect(0, 0, 100, 100)

    if VISUALIZE:
        # Draw a red box to visualize the rect's area (text)
        page.draw_rect(rect, width=1.5, color=(1, 0, 0))

    text = page.get_textbox(rect)

    print(text)


if VISUALIZE:
    head, tail = os.path.split(input_path)
    viz_name = os.path.join(head, "viz_" + tail)
    doc.save(viz_name)

For context, here's the project I just finished where this was working for the highlighted text, https://github.com/zacharysyoung/extract_highlighted_text.
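If get_textbox() picks up slightly more or less than you expect, one possible alternative is to take word-level boxes and filter them against the rectangle yourself. The following is only a minimal sketch: it assumes word tuples in the layout that page.get_text("words") returns, i.e. (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper name words_in_rect and the sample data are hypothetical, and the sketch is plain Python so it can be tried without a PDF.

```python
from typing import Iterable, List, Sequence, Tuple

# Assumed layout of one entry from PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def words_in_rect(words: Iterable[Word], rect: Sequence[float]) -> List[str]:
    """Return the words whose box center lies inside rect = (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = rect
    kept = []
    for x0, y0, x1, y1, text, *_ in words:
        # Use the center of the word box so words straddling the edge
        # are kept only when most of the box is inside the rectangle.
        cx = (x0 + x1) / 2
        cy = (y0 + y1) / 2
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            kept.append(text)
    return kept


# Hypothetical sample data standing in for page.get_text("words")
sample_words: List[Word] = [
    (10.0, 10.0, 40.0, 20.0, "Furo", 0, 0, 0),
    (45.0, 10.0, 60.0, 20.0, "12", 0, 0, 1),
    (300.0, 400.0, 330.0, 410.0, "outside", 1, 0, 0),
]

print(words_in_rect(sample_words, (0, 0, 100, 100)))
```

With a real page you would pass page.get_text("words") as the first argument and the same coordinates you would have given to Rect.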

Upvotes: 3

GCBrgt

Reputation: 65

Using Zach Young's answer, this is the final code:

import fitz
import xlsxwriter
from fitz import Document, Page, Rect


def get_data():
    # INPUT
    pdf_in = '_SPTs.pdf'

    # Rectangles defining data to be extracted
    furo_rect = Rect(506, 115, 549, 128)
    spt_rect = Rect(388, 201, 422, 677)
    na_rect = Rect(464, 760, 501, 767)

    # fitz Document
    doc: Document = fitz.open(pdf_in)

    # Pages loop: read the three regions from every page
    spt_data = []
    for i in range(len(doc)):
        page: Page = doc[i]
        furo = page.get_textbox(furo_rect)
        spt = page.get_textbox(spt_rect).splitlines()
        na = page.get_textbox(na_rect)
        spt_data.append([furo, spt, na])
        print(f'Furo: {furo} | SPT: {spt} | NA: {na}')

    # Export values to Excel with some data handling
    workbook = xlsxwriter.Workbook('_SPTs_pymu.xlsx')
    worksheet = workbook.add_worksheet()
    for i, data in enumerate(spt_data):
        worksheet.write(i, 0, data[0])
        for j in range(len(data[1])):
            try:
                spt_value = float(data[1][j])
            except ValueError:
                # Not a number: '-' counts as zero, anything else is kept as text
                if data[1][j] == '-':
                    spt_value = 0
                else:
                    spt_value = data[1][j]
            worksheet.write(i, j + 1, spt_value)
        try:
            na_value = float(data[2])
        except ValueError:
            # Keep non-numeric NA values as text
            na_value = data[2]
        worksheet.write(i, 19, na_value)
    workbook.close()
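The two try/except blocks above apply the same rule (use the number if the text parses as one, otherwise keep the text, with '-' optionally counted as zero), so they could be pulled into one small helper. This is just a sketch of that idea; the name parse_cell and the dash_as_zero parameter are made up for illustration and are not part of the original code.

```python
from typing import Union


def parse_cell(raw: str, dash_as_zero: bool = False) -> Union[float, int, str]:
    """Convert extracted text to a number where possible.

    Falls back to the original text when it is not numeric. When
    dash_as_zero is True, a lone '-' is treated as 0, mirroring the
    handling of the SPT column in the code above.
    """
    try:
        return float(raw)
    except ValueError:
        if dash_as_zero and raw == '-':
            return 0
        return raw


# Hypothetical values as they might come out of get_textbox().splitlines()
for value in ['12', '3.5', '-', 'N/A']:
    print(repr(value), '->', repr(parse_cell(value, dash_as_zero=True)))
```

In the export loop this would be used as worksheet.write(i, j + 1, parse_cell(data[1][j], dash_as_zero=True)) for the SPT values and worksheet.write(i, 19, parse_cell(data[2])) for the NA value.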

Upvotes: 1

K J

Reputation: 11728

By far the easiest method, when you have accurate text data at given coordinates, is to extract it as a number of viewports. It could be done as one viewport, filtering out the lines that are not required, but in this case it's easier to extract three text windows and combine them into one text output.

Here I am using xpdf 4.04, since for me that is the easiest way to define the coordinates visually, but you can do something very similar in Python by shelling out to the included Poppler version of pdftotext, which uses an -x -y -W -H syntax. Note that you may need to tweak my values to suit your actual layouts, so check a few edge cases and overcompensate slightly without capturing surrounding entries.
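As a rough illustration of the Python route, the sketch below builds a Poppler pdftotext command for one crop window and runs it with subprocess. It assumes pdftotext is on your PATH. The function names build_pdftotext_cmd and extract_region are hypothetical, and the file name and numbers in the example are placeholders to adjust, not my values. With Poppler, -x and -y give the top-left corner of the crop area, -W and -H its width and height, -f and -l the first and last page, and "-" as the output file sends the text to stdout.

```python
import subprocess
from typing import List


def build_pdftotext_cmd(pdf_path: str, x: int, y: int, width: int, height: int,
                        page: int = 1) -> List[str]:
    """Build a Poppler pdftotext command that crops one window on one page."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),    # restrict to a single page
        "-x", str(int(x)), "-y", str(int(y)),  # top-left corner of the window
        "-W", str(int(width)), "-H", str(int(height)),  # window size
        pdf_path,
        "-",                                 # write the text to stdout
    ]


def extract_region(pdf_path: str, x: int, y: int, width: int, height: int,
                   page: int = 1) -> str:
    """Run pdftotext (assumed to be installed) and return the cropped text."""
    cmd = build_pdftotext_cmd(pdf_path, x, y, width, height, page)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Placeholder window: show the command without running it
print(" ".join(build_pdftotext_cmd("_SPTs.pdf", 506, 115, 43, 13)))
```

Calling extract_region once per window and joining the results gives the combined text output described above.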


Upvotes: 1
