Damiano Shehaj

Reputation: 1

How to retrieve a specific part of text from a PDF knowing the respective coordinates?

I need help with a Python script that I am writing. It will handle some tasks involving PDFs. Right now I am trying to retrieve a specific part of the text from a PDF given its coordinates, and I can't find a way to do it. I have checked different libraries such as PyPDF2 and pdfminer but found nothing.

The PyMuPDF library, more specifically its module "fitz", can do the opposite: given a string, it returns the coordinates of each occurrence of that string on a page of the PDF file.

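# fitz.py usage example
# (loadPage/searchFor are the older camelCase names; newer PyMuPDF uses load_page/search_for)
import fitz  # PyMuPDF

doc = fitz.Document("pdf_name.pdf")
page_mupdf = doc.loadPage(0)  # first page of the document
areas = page_mupdf.searchFor("text_to_search", hit_max=16)
print(areas)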

[Rect(90.0, 145.8567657470703, 142.13255310058594, 156.50209045410156)]

Upvotes: 0

Views: 1086

Answers (1)

Manuel

Reputation: 113

Once you have the page's text as a string, try searching it with regex functions:

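import re
import fitz  # PyMuPDF

doc = fitz.Document("pdf_name.pdf")
page_mupdf = doc.loadPage(0)
page_text = page_mupdf.getText()  # re.search needs a string, not a Page object
text_to_find = re.search("text_to_search", page_text)
if text_to_find:  # re.search returns None when nothing matches
    print(text_to_find[0])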
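
For the direction the question actually asks about (rectangle in, text out), here is a minimal sketch. It assumes a recent PyMuPDF release with snake_case method names, where page.get_text("words") returns tuples of the form (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper names words_in_rect and text_in_rect and the tol parameter are hypothetical, made up for this illustration; the tolerance exists because a word box and a search rectangle may not match to the last decimal.

```python
def words_in_rect(words, rect, tol=1.0):
    """Join the words whose boxes lie inside rect = (x0, y0, x1, y1).

    `words` is a list of tuples shaped like PyMuPDF's "words" output:
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    `tol` (hypothetical parameter) widens rect slightly on every side.
    """
    rx0, ry0, rx1, ry1 = rect
    inside = [
        w for w in words
        if w[0] >= rx0 - tol and w[1] >= ry0 - tol
        and w[2] <= rx1 + tol and w[3] <= ry1 + tol
    ]
    # Keep the reading order reported by the extractor: block, line, word.
    inside.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in inside)


def text_in_rect(pdf_path, page_number, rect):
    """Open a PDF and return the text found inside rect on one page."""
    # Imported here so words_in_rect stays usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        page = doc.load_page(page_number)
        return words_in_rect(page.get_text("words"), rect)
    finally:
        doc.close()


# Example call, with a rectangle such as the one searchFor returned above:
# text_in_rect("pdf_name.pdf", 0, (90.0, 145.86, 142.13, 156.50))
```

In recent PyMuPDF versions, page.get_text("text", clip=rect) should give a similar result in a single call; the word-filtering version is shown only because it makes the coordinate test explicit and easy to adjust.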

Upvotes: -2
