Reputation: 1
I have a PDF document that contains internal hyperlinks leading to different subsections of the same document.
I want to parse these "GoTo" links to create something like a knowledge graph for that document. Is there any way I could read/parse these links?
I have tried conventional PDF readers such as PyPDF2 and pdfplumber, but as far as I can tell they only extract the text, so I could not fetch those hyperlinks with them.
Upvotes: 0
Views: 106
Reputation: 3150
Here is a solution using PyMuPDF:
import pymupdf
internal_links = []  # list of links to pages in the document
# internal link types: GoTo or Named
valid_types = (pymupdf.LINK_GOTO, pymupdf.LINK_NAMED)
doc = pymupdf.open("pdf-user-manual.pdf")
for page in doc:
    # keep only the links on this page that point inside the document
    links = [l for l in page.get_links() if l["kind"] in valid_types]
    internal_links.extend(links)
Every list item is a dictionary like this one:
{'from': Rect(278.1600036621094, 383.9999694824219, 333.80999755859375, 399.5999755859375),
'id': '',
'kind': 1,
'page': 1,
'to': Point(87.0, 72.91998),
'xref': 23,
'zoom': 0.0}
The "kind" value is an integer code for the PDF link type: GoTo, GoToR, Named, URI or Launch. In the example above, 1 is pymupdf.LINK_GOTO.
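For the knowledge-graph goal in the question, dictionaries like the one above could be reduced to page-to-page edges. The following is only a minimal sketch and not part of PyMuPDF: `build_page_graph` and `links_by_page` are hypothetical names, and it assumes each dictionary carries the 'kind' and 'page' keys shown above, with kind 1 meaning GoTo as in the sample. Links without an integer 'page' (which may be the case for some Named links) are skipped.

```python
from collections import defaultdict

LINK_GOTO = 1  # assumption: matches 'kind': 1 in the sample dictionary above


def build_page_graph(links_by_page):
    """Map each source page number to the sorted target pages it links to.

    links_by_page: {source_page_number: [link_dict, ...]} (hypothetical input shape)
    """
    graph = defaultdict(set)
    for source, links in links_by_page.items():
        for link in links:
            target = link.get("page")
            # keep only GoTo links whose target page is a known, non-negative integer
            if link.get("kind") == LINK_GOTO and isinstance(target, int) and target >= 0:
                graph[source].add(target)
    return {source: sorted(targets) for source, targets in graph.items()}


# hand-written sample data, not taken from a real PDF
sample = {
    0: [{"kind": 1, "page": 1}, {"kind": 2, "uri": "http://example.com"}],
    1: [{"kind": 1, "page": 3}, {"kind": 1, "page": 1}],
}
print(build_page_graph(sample))
```

With PyMuPDF, the input could presumably be collected as `{page.number: page.get_links() for page in doc}`, so that each link keeps the page it was found on.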
Note: I am a maintainer and the original creator of PyMuPDF.
Upvotes: 0
Reputation: 18136
Taking the PDF from https://hugepdf.com/download/user-manual-43_pdf, which has both external and internal links, you could look for all annotations with the subtype /Link:
from PyPDF2 import PdfReader
from pprint import pprint
# pdf: https://hugepdf.com/download/user-manual-43_pdf
with open("/tmp/pdf-user-manual.pdf", 'rb') as f:
    pdf = PdfReader(f)
    for i, page in enumerate(pdf.pages):
        # pages without annotations have no "/Annots" entry
        if "/Annots" in page:
            for annot in page["/Annots"]:
                subtype = annot.get_object()["/Subtype"]
                if subtype == "/Link":
                    pprint(annot.get_object())
Out:
{'/BS': {'/W': 0},
'/Dest': [IndirectObject(24, 0, 4373945936), '/XYZ', 87, 769, 0],
'/F': 4,
'/Rect': [278.16, 442.32, 333.81, 457.92],
'/StructParent': 1,
'/Subtype': '/Link'}
{'/A': {'/S': '/URI',
'/Type': '/Action',
'/URI': 'http://www.microsoft.com/visualstudio/eng'},
'/BS': {'/W': 0},
'/F': 4,
'/Rect': [476.68, 395.52, 509.92, 411.12],
'/StructParent': 2,
'/Subtype': '/Link'}
{'/A': {'/S': '/URI',
'/Type': '/Action',
'/URI': 'http://www.microsoft.com/visualstudio/eng'},
'/BS': {'/W': 0},
'/F': 4,
'/Rect': [87.75, 379.92, 162.35, 395.52],
'/StructParent': 3,
'/Subtype': '/Link'}
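In the output above, the internal link carries a /Dest entry, while the external ones carry a /A action whose /S is /URI. A minimal sketch of telling them apart could look like this. `classify_link` is a hypothetical helper, and it is written against plain dictionaries shaped like the printed output; with real PyPDF2 objects, nested entries may be indirect references that need resolving first.

```python
def classify_link(annot):
    """Return (kind, target) for a link annotation given as a plain dict."""
    if "/Dest" in annot:
        # a direct destination: a jump inside the same document
        return ("internal", annot["/Dest"])
    action = annot["/A"] if "/A" in annot else {}
    action_type = action["/S"] if "/S" in action else None
    if action_type == "/URI":
        # external link: the target is the URI string
        return ("uri", action["/URI"])
    if action_type == "/GoTo":
        # GoTo actions keep their destination under /D
        return ("internal", action["/D"] if "/D" in action else None)
    return ("other", action_type)


# sample dictionaries modelled on the output above (page reference replaced by a placeholder)
internal = {"/Dest": ["<page ref>", "/XYZ", 87, 769, 0], "/Subtype": "/Link"}
external = {
    "/A": {"/S": "/URI", "/Type": "/Action",
           "/URI": "http://www.microsoft.com/visualstudio/eng"},
    "/Subtype": "/Link",
}
print(classify_link(internal)[0])
print(classify_link(external))
```

Only the "internal" results would be of interest for a graph of the document's own sections.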
Upvotes: 0