Shivam Goswami

Reputation: 1

Is there a way to parse self-contained links inside a PDF with Python?

I have a PDF document that contains internal hyperlinks leading to different subsections of the same document.

I want to parse these "GoTo" links to create something like a knowledge graph for that document. Is there any way I could read/parse these links?

I have tried conventional PDF readers such as PyPDF2 and pdfplumber, but since they seem to work like OCR, I could not fetch those hyperlinks with them.

Upvotes: 0

Views: 106

Answers (2)

Jorj McKie

Reputation: 3150

Here is a solution using PyMuPDF:

import pymupdf

internal_links = []  # list of links to pages in the document

# internal link types: GoTo or Named
valid_types = (pymupdf.LINK_GOTO, pymupdf.LINK_NAMED)

doc = pymupdf.open("pdf-user-manual.pdf")
for page in doc:
    links = [link for link in page.get_links() if link["kind"] in valid_types]
    internal_links.extend(links)

Every list item is a dictionary like this one:

{'from': Rect(278.1600036621094, 383.9999694824219, 333.80999755859375, 399.5999755859375),
  'id': '',
  'kind': 1,
  'page': 1,
  'to': Point(87.0, 72.91998),
  'xref': 23,
  'zoom': 0.0}

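The "kind" value identifies the PDF link type: GoTo, GoToR, Named, URI or Launch. In the example above, 1 is pymupdf.LINK_GOTO, and "page" is the 0-based number of the target page.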
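
Since the question asks for something like a knowledge graph, here is a minimal sketch of how such link dictionaries could be turned into page-to-page edges. It runs on plain dictionaries, so it does not need PyMuPDF itself. The names build_page_graph and links_by_page are hypothetical, and the sample data is made up to match the shape of the dictionary shown above.

```python
from collections import defaultdict

# Numeric value of pymupdf.LINK_GOTO, hard-coded so the sketch runs without PyMuPDF.
LINK_GOTO = 1


def build_page_graph(links_by_page):
    """Map each source page number to the sorted list of pages it links to.

    links_by_page is assumed to be {source_page_number: [link dict, ...]}.
    """
    graph = defaultdict(set)
    for source_page, links in links_by_page.items():
        for link in links:
            target = link.get("page")
            # Keep only GoTo links that carry a resolved target page.
            if link.get("kind") == LINK_GOTO and target is not None and target >= 0:
                graph[source_page].add(target)
    return {page: sorted(targets) for page, targets in graph.items()}


# Hypothetical sample data shaped like the dictionary shown above.
sample = {
    0: [{"kind": 1, "page": 1}, {"kind": 1, "page": 3}],
    1: [{"kind": 2, "uri": "http://example.com"}],
    2: [{"kind": 1, "page": 1}],
}
page_graph = build_page_graph(sample)
print(page_graph)
```

With a real document, the input could presumably be built as {page.number: page.get_links() for page in doc}. The loop in the answer above would need the same change, because extending one flat list loses the page each link came from.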

Note: I am a maintainer and the original creator of PyMuPDF.

Upvotes: 0

Maurice Meyer

Reputation: 18136

Taking the PDF from here: https://hugepdf.com/download/user-manual-43_pdf, which has both external and internal links, you can look up all annotations with the subtype /Link:

from pprint import pprint

from PyPDF2 import PdfReader

# pdf: https://hugepdf.com/download/user-manual-43_pdf
with open("/tmp/pdf-user-manual.pdf", "rb") as f:
    pdf = PdfReader(f)

    for page in pdf.pages:
        if "/Annots" in page:
            for annot in page["/Annots"]:
                obj = annot.get_object()  # resolve the indirect reference
                if obj["/Subtype"] == "/Link":
                    pprint(obj)

Out:

{'/BS': {'/W': 0},
 '/Dest': [IndirectObject(24, 0, 4373945936), '/XYZ', 87, 769, 0],
 '/F': 4,
 '/Rect': [278.16, 442.32, 333.81, 457.92],
 '/StructParent': 1,
 '/Subtype': '/Link'}
{'/A': {'/S': '/URI',
        '/Type': '/Action',
        '/URI': 'http://www.microsoft.com/visualstudio/eng'},
 '/BS': {'/W': 0},
 '/F': 4,
 '/Rect': [476.68, 395.52, 509.92, 411.12],
 '/StructParent': 2,
 '/Subtype': '/Link'}
{'/A': {'/S': '/URI',
        '/Type': '/Action',
        '/URI': 'http://www.microsoft.com/visualstudio/eng'},
 '/BS': {'/W': 0},
 '/F': 4,
 '/Rect': [87.75, 379.92, 162.35, 395.52],
 '/StructParent': 3,
 '/Subtype': '/Link'}
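
In the output above, the first annotation is an internal link (it has a /Dest entry), while the other two are external (a /URI action). A rough sketch of how the two could be told apart follows. The helper name link_target is hypothetical, the sample dictionaries are simplified copies of the output with a placeholder string instead of the IndirectObject, and the /GoTo branch relies on the PDF convention that a GoTo action stores its destination under /D.

```python
def link_target(annot):
    """Classify a /Link annotation dictionary.

    Returns ("internal", destination), ("external", uri), or None
    when the annotation carries neither kind of target.
    """
    # A direct destination means a jump inside the same document.
    if "/Dest" in annot:
        return ("internal", annot["/Dest"])
    action = annot.get("/A", {})
    # A GoTo action is also internal; its destination sits under /D.
    if action.get("/S") == "/GoTo":
        return ("internal", action.get("/D"))
    if action.get("/S") == "/URI":
        return ("external", action.get("/URI"))
    return None


# Simplified sample data based on the output above (placeholder page reference).
annots = [
    {"/Dest": ["<page ref>", "/XYZ", 87, 769, 0], "/Subtype": "/Link"},
    {"/A": {"/S": "/URI", "/URI": "http://www.microsoft.com/visualstudio/eng"},
     "/Subtype": "/Link"},
]
results = [link_target(a) for a in annots]
print(results)
```

With real PyPDF2 objects, the /A entry may itself be an indirect reference, in which case it would need get_object() before the lookups.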

Upvotes: 0
