Justin Erswell

Reputation: 708

PyMuPDF (Fitz) QuadPoints for re-use in the Adobe Embed API

I am trying to extract annotations from a PDF and then use that data to cherry-pick the annotations we need and display them on a clean version of the PDF using the Adobe Embed API. Extracting the data with PyMuPDF works fine, but when we apply an annotation to the clean PDF, the result does not match the source annotation.

First, parse the annotated PDF, Sandborn2003Annotated.pdf.

Then, using the Adobe Document Cloud Embed API (https://developer.adobe.com/document-services/docs/overview/pdf-embed-api/), apply the first annotation in the stored JSON to a clean version of the file, Sandborn2003.pdf.


Edit: this is the output file, Sandborn 2003 (1).pdf


I expect the clean PDF to show the same annotation as the source.

Screenshots of before (source) and after (target):

Source annotated document: [screenshot]

Target PDF after applying the annotation: [screenshot]

This is my code:

import fitz
import json
import sys
from datetime import datetime

if len(sys.argv) != 2:
    print('Usage: python extractPDFAnnotations.py <output.json>')
    sys.exit(1)
filename = sys.argv[1]

doc = fitz.open('Sandborn2003Annotated.pdf')
annotations = []

for page_num, page in enumerate(doc):
    for annot in page.annots():
        annotation_data = {}
        target_data = {}
        selector_data = {}

        # Common properties for all annotation types
        annotation_data["@context"] = [
            "https://www.w3.org/ns/anno.jsonld",
            "https://comments.acrobat.com/ns/anno.jsonld",
        ]
        annotation_data["id"] = annot.info.get("id", "")
        annotation_data["type"] = "Annotation"
        annotation_data["motivation"] = "commenting"
        annotation_data["bodyValue"] = annot.info.get("content", "")

        # Target properties
        # Replace this with the actual source identifier
        target_data["selector"] = selector_data
        annotation_data["target"] = target_data

        # Selector properties
        selector_data["node"] = {"index": page_num}
        selector_data["type"] = "AdobeAnnoSelector"

        if annot.type[0] == 8:  # Highlight annotation
            all_coordinates = annot.vertices
            highlights = []

            for i in range(0, len(all_coordinates), 4):
                quad = all_coordinates[i:i + 4]
                highlight_coord = fitz.Quad(quad).rect
                highlights.extend(
                    [highlight_coord.x0, highlight_coord.y0, highlight_coord.x1, highlight_coord.y1])

            selector_data["quadPoints"] = highlights
            # Adjust the opacity value as needed
            selector_data["opacity"] = 0.4
            selector_data["subtype"] = "highlight"
            selector_data["boundingBox"] = [
                annot.rect.x0,
                annot.rect.y0,
                annot.rect.x1,
                annot.rect.y1,
            ]
            # Adjust the stroke color as needed
            selector_data["strokeColor"] = "#fccb00"
            # Adjust the stroke width as needed
            selector_data["strokeWidth"] = 3

        # Creator properties
        annotation_data["creator"] = {
            # Replace this with the actual creator's name
            "name": annot.info.get("title", ""),
            "type": "Person",
        }

        annotations.append(annotation_data)

# Save the annotations to a JSON file
with open(filename, 'w') as f:
    json.dump(annotations, f, indent=4)

print(f'Annotations saved to {filename}')
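
For reference, the PDF specification stores /QuadPoints as eight numbers per quadrilateral (four x, y corners), whereas the loop above emits only four numbers per quad (x0, y0, x1, y1). Below is a minimal sketch of flattening corner points into eight numbers per quad. It is plain Python with no PyMuPDF dependency; `flatten_quads` and `page_height` are hypothetical names, and whether the Embed API needs the optional y flip (bottom-left rather than top-left origin) is an assumption I have not confirmed.

```python
from typing import Iterable, List, Optional, Tuple

Point = Tuple[float, float]


def flatten_quads(vertices: Iterable[Point],
                  page_height: Optional[float] = None) -> List[float]:
    """Flatten corner points into 8 numbers per quad: x1, y1, ..., x4, y4.

    If page_height is given, every y is mirrored (y -> page_height - y) to
    switch between a top-left and a bottom-left origin. Whether that flip is
    needed for the target API is an assumption to verify.
    """
    points = list(vertices)
    if len(points) % 4 != 0:
        raise ValueError("expected 4 corner points per quad")
    flat: List[float] = []
    for x, y in points:
        if page_height is not None:
            y = page_height - y
        flat.extend([x, y])
    return flat


# One quad given as four (x, y) corners, shaped like the items of annot.vertices
corners = [(10.0, 20.0), (110.0, 20.0), (10.0, 32.0), (110.0, 32.0)]
quad_points = flatten_quads(corners)
flipped = flatten_quads(corners, page_height=100.0)
```

The corner order is passed through unchanged, so this only addresses the number of values per quad and, optionally, the direction of the y axis.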

Upvotes: -2

Views: 290

Answers (1)

johnwhitington

Reputation: 2763

The original annotation is:

    {
      "/Veeva.Vault.Annot": {
        "U": "{\"instanceId\":1296,\"docVersionId\":223360,\"annotateKeyDate\":\"2020-06-30\",\"annotateKeyCode\":\"eM3jyxyu\",\"noteId\":\"1593513852533\",\"userId\":4744685}"
      },
      "/F": { "I": 128 },
      "/C": 155,
      "/Contents": {
        "U": "Anchor Name: Definition of fistula - Alofisel Scientific Communications Platform (2.0) p.2, Alofisel Scientific Communications Platform (2.0) p.8, Alofisel Scientific Platform (0.7) p.2, Alofisel Scientific Platform - Pillar 1 - Unmet needs and burden of disease (1.0) p.2, Alofisel Scientific Platform (0.6) p.8, Alofisel Scientific Platform (0.5) p.2, Alofisel Scientific Platform (0.5) p.8, Alofisel Scientific Communications Platform (0.10) p.8, Alofisel Scientific Platform (0.7) p.8, Voiceover script for animation (0.4) p.1, Alofisel Scientific Platform - Pillar 1 (0.3) p.2, Alofisel Scientific Communications Platform (1.0) p.2, Voiceover script for animation (0.2) p.1, Alofisel Scientific Communications platform - glossary (2.0) p.8, Alofisel Scientific Communications Platform (1.0) p.8, Alofisel Scientific Platform (0.3) p.2, Voiceover script for animation (0.5) p.1, Alofisel Scientific Platform - Pillar 1 - Unmet needs and burden of disease (1.0) p.2, Darvadstrocel_RWE-Daten_22 (0.2) p.5, Darvadstrocel_RWE-Daten_22 (1.0) p.5, Alofisel Scientific Platform (0.6) p.2, Crohn's perianal fistulas - Pathophysiology animation for congress use (1.0) p.1, Alofisel Scientific Platform (0.8) p.8, Voiceover script for animation (0.5) p.1, Alofisel Scientific Platform (0.8) p.2, Voiceover script for animation (1.0) p.1, Alofisel Scientific Platform (0.4) p.8, Voiceover script for animation (1.0) p.1, Alofisel Scientific Platform - Pillar 1 (0.4) p.2, Alofisel Scientific Platform (0.2) p.2, Crohn's perianal fistulas - Pathophysiology animation for congress use (0.2) p.1, Alofisel Scientific Platform (0.4) p.2, Darvadstrocel_RWE-Daten_22 (0.3) p.5, Voiceover script for animation (0.3) p.1, Voiceover script for animation (0.4) p.1, Alofisel Scientific Platform - Pillar 1 (0.2) p.2, Alofisel Scientific Platform (0.2) p.5, Alofisel Scientific Platform (0.9) p.8, Alofisel Scientific Communications Platform (0.10) p.2, Alofisel Scientific Platform - Pillar 1 (0.4) 
p.2, Alofisel Scientific Platform (0.3) p.5, Alofisel Scientific Platform - Pillar 1 (0.3) p.2, Alofisel Scientific Platform (0.9) p.2"
      },
      "/M": { "U": "D:20200630104413+00'00'" },
      "/NM": { "U": "1593513852533" },
      "/CA": { "F": 1.0 },
      "/Subj": {
        "U": "A perianal ﬁstula (Latin for pipe ) is a chronic track of granulation tissue connecting 2 epithe- lial lined surfaces .1"
      },
      "/T": { "U": "Kate Herring" },
      "/Rotate": { "I": 0 },
      "/Rect": [ { "F": 315.676 }, { "F": 282.183 }, { "F": 394.806 }, { "F": 291.863 } ],
      "/QuadPoints":    [
      { "F": 396.603 },
      { "F": 317.863 },
      { "F": 555.722 },
      { "F": 317.863 },
      { "F": 396.603 },
      { "F": 308.183 },
      { "F": 555.722 },
      { "F": 308.183 },
      { "F": 315.676 },
      { "F": 304.863 },
      { "F": 555.7329999999999 },
      { "F": 304.863 },
      { "F": 315.676 },
      { "F": 295.183 },
      { "F": 555.7329999999999 },
      { "F": 295.183 },
      { "F": 315.676 },
      { "F": 291.863 },
      { "F": 394.806 },
      { "F": 291.863 },
      { "F": 315.676 },
      { "F": 282.183 },
      { "F": 394.806 },
      { "F": 282.183 }
    ],
      "/Subtype": { "N": "/Highlight" },
      "/Type": { "N": "/Annot" }
    }

The re-added annotation is:

    {
      "/T": { "U": "Kate Herring" },
      "/CA": { "I": 1 },
      "/CreationDate": { "U": "D:20230329191605Z00'00" },
      "/QuadPoints": [
        { "F": 387.666657 },
        { "I": 483 },
        { "I": 547 },
        { "I": 483 },
        { "F": 387.666657 },
        { "I": 493 },
        { "I": 547 },
        { "I": 493 },
        { "I": 307 },
        { "F": 496.333344 },
        { "I": 547 },
        { "F": 496.333344 },
        { "I": 307 },
        { "F": 505.666657 },
        { "I": 547 },
        { "F": 505.666657 },
        { "I": 307 },
        { "I": 509 },
        { "F": 385.666657 },
        { "I": 509 },
        { "I": 307 },
        { "I": 519 },
        { "F": 385.666657 },
        { "I": 519 }
      ],
      "/AP": { "/N": 351 },
      "/Popup": { "I": 1 },
      "/C": [ { "F": 0.972549 }, { "F": 0.819608 }, { "F": 0.278431 } ],
      "/Rect": [
        { "F": 306.6875 },
        { "F": 482.6875 },
        { "F": 547.3125 },
        { "F": 519.3125 }
      ],
      "/M": { "U": "D:20230329191605Z00'00" },
      "/F": { "I": 4 },
      "/P": { "I": 1 },
      "/Type": { "N": "/Annot" },
      "/Subtype": { "N": "/Highlight" }
    }

As you can see, the /QuadPoints and the (older, less accurate) /Rect are wrong. Equally noticeable is that several other things are different: /F changes from 128 to 4; /AP, /Popup, /P and /CreationDate are new; and /Contents, /NM, /Subj, /Rotate and /Veeva.Vault.Annot are gone. In other words, the round-trip process seems to have happened at a higher level of abstraction, or with a tool which likes to change things.

So I'm afraid I don't have an answer for you, but I hope this extra information helps in some way.
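
If it helps, here is a small sketch that compares the two /QuadPoints arrays number by number. It uses only the values from the dumps above; the variable names are mine, and the interpretation is a guess rather than a diagnosis.

```python
# /QuadPoints from the original annotation, copied from the first dump
original = [
    396.603, 317.863, 555.722, 317.863, 396.603, 308.183, 555.722, 308.183,
    315.676, 304.863, 555.7329999999999, 304.863,
    315.676, 295.183, 555.7329999999999, 295.183,
    315.676, 291.863, 394.806, 291.863, 315.676, 282.183, 394.806, 282.183,
]

# /QuadPoints from the re-added annotation, copied from the second dump
readded = [
    387.666657, 483, 547, 483, 387.666657, 493, 547, 493,
    307, 496.333344, 547, 496.333344, 307, 505.666657, 547, 505.666657,
    307, 509, 385.666657, 509, 307, 519, 385.666657, 519,
]

# Even indices are x values, odd indices are y values
x_shifts = [o - r for o, r in zip(original[0::2], readded[0::2])]
y_sums = [o + r for o, r in zip(original[1::2], readded[1::2])]

print("x shift range:", min(x_shifts), max(x_shifts))
print("y sum range:  ", min(y_sums), max(y_sums))
```

The x values differ by a nearly constant offset (roughly 8.7 to 9.1), and each pair of y values sums to nearly the same number (roughly 800.8 to 801.2). That pattern would be consistent with the y coordinates being measured from the opposite edge of the page, plus a small offset and some rounding, but the numbers alone do not show which step introduced it.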

Upvotes: 2
