Reputation: 708
I am trying to extract annotations from a PDF and then use that data to cherry-pick the annotations we need and display them on a clean version of the PDF using the Adobe PDF Embed API. Extracting the data with PyMuPDF works fine, but when we apply an annotation to the clean PDF the result does not match the source.
First, parse the annotated PDF: Sandborn2003Annotated.pdf
Then, using the Adobe Document Cloud Embed API (https://developer.adobe.com/document-services/docs/overview/pdf-embed-api/), apply the first annotation in the stored JSON to a clean version of the file, Sandborn2003.pdf
*** Edit *** This is the output file: Sandborn 2003 (1).pdf
I am expecting the clean PDF to take on the same annotation as the source.
Screenshots of Before (Source) and After (Target)
Source Annotated Document
Target PDF after applying the annotation
This is my code:
import fitz
import json
import sys
from datetime import datetime

if len(sys.argv) != 2:
    print('Usage: python extractPDFAnnotations.py <filename>')
    sys.exit(1)

filename = sys.argv[1]
doc = fitz.open('Sandborn2003Annotated.pdf')
annotations = []

for page_num, page in enumerate(doc):
    for annot in page.annots():
        annotation_data = {}
        target_data = {}
        selector_data = {}

        # Common properties for all annotation types
        annotation_data["@context"] = [
            "https://www.w3.org/ns/anno.jsonld",
            "https://comments.acrobat.com/ns/anno.jsonld",
        ]
        annotation_data["id"] = annot.info.get("id", "")
        annotation_data["type"] = "Annotation"
        annotation_data["motivation"] = "commenting"
        annotation_data["bodyValue"] = annot.info.get("content", "")

        # Target properties
        # Replace this with the actual source identifier
        target_data["selector"] = selector_data
        annotation_data["target"] = target_data

        # Selector properties
        selector_data["node"] = {"index": page_num}
        selector_data["type"] = "AdobeAnnoSelector"

        if annot.type[0] == 8:  # Highlight annotation
            all_coordinates = annot.vertices
            highlights = []
            for i in range(0, len(all_coordinates), 4):
                quad = all_coordinates[i:i + 4]
                highlight_coord = fitz.Quad(quad).rect
                highlights.extend(
                    [highlight_coord.x0, highlight_coord.y0, highlight_coord.x1, highlight_coord.y1])
            selector_data["quadPoints"] = highlights
            # Adjust the opacity value as needed
            selector_data["opacity"] = 0.4
            selector_data["subtype"] = "highlight"
            selector_data["boundingBox"] = [
                annot.rect.x0,
                annot.rect.y0,
                annot.rect.x1,
                annot.rect.y1,
            ]
            # Adjust the stroke color as needed
            selector_data["strokeColor"] = "#fccb00"
            # Adjust the stroke width as needed
            selector_data["strokeWidth"] = 3

        # Creator properties
        annotation_data["creator"] = {
            # Replace this with the actual creator's name
            "name": annot.info.get("title", ""),
            "type": "Person",
        }
        annotations.append(annotation_data)

# Save the annotations to a JSON file
with open(filename, 'w') as f:
    json.dump(annotations, f, indent=4)
print(f'Annotations saved to {filename}')
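One detail that may be worth checking in the script above: `fitz.Quad(quad).rect` reduces each group of four vertices to four numbers (x0, y0, x1, y1), whereas a PDF /QuadPoints array holds eight numbers per quad. The sketch below keeps all eight. The helper name `flatten_quads` and its optional `flip_height` parameter are hypothetical, and the sketch assumes `annot.vertices` yields (x, y) pairs. Whether the Embed API wants this point order, or a bottom-left origin, is an assumption to verify rather than something the script confirms.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def flatten_quads(vertices: Sequence[Point],
                  flip_height: Optional[float] = None) -> List[float]:
    """Flatten (x, y) vertices into a flat list with 8 numbers per quad.

    vertices    -- points in groups of four, one group per highlighted line
    flip_height -- hypothetical option: if given, each y becomes
                   flip_height - y (top-left origin to bottom-left origin)
    """
    if len(vertices) % 4 != 0:
        raise ValueError("expected vertices in groups of four")
    flat: List[float] = []
    for x, y in vertices:
        if flip_height is not None:
            y = flip_height - y
        flat.extend([float(x), float(y)])
    return flat


# Made-up example quad, not taken from the PDF in question.
example = [(100.0, 50.0), (200.0, 50.0), (100.0, 60.0), (200.0, 60.0)]
unflipped = flatten_quads(example)
flipped = flatten_quads(example, flip_height=800.0)
print(unflipped)
print(flipped)
```

In the script this would be called as `flatten_quads(annot.vertices)` in place of the loop that builds `highlights`; the right value for `flip_height`, if any, depends on the page and is not known from the question.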
Upvotes: -2
Views: 290
Reputation: 2763
The original annotation is:
{
"/Veeva.Vault.Annot": {
"U": "{\"instanceId\":1296,\"docVersionId\":223360,\"annotateKeyDate\":\"2020-06-30\",\"annotateKeyCode\":\"eM3jyxyu\",\"noteId\":\"1593513852533\",\"userId\":4744685}"
},
"/F": { "I": 128 },
"/C": 155,
"/Contents": {
"U": "Anchor Name: Definition of fistula - Alofisel Scientific Communications Platform (2.0) p.2, Alofisel Scientific Communications Platform (2.0) p.8, Alofisel Scientific Platform (0.7) p.2, Alofisel Scientific Platform - Pillar 1 - Unmet needs and burden of disease (1.0) p.2, Alofisel Scientific Platform (0.6) p.8, Alofisel Scientific Platform (0.5) p.2, Alofisel Scientific Platform (0.5) p.8, Alofisel Scientific Communications Platform (0.10) p.8, Alofisel Scientific Platform (0.7) p.8, Voiceover script for animation (0.4) p.1, Alofisel Scientific Platform - Pillar 1 (0.3) p.2, Alofisel Scientific Communications Platform (1.0) p.2, Voiceover script for animation (0.2) p.1, Alofisel Scientific Communications platform - glossary (2.0) p.8, Alofisel Scientific Communications Platform (1.0) p.8, Alofisel Scientific Platform (0.3) p.2, Voiceover script for animation (0.5) p.1, Alofisel Scientific Platform - Pillar 1 - Unmet needs and burden of disease (1.0) p.2, Darvadstrocel_RWE-Daten_22 (0.2) p.5, Darvadstrocel_RWE-Daten_22 (1.0) p.5, Alofisel Scientific Platform (0.6) p.2, Crohn's perianal fistulas - Pathophysiology animation for congress use (1.0) p.1, Alofisel Scientific Platform (0.8) p.8, Voiceover script for animation (0.5) p.1, Alofisel Scientific Platform (0.8) p.2, Voiceover script for animation (1.0) p.1, Alofisel Scientific Platform (0.4) p.8, Voiceover script for animation (1.0) p.1, Alofisel Scientific Platform - Pillar 1 (0.4) p.2, Alofisel Scientific Platform (0.2) p.2, Crohn's perianal fistulas - Pathophysiology animation for congress use (0.2) p.1, Alofisel Scientific Platform (0.4) p.2, Darvadstrocel_RWE-Daten_22 (0.3) p.5, Voiceover script for animation (0.3) p.1, Voiceover script for animation (0.4) p.1, Alofisel Scientific Platform - Pillar 1 (0.2) p.2, Alofisel Scientific Platform (0.2) p.5, Alofisel Scientific Platform (0.9) p.8, Alofisel Scientific Communications Platform (0.10) p.2, Alofisel Scientific Platform - Pillar 1 (0.4) p.2, 
Alofisel Scientific Platform (0.3) p.5, Alofisel Scientific Platform - Pillar 1 (0.3) p.2, Alofisel Scientific Platform (0.9) p.2"
},
"/M": { "U": "D:20200630104413+00'00'" },
"/NM": { "U": "1593513852533" },
"/CA": { "F": 1.0 },
"/Subj": {
"U": "A perianal fistula (Latin for pipe ) is a chronic track of granulation tissue connecting 2 epithe- lial lined surfaces .1"
},
"/T": { "U": "Kate Herring" },
"/Rotate": { "I": 0 },
"/Rect": [ { "F": 315.676 }, { "F": 282.183 }, { "F": 394.806 }, { "F": 291.863 } ],
"/QuadPoints": [
{ "F": 396.603 },
{ "F": 317.863 },
{ "F": 555.722 },
{ "F": 317.863 },
{ "F": 396.603 },
{ "F": 308.183 },
{ "F": 555.722 },
{ "F": 308.183 },
{ "F": 315.676 },
{ "F": 304.863 },
{ "F": 555.7329999999999 },
{ "F": 304.863 },
{ "F": 315.676 },
{ "F": 295.183 },
{ "F": 555.7329999999999 },
{ "F": 295.183 },
{ "F": 315.676 },
{ "F": 291.863 },
{ "F": 394.806 },
{ "F": 291.863 },
{ "F": 315.676 },
{ "F": 282.183 },
{ "F": 394.806 },
{ "F": 282.183 }
],
"/Subtype": { "N": "/Highlight" },
"/Type": { "N": "/Annot" }
}
The re-added annotation is:
{
"/T": { "U": "Kate Herring" },
"/CA": { "I": 1 },
"/CreationDate": { "U": "D:20230329191605Z00'00" },
"/QuadPoints": [
{ "F": 387.666657 },
{ "I": 483 },
{ "I": 547 },
{ "I": 483 },
{ "F": 387.666657 },
{ "I": 493 },
{ "I": 547 },
{ "I": 493 },
{ "I": 307 },
{ "F": 496.333344 },
{ "I": 547 },
{ "F": 496.333344 },
{ "I": 307 },
{ "F": 505.666657 },
{ "I": 547 },
{ "F": 505.666657 },
{ "I": 307 },
{ "I": 509 },
{ "F": 385.666657 },
{ "I": 509 },
{ "I": 307 },
{ "I": 519 },
{ "F": 385.666657 },
{ "I": 519 }
],
"/AP": { "/N": 351 },
"/Popup": { "I": 1 },
"/C": [ { "F": 0.972549 }, { "F": 0.819608 }, { "F": 0.278431 } ],
"/Rect": [
{ "F": 306.6875 },
{ "F": 482.6875 },
{ "F": 547.3125 },
{ "F": 519.3125 }
],
"/M": { "U": "D:20230329191605Z00'00" },
"/F": { "I": 4 },
"/P": { "I": 1 },
"/Type": { "N": "/Annot" },
"/Subtype": { "N": "/Highlight" }
}
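As you can see, the /QuadPoints and the (older, less accurate) /Rect do not match the original: the y values now run from 483 to 519 instead of 282 to 318, and the x values are lower by about 9. Several other entries differ as well: /C is now an RGB array, /F has changed from 128 to 4, /CreationDate, /AP, /Popup and /P have been added, and /Contents, /Subj, /NM, /Rotate and /Veeva.Vault.Annot are gone. In other words, the round trip seems to have happened at a higher level of abstraction, or with a tool that likes to change things.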
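To put numbers on the coordinate difference, the two /QuadPoints arrays can be compared point by point. The sketch below copies the values from the two dumps above; the variable and function names are only illustrative. It does not identify the cause. It only shows that each original y plus the corresponding re-added y is nearly constant, and that each x is shifted by a similar amount, which would be consistent with a vertical flip plus a small offset somewhere in the round trip (an assumption, not a confirmed diagnosis).

```python
# /QuadPoints copied from the original annotation dump.
original = [
    396.603, 317.863, 555.722, 317.863, 396.603, 308.183, 555.722, 308.183,
    315.676, 304.863, 555.7329999999999, 304.863,
    315.676, 295.183, 555.7329999999999, 295.183,
    315.676, 291.863, 394.806, 291.863, 315.676, 282.183, 394.806, 282.183,
]

# /QuadPoints copied from the re-added annotation dump.
readded = [
    387.666657, 483, 547, 483, 387.666657, 493, 547, 493,
    307, 496.333344, 547, 496.333344, 307, 505.666657, 547, 505.666657,
    307, 509, 385.666657, 509, 307, 519, 385.666657, 519,
]


def to_points(flat):
    """Turn [x0, y0, x1, y1, ...] into [(x0, y0), (x1, y1), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


paired = list(zip(to_points(original), to_points(readded)))

# How far each x moved, and what each pair of y values adds up to.
x_shifts = [xo - xr for (xo, _), (xr, _) in paired]
y_sums = [yo + yr for (_, yo), (_, yr) in paired]

print(f"x shift: {min(x_shifts):.3f} to {max(x_shifts):.3f}")
print(f"y sum:   {min(y_sums):.3f} to {max(y_sums):.3f}")
```

The x shifts stay between roughly 8.7 and 9.1, and the y sums between roughly 800.8 and 801.2, so the mismatch looks systematic rather than random.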
So I'm afraid I don't have an answer for you, but I hope this extra information helps in some way.
Upvotes: 2