Harsh Donga

Reputation: 31

Can I tag a PDF programmatically?

Can an unstructured PDF be tagged using any tools or libraries? The only ways I have found to tag a PDF are Adobe Acrobat and the Auto-Tag APIs, which I would rather not use and which, in my opinion, do not give great results.

I know the bounding boxes and semantics of the elements (i.e., paragraphs, lists, headings, tables).

So, is there a way to manipulate PDF trees/objects, preferably in Python or JavaScript?

Any thoughts on the topic are appreciated!

The PDF spec talks about "StructTreeRoot" for Tagged PDFs. Building these objects by hand at a low level would be nerve-racking, so is there any high-level library to manipulate them?

Upvotes: 2

Views: 2250

Answers (2)

Sam Piston

Reputation: 11

PDFix offers a free service covering some of its major features. Its autotagging is based on their internal algorithm, which is customizable.

https://pdfix.io/add-tags-to-pdf/

It can be used from various languages or from the CLI.

For Python users, here's an example of using an AI object detection model to autotag PDF content.

https://github.com/pdfix/pdfix-autotag-deepdoctection

Upvotes: 0

K J

Reputation: 11940

At this time there is a good overview at https://commonlook.com/auto-tagging-pdfs/

Conclusion
Automated tagging solutions can be helpful to get the process started, but, in the end, none of them are perfect, some are downright lousy, and you’re most likely going to have to at least manually verify some stuff and probably have to fix a lot, too.

Tagging a PDF, with all that entails, needs to be done by the PDF writer. So here is this page as tagged by MS Edge; alternatively, you can use any Chromium/Foxit/Skia-based writer (e.g. Chrome or Chromium Portable).

"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" --headless --print-to-pdf=C:\data\output.pdf --virtual-time-budget=1000 https://stackoverflow.com/questions/75483409/can-i-tag-a-pdf-programmatically/75500169

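Since the question asks about Python, the same command can be driven from a script. This is only a minimal sketch: the function names `build_print_command` and `print_to_pdf` are made up for illustration, the browser path is the default one from the command above and may differ on your machine, and the sketch only wraps the exact flags already shown. Whether the output is tagged still depends entirely on the browser build doing the writing.

```python
import subprocess
from pathlib import Path

# Default install path copied from the command above (an assumption;
# adjust it for your own machine or point it at Chrome/Chromium).
EDGE = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"


def build_print_command(url, output_pdf, browser=EDGE, budget_ms=1000):
    """Return the argument list for a headless print-to-PDF run.

    Only the flags from the command line above are used.
    """
    return [
        browser,
        "--headless",
        f"--print-to-pdf={output_pdf}",
        f"--virtual-time-budget={budget_ms}",
        url,
    ]


def print_to_pdf(url, output_pdf, **kwargs):
    """Run the browser and return the output path.

    Raises CalledProcessError if the browser exits with an error,
    and FileNotFoundError if the browser executable is not found.
    """
    cmd = build_print_command(url, output_pdf, **kwargs)
    subprocess.run(cmd, check=True)
    return Path(output_pdf)
```

Calling `print_to_pdf("https://stackoverflow.com/questions/75483409/can-i-tag-a-pdf-programmatically/75500169", r"C:\data\output.pdf")` should then be equivalent to the one-liner. Passing the arguments as a list avoids the shell quoting problems with the space in "Program Files".
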
Consider how impossible this may be to do retrospectively, word by word or even a sentence or paragraph at a time, as PDF does not inherently have such constructs. Things like H1 are discarded by the paper printout generator as superfluous bloat that a printer does not need.


OK, the prime reason for tagging is the reader with impaired abilities, so with a tagged PDF let's see how it fares. Here we are dealing with only one simple page without images or tables (the two most common reasons for checking tags).


So how, programmatically, will an iterative application driven by Python resolve the residual requirements which are missing?

Language: as a human I know the language is English (that should have been obvious to a browser that speaks aloud).

Title: the title is missing, but again that should be obvious. Is "TAGGING PDFS" suitable as a working title, for approval by another person? Let's temporarily ignore the major errors that the tagging and the tab order are wrong. A human with eyes and a brain to analyse why can fix those as they progress through all the human aspects of the pages. Asking "can the human read and navigate logically?" will itself resolve the tag order and, at the same time, check whether the page has suitable visual contrast for visually challenged readers.

So the tagging of a PDF is best done at the time a human completes their retrospective use of the page, and that is best done using "Pre-flight" "Post-flight" GUI applications, such as Acrobat.

Upvotes: 1
