SuperNova

Reputation: 27466

Tag content in pdf

I have a PDF that looks like the one below, and I want to tag the paragraph as 'paragraph'. I have searched a lot: there are ways to create a tagged PDF from scratch or to convert HTML content to a tagged PDF, but I have not succeeded in tagging an existing PDF.

Given the coordinates, can I tag content in a PDF? In this example, I want to tag the paragraph with a paragraph tag. Thanks.

**A sample pdf**

1. Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper
suscipit lobortis nisl ut aliquip ex ea commodo consequat.

Upvotes: 1

Views: 569

Answers (1)

Joris Schellekens

Reputation: 9012

PDF is not a WYSIWYG format.
Just because you can see a paragraph does not mean a computer program can see it.

In fact, an untagged PDF might look like this (pseudo-pdf-code):

go to location 10, 700
set the active font to Times New Roman
set the fontsize to 12
set the color to black
draw the glyph 'H'
go to coordinate 10, 680
draw the glyphs 'Lorem'

As you can tell from the example, instructions don't need to draw the text in reading order.
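To illustrate the point about reading order, here is a minimal Python sketch. It assumes you have already extracted text fragments with their coordinates using some PDF library; the `TextChunk` type, the `reading_order` function and the `line_tolerance` value are hypothetical names chosen for this example, not part of iText or any other library.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """A hypothetical extracted fragment: position plus the text drawn there."""
    x: float
    y: float
    text: str


def reading_order(chunks, line_tolerance=2.0):
    """Group chunks into lines (top to bottom), each sorted left to right.

    PDF user space has y growing upward, so a larger y is higher on the page.
    Chunks whose y values differ by at most line_tolerance share a line.
    """
    lines = []
    for chunk in sorted(chunks, key=lambda c: -c.y):
        if lines and abs(lines[-1][0].y - chunk.y) <= line_tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])
    return [sorted(line, key=lambda c: c.x) for line in lines]


# Fragments in the order the drawing instructions happened to emit them.
chunks = [
    TextChunk(10, 680, "Lorem"),
    TextChunk(10, 700, "H"),
    TextChunk(50, 680, "ipsum"),
]
lines = reading_order(chunks)
texts = [" ".join(c.text for c in line) for line in lines]
```

This only handles a single column of horizontal text; multi-column layouts, rotated text and tables need considerably more work.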

So the first challenge you're facing is identifying paragraphs. I worked at iText and have talked to various people at Adobe: recognizing structure in an untagged PDF document is not considered an easy problem.

Once you do have this structure (to the level of 'these glyphs make up a line', 'these lines make up a paragraph', etc.), it's a matter of creating a StructureTree.
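As a rough idea of what 'these lines make up a paragraph' could mean in code, the sketch below splits lines into paragraphs wherever the vertical gap between two consecutive baselines is larger than a threshold. This is a simple heuristic I am assuming for illustration, not how any particular library does it; `group_paragraphs` and `max_gap` are made-up names, and the threshold would have to be tuned to the font size and leading of your document.

```python
def group_paragraphs(line_ys, max_gap=16.0):
    """Group line indices into paragraphs based on vertical spacing.

    line_ys holds the baseline y coordinate of each line, in reading order
    (top to bottom, so the values decrease). A gap larger than max_gap
    between consecutive lines starts a new paragraph.
    """
    paragraphs = []
    for i, y in enumerate(line_ys):
        if paragraphs and line_ys[i - 1] - y <= max_gap:
            paragraphs[-1].append(i)
        else:
            paragraphs.append([i])
    return paragraphs


# Three lines spaced 14 units apart, a larger gap, then two more lines.
baselines = [700, 686, 672, 640, 626]
paragraphs = group_paragraphs(baselines)
```

Real documents also signal paragraphs through indentation, font changes or list markers, which this sketch ignores.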

But since this use case (re-tagging a PDF) was never thought possible, iText (or, to my knowledge, any other PDF library) isn't really designed to let you do this easily.

A tag is part of a separate data structure inside the PDF. Tags can have children (for instance, to indicate 'this paragraph contains these lines'). A tag references the objects (groups of instructions) that are part of it.

So you might have:

  • these instructions (to render a line of text) make up a word and form an object
  • these word objects are aggregated (by a tag) into a line
  • a few line tags are aggregated (by another tag) into a paragraph
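The hierarchy above can be modelled roughly as follows. This is an in-memory illustration only, not the API of iText or any other library: the `Tag` class and its fields are hypothetical. The role names "P" and "Span" are standard structure types from the PDF spec, and the integer ids stand in for the identifiers that link a tag to a group of drawing instructions in the page content.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """A hypothetical node in a tag tree."""
    role: str                                         # e.g. "P" or "Span"
    children: list = field(default_factory=list)      # nested tags
    content_ids: list = field(default_factory=list)   # referenced instruction groups

    def all_content_ids(self):
        """Collect the ids referenced by this tag and all of its descendants."""
        ids = list(self.content_ids)
        for child in self.children:
            ids.extend(child.all_content_ids())
        return ids


# Two lines, each referencing the instruction groups that draw its words,
# aggregated into one paragraph tag.
line1 = Tag("Span", content_ids=[0, 1])
line2 = Tag("Span", content_ids=[2])
paragraph = Tag("P", children=[line1, line2])
covered = paragraph.all_content_ids()
```

Writing such a tree into an existing PDF additionally requires wrapping the corresponding instructions in the content stream so the ids have something to point at, which is the part libraries do not make easy.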

For a thorough understanding, I recommend reading the PDF spec.

Upvotes: 2
