codingscientist

Reputation: 1154

How to detect hidden text in a PDF

We are extracting text from PDFs using iText/PDFBox, but additional text that is invisible in the PDF also gets extracted. Is there any method and/or tool to get rid of this hidden text?

Upvotes: 2

Views: 6311

Answers (1)

Andrew Cash

Reputation: 2394

There are many different ways to add hidden text, including:

  1. Text placed on a hidden / invisible / locked optional content group (OCG) layer
  2. White text colour on an OCG
  3. 100% transparent text
  4. ???

Each PDF may use a different method, and to separate the hidden text from the visible text you may need to know how the hidden text was implemented.

Does iText have an option to return the text colour? If it does, you can try ignoring white text objects.
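As a rough illustration of the checks listed above, here is a minimal sketch of a filter over extracted text runs. `TextRun` and `HiddenTextFilter` are hypothetical names invented for this example, not iText or PDFBox classes, and the thresholds are assumptions (including the assumption that the page background is white). How you fill in the fields depends on your library and version; as far as I know, iText 5's `TextRenderInfo` offers `getFillColor()` and `getTextRenderMode()`, but check that against the version you are using.

```java
import java.util.List;

/** Hypothetical model of one extracted text run; not an iText or PDFBox class. */
final class TextRun {
    final String text;
    final float red, green, blue;  // fill colour components, each in 0..1
    final float fillAlpha;         // 0 = fully transparent, 1 = opaque
    final int renderMode;          // PDF text rendering mode; 3 = neither fill nor stroke
    final boolean layerVisible;    // false when the run sits on a hidden OCG layer

    TextRun(String text, float red, float green, float blue,
            float fillAlpha, int renderMode, boolean layerVisible) {
        this.text = text;
        this.red = red;
        this.green = green;
        this.blue = blue;
        this.fillAlpha = fillAlpha;
        this.renderMode = renderMode;
        this.layerVisible = layerVisible;
    }
}

/** Hypothetical helper that drops runs matching any of the hiding methods. */
final class HiddenTextFilter {
    // Assumed thresholds; tune them for your documents.
    private static final float WHITE_THRESHOLD = 0.99f;
    private static final float ALPHA_THRESHOLD = 0.01f;
    private static final int INVISIBLE_RENDER_MODE = 3;

    static boolean isHidden(TextRun run) {
        // Assumes a white page background, so near-white fill counts as hidden.
        boolean white = run.red >= WHITE_THRESHOLD
                && run.green >= WHITE_THRESHOLD
                && run.blue >= WHITE_THRESHOLD;
        return !run.layerVisible
                || run.renderMode == INVISIBLE_RENDER_MODE
                || run.fillAlpha <= ALPHA_THRESHOLD
                || white;
    }

    /** Concatenates the text of every run that is not classified as hidden. */
    static String visibleText(List<TextRun> runs) {
        StringBuilder sb = new StringBuilder();
        for (TextRun run : runs) {
            if (!isHidden(run)) {
                sb.append(run.text);
            }
        }
        return sb.toString();
    }
}
```

Note that scanned PDFs with an OCR text layer commonly use render mode 3 for legitimate text, so you may want to make that particular check optional.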

Upvotes: 2
