PDFBox: Detecting the highlighted text in a given page

Question

PDFBox version 2.0.20

I'm trying to detect the highlighted text (the text that appears inside the black boxes on pages 5, 6, 7 and 9) in the following PDF:

https://www.courthousenews.com/wp-content/uploads/2019/01/Manafort-response.pdf

I tried the solution proposed in this comment, but it did not detect them. For example, page.getAnnotations() returns an empty list. Any idea how to detect them?

K J · Accepted Answer

There is no need to detect them: the original text is still there. This is a classic case of redaction failure, and it makes no difference whether the highlight is opaque black or see-through yellow. Just copy and paste the text, or export the pages as plain text.

Here we can see there is no direct relationship between the black rectangles ("paths") and the text below them. They are independent objects on the page. Only good downstream processing could marry them together.

The zone of interest is a region made up of multiple rectangles with ragged edges. You would have to match any text that lies within or overlaps that zone, and text can be clipped between inside and outside in many different ways, which is why redaction so commonly fails. It sounds like one big challenge that requires lots and lots of honing.
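To illustrate the "marrying" step only, here is a minimal sketch that uses nothing but the JDK. It assumes you have already collected the black rectangles and the text bounding boxes in the same coordinate system. With PDFBox, one possible source for these is a PDFGraphicsStreamEngine subclass for the rectangles and the TextPosition objects seen by a PDFTextStripper for the text, but that wiring is not shown here. The names RedactionOverlap, TextBox, coveredFraction and coveredText, and the 0.5 threshold, are made up for this example.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RedactionOverlap {

    /** Hypothetical holder: a piece of text plus its bounding box. */
    static final class TextBox {
        final String text;
        final Rectangle2D bounds;

        TextBox(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    /** Fraction (0..1) of the text box area that lies under one rectangle. */
    static double coveredFraction(Rectangle2D textBounds, Rectangle2D rect) {
        Rectangle2D overlap = textBounds.createIntersection(rect);
        if (overlap.isEmpty()) {
            return 0.0; // no overlap gives a non-positive width or height
        }
        double area = textBounds.getWidth() * textBounds.getHeight();
        return area <= 0 ? 0.0 : (overlap.getWidth() * overlap.getHeight()) / area;
    }

    /**
     * Returns the texts whose best single-rectangle coverage reaches minFraction.
     * Simplification: coverage by several rectangles is not added up.
     */
    static List<String> coveredText(List<TextBox> texts,
                                    List<Rectangle2D> blackRects,
                                    double minFraction) {
        List<String> hits = new ArrayList<>();
        for (TextBox t : texts) {
            double best = 0.0;
            for (Rectangle2D r : blackRects) {
                best = Math.max(best, coveredFraction(t.bounds, r));
            }
            if (best >= minFraction) {
                hits.add(t.text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented coordinates, purely for demonstration.
        List<Rectangle2D> rects = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(100, 700, 200, 12),
                new Rectangle2D.Double(72, 686, 150, 12));
        List<TextBox> texts = Arrays.asList(
                new TextBox("visible", new Rectangle2D.Double(72, 700, 25, 12)),
                new TextBox("hidden", new Rectangle2D.Double(105, 700, 30, 12)),
                new TextBox("partly", new Rectangle2D.Double(215, 686, 30, 12)));
        System.out.println(coveredText(texts, rects, 0.5));
    }
}
```

With the 0.5 threshold only "hidden" is reported: "partly" overlaps a rectangle by less than a quarter of its area and is dropped. That threshold is exactly the kind of thing that needs honing on real pages, where rectangles are ragged and glyph boxes straddle their edges.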

[Later Edit]

The PDFBox team can give advice, and @TilmanHausherr suggested starting by looking at the question "pdfbox 2.0.2 > Calling of PageDrawer.processPage method caught exceptions".
