Reputation: 1950
PDFBox version 2.0.20
I'm trying to detect the highlighted text (it appears inside the black boxes on pages 5, 6, 7, and 9) in the following PDF:
I tried the solution proposed in this comment, but it did not detect them. For example, page.getAnnotations()
returns an empty list. Any idea how to detect them?
Upvotes: 0
Views: 240
Reputation: 11821
There is no need to detect them: the original text is still there. This is a classic case of redaction failure, and it does not matter whether the highlight is opaque black or see-through yellow. Just copy and paste the text, or export the pages as plain text.
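Here we can see there is no direct relationship between the black rectangles (drawn as "paths") and the text that lies below them. They are independent objects in the page content rather than annotations, which would explain why page.getAnnotations() returns nothing. Only good downstream processing could marry them together.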
The zone of interest is a region made up of multiple rectangles with ragged edges. You would have to match any text that lies within or overlaps that zone, and also decide how to clip text that is partly inside and partly outside it, which is the reason redaction so commonly fails. It sounds like one big challenge that will need lots and lots of honing.
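As a rough sketch of that matching step only, here is one way to test text boxes against black rectangles using plain java.awt.geom, with no PDFBox calls. It assumes you have already extracted the text bounding boxes and the filled rectangles yourself, and that both are in the same coordinate space. The class name RedactionOverlap, its method names, and the 0.5 threshold are all hypothetical choices for illustration. It also simplifies the ragged-edge problem by taking the best coverage from any single rectangle rather than the union of all of them.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not part of PDFBox, just geometry on already-extracted boxes.
class RedactionOverlap {

    /** Fraction (0.0 to 1.0) of the text box area that one black rectangle covers. */
    static double coveredFraction(Rectangle2D textBox, Rectangle2D blackRect) {
        double area = textBox.getWidth() * textBox.getHeight();
        if (area <= 0) {
            return 0.0;
        }
        Rectangle2D overlap = textBox.createIntersection(blackRect);
        if (overlap.isEmpty()) {
            return 0.0; // no intersection at all
        }
        return (overlap.getWidth() * overlap.getHeight()) / area;
    }

    /**
     * Indices of text boxes that at least one black rectangle covers by
     * `threshold` or more. Simplification: coverage is judged per rectangle,
     * so text hidden only by a combination of rectangles is missed.
     */
    static List<Integer> hiddenIndices(List<Rectangle2D> textBoxes,
                                       List<Rectangle2D> blackRects,
                                       double threshold) {
        List<Integer> hidden = new ArrayList<>();
        for (int i = 0; i < textBoxes.size(); i++) {
            double best = 0.0;
            for (Rectangle2D rect : blackRects) {
                best = Math.max(best, coveredFraction(textBoxes.get(i), rect));
            }
            if (best >= threshold) {
                hidden.add(i);
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        // Made-up coordinates: three 10x10 text boxes and one 25x10 black bar.
        List<Rectangle2D> textBoxes = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(0, 0, 10, 10),   // fully under the bar
                new Rectangle2D.Double(20, 0, 10, 10),  // half under the bar
                new Rectangle2D.Double(40, 0, 10, 10)); // outside the bar
        List<Rectangle2D> blackRects = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(0, 0, 25, 10));
        System.out.println(hiddenIndices(textBoxes, blackRects, 0.5));
    }
}
```

The threshold is where the honing comes in: set it too high and partly clipped words are missed, set it too low and text that merely touches a box edge is reported as hidden.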
[Later Edit]
The PDFBox team can give advice, and @TilmanHausherr suggested starting by looking at "pdfbox 2.0.2 > Calling of PageDrawer.processPage method caught exceptions".
Upvotes: 1