Shriram Kalpathy Mohan

Reputation: 131

Extracting text between two bookmarks using Apache PDFBox

I am using Apache PDFBox to read a PDF document that has a hierarchy defined by bookmarks. The hierarchy is in a tree form with contents only at the leaf level.

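When I extract the text between two leaf-level bookmarks using the following calls:

// stripper is a PDFTextStripper; startBookmark and endBookmark are the two leaf-level PDOutlineItem objects
stripper.setStartBookmark(startBookmark);
stripper.setEndBookmark(endBookmark);
stripper.writeText(document, writer);

I get the text of the whole page instead. In short, my problem is similar to the one mentioned in this thread.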

Is there a way to extract the contents between two bookmarks?

If so, what should I change in my code?

Upvotes: 6

Views: 2235

Answers (1)

I am not smart

Reputation: 1441

My guess is that your bookmarks do not contain location data.

It sounds like the bookmarks you are using only point to the page where your content starts, rather than to a location on that page.

Here is an example of a bookmark that contains location data (page 2, fit to width, with the top of the view at vertical coordinate 518):

<Title Action="GoTo" Style="bold" Page="2 FitH 518">
Title Name
</Title>
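
To check quickly whether your own bookmarks carry this kind of location data, something like the following might help. It is only a sketch in plain Java and does not use PDFBox: the class name BookmarkPageAttribute and the method hasLocation are made up for illustration, and it assumes the Page attribute always has the form "page mode coordinates" as in the example above. The mode names are the standard PDF destination types (XYZ, FitH, FitV, FitR, FitBH, FitBV take coordinates; Fit and FitB do not).

```java
// Hypothetical helper (not part of PDFBox): inspects a bookmark's Page
// attribute, e.g. "2 FitH 518", and reports whether it carries a position.
final class BookmarkPageAttribute {
    private BookmarkPageAttribute() {}

    /** Returns true if the attribute has a coordinate-bearing mode and at least one numeric coordinate. */
    static boolean hasLocation(String pageAttribute) {
        String[] parts = pageAttribute.trim().split("\\s+");
        if (parts.length < 3) {
            return false; // "2" or "2 Fit": page number only, no position
        }
        String mode = parts[1];
        switch (mode) {
            case "XYZ":
            case "FitH":
            case "FitV":
            case "FitR":
            case "FitBH":
            case "FitBV":
                break; // these modes take coordinates
            default:
                return false; // Fit, FitB, or something unrecognised
        }
        // For XYZ the last value is a zoom factor, so only left/top are checked.
        int end = mode.equals("XYZ") ? Math.min(parts.length, 4) : parts.length;
        for (int i = 2; i < end; i++) {
            try {
                Double.parseDouble(parts[i]);
                return true; // found a usable coordinate
            } catch (NumberFormatException e) {
                // e.g. "null": an unspecified coordinate, keep looking
            }
        }
        return false;
    }
}
```

If hasLocation returns false for the Page attribute of your start or end bookmark, that bookmark only identifies a page, which would fit the behaviour you are seeing.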

Upvotes: 0
