user3253099

Reputation: 91

How to extract bookmarks from a PDF?

When I open a PDF in a PDF viewer, I see a series of bookmarks on the left side of the actual document. The information shown there doesn't seem to be part of the actual content of the document: it isn't printed, and it isn't present on any specific page.

How can I extract these bookmarks using Java?

Upvotes: 1

Views: 5015

Answers (3)

Rainer

Reputation: 2185

To retrieve the bookmark content from a PDF file with Java, you can use the pCOS interface of PDFlib+PDI 9. Sample code is included in the pCOS Cookbook: http://www.pdflib.com/en/pcos-cookbook/interactive-elements/bookmarks/

Upvotes: 2

Kurt Pfeifle

Reputation: 90213

The OP question asked for a solution with Java.

However, this may be a topic of more general interest to people who have to handle PDFs, so my answer offers a command-line solution: mutool.

mutool is a command line utility bundled with the MuPDF viewer software, written by the same company which gave us Ghostscript.

Its latest version includes the show sub-command, which can print outlines (the PDF technical term for what the OP and the Adobe UI call "bookmarks"), among other items of interest from a PDF:

$ mutool show PDF32000_2008.pdf outlines

  Document management — Portable document format — Part 1: PDF 1.7  1
  Contents Page 3
  Foreword  6
  Introduction  7
  1 Scope   9
  2 Conformance 9
    2.1 General 9
    2.2 Conforming readers  9
    2.3 Conforming writers  9
    2.4 Conforming products 10
  3 Normative references    10
  4 Terms and definitions   14
  5 Notation    18
  6 Version Designations    18
  7 Syntax  19
    7.1 General 19
    7.2 Lexical Conventions 19
        7.2.1 General   19
        7.2.2 Character Set 20
        7.2.3 Comments  21
  [....]

(Output was shortened.) The original PDF document (the official PDF-1.7 specification) contains this page as the ToC:

[Image: table of contents page of the original document]

You can see that the /Outlines contents are similar to, but not the same as, the included table of contents page.

Here is how the outlines ("bookmarks") are displayed in Adobe Reader XI:

[Image: outlines panel in Adobe Reader XI]
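Since the question asked for Java, here is a minimal sketch of how the indented listing above could be turned into structured entries after capturing it from mutool. It assumes the layout shown in this answer: leading whitespace expresses nesting, and the last whitespace-separated token is the page number. Real mutool versions may format the output differently (for example with tabs), and the class and method names here are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class OutlineLineParser {

    /** One outline entry: leading-whitespace count, title, and page (-1 if none was found). */
    static final class Entry {
        final int indent;
        final String title;
        final int page;

        Entry(int indent, String title, int page) {
            this.indent = indent;
            this.title = title;
            this.page = page;
        }
    }

    /** Parses a single listing line under the assumed "indent, title, page" layout. */
    static Entry parseLine(String line) {
        int indent = 0;
        while (indent < line.length() && Character.isWhitespace(line.charAt(indent))) {
            indent++;
        }
        String rest = line.trim();
        int cut = rest.length();
        while (cut > 0 && Character.isDigit(rest.charAt(cut - 1))) {
            cut--;
        }
        // Treat the trailing digits as a page number only if whitespace precedes them.
        boolean hasPage = cut < rest.length() && cut > 0
                && Character.isWhitespace(rest.charAt(cut - 1));
        if (hasPage) {
            int page = Integer.parseInt(rest.substring(cut));
            return new Entry(indent, rest.substring(0, cut).trim(), page);
        }
        return new Entry(indent, rest, -1);
    }

    /** Parses all non-blank lines of a captured listing. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // A few lines copied from the shortened output in this answer.
        List<String> sample = new ArrayList<>();
        sample.add("  2 Conformance 9");
        sample.add("    2.1 General 9");
        sample.add("    2.4 Conforming products 10");
        for (Entry e : parse(sample)) {
            System.out.println(e.indent + " | " + e.title + " | " + e.page);
        }
    }
}
```

The raw text could be obtained by running the mutool command from Java (for instance with ProcessBuilder) and passing the captured lines to parse; that part is left out here because it depends on how mutool is installed.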

Upvotes: 5

Bruno Lowagie

Reputation: 77528

Please download the free ebook The Best iText Questions on StackOverflow. In that book, you'll find answers to many questions, including Reading PDF Bookmarks in VB.NET using iTextSharp.

The coolest way to extract bookmarks is to create an XML file that shows them in a nice hierarchical way:

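import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.SimpleBookmark;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.List;
// src is the path of the input PDF; dest is the path of the XML file to write
PdfReader reader = new PdfReader(src);
List<HashMap<String, Object>> list = SimpleBookmark.getBookmark(reader);
SimpleBookmark.exportToXML(list, new FileOutputStream(dest), "ISO8859-1", true);
reader.close();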
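If you would rather process the bookmarks in Java than export them, the list returned by SimpleBookmark.getBookmark can be walked recursively. The sketch below assumes the iText 5 map layout, where each entry stores its text under "Title", its destination under "Page" (a string such as "9 Fit") and its children as a nested list under "Kids". To keep it runnable without iText on the classpath, it uses a hand-built sample in that shape; the BookmarkWalker class and its flatten method are illustrative names, not part of iText.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

class BookmarkWalker {

    /** Appends one indented line per bookmark to out, descending into "Kids". */
    @SuppressWarnings("unchecked")
    static void flatten(List<HashMap<String, Object>> bookmarks, int depth, List<String> out) {
        if (bookmarks == null) {
            return; // assumption: a document without outlines may yield null
        }
        for (HashMap<String, Object> bookmark : bookmarks) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  "); // two spaces per nesting level
            }
            line.append(bookmark.get("Title"));
            Object page = bookmark.get("Page"); // may be absent for some bookmark actions
            if (page != null) {
                line.append(" -> ").append(page);
            }
            out.add(line.toString());
            Object kids = bookmark.get("Kids");
            if (kids instanceof List) {
                flatten((List<HashMap<String, Object>>) kids, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        // Hand-built sample standing in for SimpleBookmark.getBookmark(reader).
        HashMap<String, Object> child = new HashMap<>();
        child.put("Title", "2.1 General");
        child.put("Page", "9 Fit");
        List<HashMap<String, Object>> kids = new ArrayList<>();
        kids.add(child);

        HashMap<String, Object> parent = new HashMap<>();
        parent.put("Title", "2 Conformance");
        parent.put("Page", "9 Fit");
        parent.put("Kids", kids);
        List<HashMap<String, Object>> roots = new ArrayList<>();
        roots.add(parent);

        List<String> lines = new ArrayList<>();
        flatten(roots, 0, lines);
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

With iText available, the hand-built list would simply be replaced by the result of SimpleBookmark.getBookmark(reader).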

Upvotes: 3
