Java: Apache PDFbox Extract highlighted text

Question

I am using Apache PDFbox library to extract the the highlighted text (i.e., with yellow background) from a PDF file. I am totally new to this library and don't know which class from it to be used for this purpose. So far I have done extraction of text from comments using below code.

PDDocument pddDocument = PDDocument.load(new File("test.pdf"));
    List allPages = pddDocument.getDocumentCatalog().getAllPages();
    for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List la = page.getAnnotations();
    if (la.size() < 1) {
    continue;
    }
    System.out.println("Total annotations = " + la.size());
    System.out.println("
Process Page " + pageNum + "...");
    // Just get the first annotation for testing
    PDAnnotation pdfAnnot = la.get(0); 
    System.out.println("Getting text from comment = " + pdfAnnot.getContents());

Now I need to get the highlighted text, any code example will be highly appreciated.

mkl · Accepted Answer

The code in the question Not able to read the exact text highlighted across the lines already illustrates most concepts to use for extracting text from limited content regions on a page with PDFBox.

Having studied this code, the OP still wondered in a comment:

But one thing I am confused about is QuadPoints instead of Rect. as you mentioned there in comment. What are this, can you explain it with some code lines or in simple words, as I am also facing the same problem of multi lines highlghts?

In general the area an annotation refers to is a rectangle:

Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.

(from Table 164 – Entries common to all annotation dictionaries - in ISO 32000-1)

For some annotations types (e.g. text markups), this location value does not suffice because:

text to markup may be written at some odd angle but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
text to markup may start anywhere in a line and end anywhere in another one, so the markup area is not rectangular at all but it is the union of multiple rectangular parts.

To cope with such annotation types, therefore, the PDF specification provides a more generic way to define areas:

QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompasses a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order

x₁ y₁ x₂ y₂ x₃ y₃ x₄ y₄

specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x₁, y₁) and (x₂, y₂).

(from Table 179 – Additional entries specific to text markup annotations - in ISO 32000-1)

Thus, instead of the rectangle given by

PDRectangle rect = pdfAnnot.getRectangle();

in the code in the referenced question, you have to consider the quadrilaterals given by

COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName getPDFName("QuadPoints"));

and define regions for the PDFTextStripperByArea stripper accordingly. Unfortunately PDFTextStripperByArea.addRegion expects a rectangle as parameter, not some generic quadrilateral. As text usually is printed horizontally or vertically, that should not pose too big a problem.

PS One warning concerning the specification of the QuadPoints, the order may differ in real-life PDFs, cf. the question PDF Spec vs Acrobat creation (QuadPoints).

Java: Apache PDFbox Extract highlighted text

Answers (2)

Related Questions