chinna_82

Reputation: 6403

PdfBox - Get font Information using

I'm trying to extract text from a PDF using Square annotations. I use the code below to extract the text with PDFBox.
CODE

// usr and testTxt are declared elsewhere in the surrounding class.
try {
    PDDocument document = null;
    try {
        document = PDDocument.load(new File("//Users//" + usr + "//Desktop//BoldTest2 2.pdf"));
        List allPages = document.getDocumentCatalog().getAllPages();
        for (int i = 0; i < allPages.size(); i++) {
            PDPage page = (PDPage) allPages.get(i);
            List<PDAnnotation> la = page.getAnnotations();
            for (int f = 0; f < la.size(); f++) {
                PDAnnotation pdfAnnot = la.get(f);
                // Only Square annotations mark a region to extract.
                if (!"Square".equals(pdfAnnot.getSubtype())) {
                    continue;
                }
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);
                PDRectangle rect = pdfAnnot.getRectangle();

                float x = 0;
                float y = 0;
                float width = 0;
                float height = 0;
                int rotation = page.findRotation();

                // Only unrotated pages are handled; otherwise the region stays empty.
                if (rotation == 0) {
                    x = rect.getLowerLeftX();
                    y = rect.getUpperRightY() - 2;
                    width = rect.getWidth();
                    height = rect.getHeight();
                    // PDF coordinates start at the bottom left, AWT at the top left.
                    PDRectangle pageSize = page.findMediaBox();
                    y = pageSize.getHeight() - y;
                }
                Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion(Integer.toString(f), awtRect);
                stripper.extractRegions(page);
                testTxt = testTxt + "\n " + stripper.getTextForRegion(Integer.toString(f));
            }
        }
    } finally {
        if (document != null) {
            document.close();
        }
    }
} catch (Exception ex) {
    // Report the failure instead of silently swallowing it.
    ex.printStackTrace();
}

With this code I only get the plain text of the PDF. How can I also get the font information, such as bold and italic, together with the text? Advice or references are highly appreciated.

Upvotes: 1

Views: 4417

Answers (1)

Salil

Reputation: 1811

PDFTextStripper, which PDFTextStripperByArea extends, normalizes the text, i.e. it removes the formatting (cf. the JavaDoc comment):

* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.

If you look at the source, you will see that the font information is available in this class, but it is normalized out before printing:

protected void writePage() throws IOException
{
    [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
            if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
            {
                writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                line.clear();
                [...]
            }
    [...]
}

The TextPosition instances in the ArrayList carry all the formatting information. Solutions can therefore focus on redefining the existing methods as required. A few options:

  • private List normalize(List line, boolean isRtlDominant, boolean hasRtl)

Because normalize is private, it cannot be overridden in a subclass. If you want your own normalize method, you can copy the whole PDFTextStripper class into your project and change the code of the copy. Call the new class MyPDFTextStripper and define the new method as required. Similarly, copy PDFTextStripperByArea as MyPDFTextStripperByArea and have it extend MyPDFTextStripper.

  • protected void writePage()

If you only need a new writePage method, you can simply extend PDFTextStripper and override that method, then create MyPDFTextStripperByArea as described above.

  • writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant)

Another solution is to override the writeLine method, storing the pre-normalization information in a variable and then using it.
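Whichever option you pick, the remaining step is to turn the per-character font information into BOLD/ITALIC markers in the output. Here is a rough sketch of that step only, written without PDFBox so it can be run on its own. Everything in it is my own assumption: Glyph is a hypothetical stand-in for TextPosition that keeps only the character and the font name, and the style is guessed from the font name (for example "Arial-BoldMT"), which is a heuristic and will miss fonts whose names do not mention the style. In a real stripper the font name would presumably come from the font of each TextPosition; check the JavaDoc of your PDFBox version for the exact accessor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch only: tags bold/italic runs in one line of glyphs. */
final class StyledTextSketch {

    /** Hypothetical stand-in for PDFBox's TextPosition (not a PDFBox type). */
    static final class Glyph {
        final String character;
        final String fontName;

        Glyph(String character, String fontName) {
            this.character = character;
            this.fontName = fontName;
        }
    }

    /** Heuristic: treat a font as bold if its name contains "bold". */
    static boolean isBold(String fontName) {
        return fontName != null && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Heuristic: treat a font as italic if its name contains "italic" or "oblique". */
    static boolean isItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String name = fontName.toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    /** Wraps runs of bold and/or italic glyphs in <b> and <i> tags. */
    static String markUp(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        boolean bold = false;
        boolean italic = false;
        for (Glyph glyph : line) {
            boolean nowBold = isBold(glyph.fontName);
            boolean nowItalic = isItalic(glyph.fontName);
            if (nowBold != bold || nowItalic != italic) {
                // Style changed: close the old tags (inner first), then open the new ones.
                if (italic) out.append("</i>");
                if (bold) out.append("</b>");
                if (nowBold) out.append("<b>");
                if (nowItalic) out.append("<i>");
                bold = nowBold;
                italic = nowItalic;
            }
            out.append(glyph.character);
        }
        // Close whatever is still open at the end of the line.
        if (italic) out.append("</i>");
        if (bold) out.append("</b>");
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<Glyph>();
        line.add(new Glyph("H", "Helvetica"));
        line.add(new Glyph("i", "Helvetica-Bold"));
        line.add(new Glyph("!", "Helvetica-BoldOblique"));
        System.out.println(markUp(line));
    }
}
```

In an overridden writeLine or writePage you would call something like markUp on the un-normalized line and write the result instead of the plain text. The font descriptor flags are probably a more reliable source than the name, where the PDF provides them.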

Hope this helps.

Upvotes: 3
