PDFBox Newer version extracts data in jumbled order

Question

I am trying to extract text from a particular PDF region using PDFTextStripperByArea. The only text I am interested in comes out in jumbled order, while the rest of the page extracts properly. This is with PDFBox version 2.0.7.

When I try the same with the legacy version 1.8.x, the text is extracted properly.

The area I am extracting appears to use a different font from the rest of the PDF. I am not sure what is going wrong. Is there a way to extract the text correctly with the newer versions? I cannot fall back to the older version because of other dependencies.

What I have tried:

Running the PDF through the latest PDFBox version, 2.0.20: still no luck.
Debugging: it turns out that setSortByPosition does the swapping in the initial step of processing the page. However, I cannot set it to false, or I lose the newline characters (and the older version works fine with setSortByPosition set to true).

The code snippet:

Rectangle region = new Rectangle();
region.setRect(55, 75.80, 160, 100);
PDDocument pdfDoc = PDDocument.load(new File(pdfFilePath));
PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
stripperByArea.setSortByPosition(true);
stripperByArea.addRegion("CVAM", region);
stripperByArea.extractRegions(pdfDoc.getPages().get(0));
return stripperByArea.getTextForRegion("CVAM");

I am sharing the link to the PDF file in a comment. Thanks in advance!

mkl · Accepted Answer

The fonts in your PDF have very unrealistic metadata. In particular, their Ascent, Descent, CapHeight, and FontBBox entries contain values that claim the glyphs are about twice as high as they actually are. Because the visual text lines in your PDF are set quite tightly, a PDF tool that trusts this metadata must assume that there are not three text lines but one, or more probably two, with some letters raised a bit and some lowered a bit. Sorting by position therefore results in a hodgepodge.
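
To illustrate the effect, here is a minimal sketch of position-based line grouping. It is a deliberately simplified model and not PDFBox's actual sorting code: the class `LineMergeSketch`, its `Glyph` type, the fixed advance of 6 units, and the overlap rule are all assumptions made for the illustration. A glyph is taken to belong to the current line if its box, as tall as the font claims, still reaches above that line's baseline.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineMergeSketch {
    /** A glyph reduced to what matters here: character, x position, baseline (y grows downwards). */
    static final class Glyph {
        final char c;
        final float x;
        final float baseline;

        Glyph(char c, float x, float baseline) {
            this.c = c;
            this.x = x;
            this.baseline = baseline;
        }
    }

    /** Lays out each string as one visual line, 'pitch' units apart, with an assumed advance of 6. */
    static List<Glyph> layout(float pitch, String... lines) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int row = 0; row < lines.length; row++) {
            for (int col = 0; col < lines[row].length(); col++) {
                glyphs.add(new Glyph(lines[row].charAt(col), col * 6f, (row + 1) * pitch));
            }
        }
        return glyphs;
    }

    /** Groups glyphs into lines using the claimed font height, then sorts each line by x. */
    static String extract(List<Glyph> glyphs, float fontHeight) {
        List<Glyph> byBaseline = new ArrayList<>(glyphs);
        byBaseline.sort(Comparator.comparingDouble((Glyph g) -> g.baseline));

        List<List<Glyph>> lines = new ArrayList<>();
        float lineBaseline = Float.NaN;
        for (Glyph g : byBaseline) {
            // The glyph box spans [baseline - fontHeight, baseline]; it counts as part of
            // the current line if its top lies above that line's baseline.
            boolean sameLine = !lines.isEmpty() && g.baseline - fontHeight < lineBaseline;
            if (!sameLine) {
                lines.add(new ArrayList<>());
                lineBaseline = g.baseline;
            }
            lines.get(lines.size() - 1).add(g);
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (Glyph g : line) {
                out.append(g.c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> page = layout(10f, "AB", "12", "xy");
        System.out.println(extract(page, 8f));  // height below the line pitch: lines stay separate
        System.out.println(extract(page, 16f)); // inflated height: the first two lines interleave
    }
}
```

With a line pitch of 10 and a claimed height of 16, the first two lines are treated as one and their characters are interleaved by x position, while the third line survives on its own. That is the same kind of pattern as in the garbled address.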

PDFBox is not the only tool that has issues with these fonts. For example, if you open the PDF in Adobe Reader and click into the text, you get a giant cursor bar, and copying and pasting the address results in:

1D4A0N0I EHL IDD DEPNO WELALKLES DR MT PLEASANT SC 29464-9473

Nonetheless, following @Tilman's remark that 2.0.21 will make it possible to supply your own font height calculation, I used that feature in the current PDFBox development head to return a constant, low font height:

PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea() {
    @Override
    protected float computeFontHeight(PDFont font) throws IOException {
        return .5f;
    }
};
stripperByArea.setSortByPosition(false);
stripperByArea.addRegion("CVAM", region);
stripperByArea.extractRegions(pdfDoc.getPages().get(0));
String text = stripperByArea.getTextForRegion("CVAM");

(from ExtractText test testCustomFontHeightYOYO)

With SortByPosition set to either true or false, the result now is:

DANIEL D POWELL
1400 HIDDEN LAKES DR
MT PLEASANT SC 29464-9473
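
As an aside, unrelated to the jumbling: the region in the question is built with java.awt.Rectangle, which stores integer coordinates, so the fractional y value of 75.80 is not kept as written. Since addRegion accepts any Rectangle2D, a Rectangle2D.Double would preserve it. The sketch below only uses the JDK and compares the two; the class and method names are hypothetical.

```java
import java.awt.Rectangle;
import java.awt.geom.Rectangle2D;

class RegionRoundingSketch {
    /** Builds the region the way the question does: an int-based java.awt.Rectangle. */
    static Rectangle intRegion(double x, double y, double w, double h) {
        Rectangle r = new Rectangle();
        r.setRect(x, y, w, h); // stored as the enclosing integer bounds
        return r;
    }

    /** Builds the same region with double precision. */
    static Rectangle2D doubleRegion(double x, double y, double w, double h) {
        return new Rectangle2D.Double(x, y, w, h);
    }

    public static void main(String[] args) {
        // Print both variants to compare the stored bounds.
        System.out.println(intRegion(55, 75.80, 160, 100));
        System.out.println(doubleRegion(55, 75.80, 160, 100));
    }
}
```

For this particular region the integer variant is slightly larger than requested (y starts at 75 and the height becomes 101), which is unlikely to matter here but can pull in neighbouring text when regions are tight.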
