Reputation: 15
I am trying to extract data from a particular PDF region using PDFTextStripperByArea. The only data I am interested in comes out in jumbled order, while the rest of the page is extracted properly. This is with PDFBox version 2.0.7.
When I try the same with the legacy 1.8.x version, the data is extracted properly.
The area I am extracting appears to use a different font from the rest of the PDF. I am a little confused about what is going wrong. Is there any way to extract the data correctly with the newer versions? I cannot fall back to the older version because of other dependencies.
What I have tried (code snippet):
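// Needs java.awt.geom.Rectangle2D, java.io.File, org.apache.pdfbox.pdmodel.PDDocument
// and org.apache.pdfbox.text.PDFTextStripperByArea.
// Rectangle2D.Double keeps the fractional y value; java.awt.Rectangle stores ints.
Rectangle2D region = new Rectangle2D.Double(55, 75.80, 160, 100);
// try-with-resources closes the document once the text has been extracted
try (PDDocument pdfDoc = PDDocument.load(new File(pdfFilePath))) {
    PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();
    stripperByArea.setSortByPosition(true);
    stripperByArea.addRegion("CVAM", region);
    // Extract only from the first page
    stripperByArea.extractRegions(pdfDoc.getPages().get(0));
    return stripperByArea.getTextForRegion("CVAM");
}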
I am sharing the link to the PDF file in the comments. Thanks in advance!
Upvotes: 0
Views: 249
Reputation: 95928
The fonts in your PDF have very unrealistic metadata. In particular, their Ascent, Descent, CapHeight, and FontBBox entries contain values claiming that the glyphs are about twice as high as they actually are. As the visual text lines in your PDF are set quite tightly, a PDF tool that follows this metadata must assume that there are not three text lines but one or, more probably, two, with some letters raised a bit and some lowered a bit. Sorting by position therefore results in a hodgepodge.
PDFBox is not the only tool that has issues with these fonts. For example, if you open the PDF in Adobe Reader and click into the text, you get a giant cursor bar, and copying and pasting the address results in
1D4A0N0I EHL IDD DEPNO WELALKLES DR MT PLEASANT SC 29464-9473
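To illustrate why overstated glyph heights scramble position-sorted output, here is a minimal, self-contained sketch. It is a toy model only, not the actual algorithm of PDFBox or Adobe Reader: the class LineGroupingSketch, its Glyph type, and all coordinates (a line pitch of 10 units and a 1-unit horizontal offset between the two lines) are assumptions invented for the illustration. A glyph is put on the current line whenever its claimed vertical extent reaches up to that line's baseline, and each line is then ordered left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineGroupingSketch {

    /** A glyph: its character, left x coordinate, and baseline y (y grows downward). */
    static final class Glyph {
        final char c;
        final double x;
        final double baseline;

        Glyph(char c, double x, double baseline) {
            this.c = c;
            this.x = x;
            this.baseline = baseline;
        }
    }

    /** Places the characters of one text line at a fixed horizontal pitch of 10 units. */
    static List<Glyph> line(String text, double startX, double baseline) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + 10 * i, baseline));
        }
        return glyphs;
    }

    /** Two tightly set lines (hypothetical coordinates): baselines only 10 units apart. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.addAll(line("DANIEL", 1, 10));
        glyphs.addAll(line("1400 H", 0, 20));
        return glyphs;
    }

    /**
     * Groups glyphs into lines top to bottom. A glyph starts a new line only if its
     * claimed top (baseline - claimedHeight) is at or below the current line's baseline;
     * otherwise it joins the current line. Each line is then sorted by x.
     */
    static String extract(List<Glyph> glyphs, double claimedHeight) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.baseline));

        List<List<Glyph>> lines = new ArrayList<>();
        double lineBaseline = Double.NEGATIVE_INFINITY;
        for (Glyph g : sorted) {
            if (lines.isEmpty() || g.baseline - claimedHeight >= lineBaseline) {
                lines.add(new ArrayList<>());
                lineBaseline = g.baseline;
            }
            lines.get(lines.size() - 1).add(g);
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> l : lines) {
            l.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (Glyph g : l) {
                out.append(g.c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = sample();
        // Claimed height larger than the line pitch: both lines merge and interleave.
        System.out.println(extract(glyphs, 16.0)); // 1D4A0N0I EHL
        // Claimed height smaller than the line pitch: two separate lines, DANIEL then 1400 H.
        System.out.println(extract(glyphs, 8.0));
    }
}
```

In this model, a claimed height above the line pitch makes the two lines collapse into one interleaved run of characters, while a height below the pitch keeps them apart.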
Nonetheless, following @Tilman's remark that 2.0.21 will make it possible to supply your own height calculation, I used that feature in the current PDFBox development head to supply a constant, low font height:
// Override the height calculation so the inflated font metadata is ignored
PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea() {
    @Override
    protected float computeFontHeight(PDFont font) throws IOException {
        return 0.5f; // constant, deliberately low font height
    }
};
stripperByArea.setSortByPosition(false);
stripperByArea.addRegion("CVAM", region);
stripperByArea.extractRegions(pdfDoc.getPages().get(0));
String text = stripperByArea.getTextForRegion("CVAM");
(from the ExtractText test testCustomFontHeightYOYO)
With SortByPosition set to either true or false, the result now is:
DANIEL D POWELL
1400 HIDDEN LAKES DR
MT PLEASANT SC 29464-9473
Upvotes: 2