Reputation: 6403
I'm trying to extract text from a PDF using Square annotations. I use the code below to extract the text with PDFBox.
CODE
try {
    PDDocument document = null;
    try {
        document = PDDocument.load(new File("//Users//" + usr + "//Desktop//BoldTest2 2.pdf"));
        List allPages = document.getDocumentCatalog().getAllPages();
        for (int i = 0; i < allPages.size(); i++) {
            PDPage page = (PDPage) allPages.get(i);
            Map<String, PDFont> pageFonts = page.getResources().getFonts();
            List<PDAnnotation> la = page.getAnnotations();
            for (int f = 0; f < la.size(); f++) {
                PDAnnotation pdfAnnot = la.get(f);
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);
                PDRectangle rect = pdfAnnot.getRectangle();
                float x = 0;
                float y = 0;
                float width = 0;
                float height = 0;
                int rotation = page.findRotation();
                if (rotation == 0) {
                    x = rect.getLowerLeftX();
                    y = rect.getUpperRightY() - 2;
                    width = rect.getWidth();
                    height = rect.getHeight();
                    // Flip the y axis: annotation rectangles use a bottom-left origin,
                    // while the stripper region uses a top-left origin.
                    PDRectangle pageSize = page.findMediaBox();
                    y = pageSize.getHeight() - y;
                }
                Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion(Integer.toString(f), awtRect);
                stripper.extractRegions(page);
                PrintTextLocation2 prt = new PrintTextLocation2();
                // Only keep the text that lies under Square annotations.
                if (pdfAnnot.getSubtype().equals("Square")) {
                    testTxt = testTxt + "\n " + stripper.getTextForRegion(Integer.toString(f));
                }
            }
        }
    } catch (Exception ex) {
        // Report the failure instead of silently swallowing it.
        ex.printStackTrace();
    } finally {
        if (document != null) {
            document.close();
        }
    }
} catch (Exception ex) {
    // document.close() may throw as well.
    ex.printStackTrace();
}
With this code, I am only able to get the PDF text. How can I also get font information, such as bold and italic, together with the text? Any advice or references are highly appreciated.
Upvotes: 1
Views: 4417
Reputation: 1811
The PDFTextStripper class, which PDFTextStripperByArea extends, normalizes the text (i.e., removes its formatting); cf. the JavaDoc comment:
* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.
If you look at the source, you will see that the font information is available in this class, but it is normalized out before printing:
protected void writePage() throws IOException
{
    [...]
    List<TextPosition> line = new ArrayList<TextPosition>();
    [...]
    if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
    {
        writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
        line.clear();
        [...]
    }
    [...]
The TextPosition instances in the ArrayList carry all the formatting information. A solution can therefore redefine the existing methods as required. Here are a few options:
If you want your own normalize method, you can copy the whole PDFTextStripper class into your project and change the code of the copy. Call the new class MyPDFTextStripper and define the new method as required. Similarly, copy PDFTextStripperByArea as MyPDFTextStripperByArea, which would extend MyPDFTextStripper.
If you just need a new writePage method, you can simply extend PDFTextStripper and override this method, then create MyPDFTextStripperByArea as described above.
Another solution is to override the writeLine method so that it stores the pre-normalization information in a variable and then uses it.
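Whichever option you choose, the core step is the same: walk the un-normalized text pieces of a line and look at the font of each one. The following is only a rough sketch of that step, not PDFBox code. Glyph is a hypothetical stand-in for TextPosition (which exposes its font via getFont()), and guessing bold/italic from the font name is a heuristic I am assuming here; it fails for fonts whose names do not mention the style.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; not part of PDFBox. */
final class FontStyleRuns {

    /** Hypothetical stand-in for TextPosition: a piece of text plus its font name. */
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Heuristic (assumption): guess the style from a name like "ABCDEF+Arial-BoldItalicMT". */
    static String styleOf(String fontName) {
        if (fontName == null) {
            return "regular";
        }
        String name = fontName.toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        if (bold && italic) {
            return "bold-italic";
        }
        if (bold) {
            return "bold";
        }
        return italic ? "italic" : "regular";
    }

    /** Returns the line's text with a [style] marker inserted wherever the style changes. */
    static String annotate(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        String current = null;
        for (Glyph g : line) {
            String style = styleOf(g.fontName);
            if (!style.equals(current)) {
                out.append('[').append(style).append(']');
                current = style;
            }
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("Hello ", "ABCDEF+Arial-BoldMT"));
        line.add(new Glyph("PDF ", "ABCDEF+Arial-BoldItalicMT"));
        line.add(new Glyph("world", "ArialMT"));
        // prints "[bold]Hello [bold-italic]PDF [regular]world"
        System.out.println(annotate(line));
    }
}
```

In a real subclass, the loop in annotate would run over the List<TextPosition> of the current line before normalize is called, and the font name would be read from each position's font; how to read the name depends on your PDFBox version.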
Hope this helps.
Upvotes: 3