Reputation: 43
My task is to extract text from a PDF at specific coordinates.
I am using Apache PDFBox for the extraction.
To get the x, y, height, and width values from the PDF, I am using the PDF-XChange tool, which reports them in millimeters. When I pass these values to the rectangle, I get an empty value.
public String getTextUsingPositionsUsingPdf(String pdfLocation, int pageNumber, double x, double y, double width,
double height) throws IOException {
String extractedText = "";
// PDDocument.load reads an existing PDF document from a file.
PDDocument document = null;
PDPage page = null;
try {
if (pdfLocation.endsWith(".pdf")) {
document = PDDocument.load(new File(pdfLocation));
int getDocumentPageCount = document.getNumberOfPages();
System.out.println(getDocumentPageCount);
// Get the requested page. The parameter is a page index, which starts at 0,
// so the first page has index 0.
if (getDocumentPageCount > 0) {
page = document.getPage(pageNumber + 1);
} else if (getDocumentPageCount == 0) {
page = document.getPage(0);
}
// To create a rectangle by passing the x axis, y axis, width and height
Rectangle2D rect = new Rectangle2D.Double(x, y, width, height);
String regionName = "region1";
// Strip the text from PDF using PDFTextStripper Area with the
// help of Rectangle and named need to given for the rectangle
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
stripper.addRegion(regionName, rect);
stripper.extractRegions(page);
System.out.println("Region is " + stripper.getTextForRegion("region1"));
extractedText = stripper.getTextForRegion("region1");
} else {
System.out.println("No data return");
}
} catch (IOException e) {
System.out.println("Could not read the file: " + e.getMessage());
} finally {
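// document is still null if the path did not end with ".pdf" or loading failed
if (document != null) {
document.close();
}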
}
// Return the extracted text and this can be used for assertion
return extractedText;
}
Please advise whether my approach is correct.
Upvotes: 0
Views: 987
Reputation: 95918
I used this PDF: tutorialspoint.com/uipath/uipath_tutorial.pdf. I am trying to find the text "a part of contests", which is at x = 55.6 mm, y = 168.8, width = 210.0 mm, and height = 297.0, but I am getting an empty value.
I tested your method with those inputs:
System.out.println("Extracting like Venkatachalam Neelakantan from uipath_tutorial.pdf\n");
float MM_TO_UNITS = 1/(10*2.54f)*72;
String text = getTextUsingPositionsUsingPdf("src/test/resources/mkl/testarea/pdfbox2/extract/uipath_tutorial.pdf",
0, 55.6 * MM_TO_UNITS, 168.8 * MM_TO_UNITS, 210.0 * MM_TO_UNITS, 297.0 * MM_TO_UNITS);
System.out.printf("\n---\nResult:\n%s\n", text);
(ExtractText test testUiPathTutorial)
and got the result
part of contents of this e-book in any manner without written consent
te the contents of our website and tutorials as timely and as precisely as
, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt.
guarantee regarding the accuracy, timeliness or completeness of our
tents including this tutorial. If you discover any errors on our website or
ease notify us at [email protected]
i
Assuming you actually were looking for "a part of contents", not "a part of contests", merely the 'a' is missing; probably when measuring you looked for the beginning of the visible letter drawing but the actual glyph origin is a bit before that. If you choose a slightly smaller x, e.g. 54.6 mm, you'll also get the 'a'.
It obviously is no surprise that you get more than "a part of contents", considering the width and height of your rectangle.
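As a rough sketch of how a tighter region could be built: the helper below converts a millimeter rectangle to PDF units and widens it slightly to the left so that the glyph origin falls inside. The class name RegionSketch, the method regionFromMm, the 1 mm padding, and the 40 mm x 5 mm size are made-up values for illustration, not measurements taken from the PDF.

```java
import java.awt.geom.Rectangle2D;

public class RegionSketch {
    // Default PDF user space: 72 units per inch, 25.4 mm per inch.
    static final double MM_TO_UNITS = 72.0 / 25.4;

    /**
     * Builds a region in PDF units from millimeter values.
     * The left edge is moved left by padMm and the width grows by the same
     * amount, so the right edge stays where it was.
     */
    static Rectangle2D regionFromMm(double xMm, double yMm,
                                    double widthMm, double heightMm, double padMm) {
        return new Rectangle2D.Double(
                (xMm - padMm) * MM_TO_UNITS,
                yMm * MM_TO_UNITS,
                (widthMm + padMm) * MM_TO_UNITS,
                heightMm * MM_TO_UNITS);
    }

    public static void main(String[] args) {
        // Hypothetical size: 40 mm wide, 5 mm high, padded 1 mm to the left.
        Rectangle2D rect = regionFromMm(55.6, 168.8, 40.0, 5.0, 1.0);
        System.out.println(rect);
    }
}
```

The resulting rectangle could then be passed to stripper.addRegion in place of the page-sized one; the actual width and height would need to be measured for the text in question.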
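Should you wonder about the MM_TO_UNITS constant, have a look at this answer. In short, it converts millimeters to default PDF user space units, which are 1/72 inch (1 inch = 25.4 mm).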
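For a quick sanity check of that conversion, here is a minimal sketch, assuming the page uses the default user space unit of 1/72 inch. The class name MmToUnits and the method mmToUnits are illustrative only. Converting the width and height from the question yields the usual A4 page size in points, which shows that the rectangle spans a whole page's worth of area.

```java
import java.util.Locale;

public class MmToUnits {
    static final double UNITS_PER_INCH = 72.0; // default PDF user space
    static final double MM_PER_INCH = 25.4;

    /** Converts a length in millimeters to default PDF user space units. */
    static double mmToUnits(double mm) {
        return mm / MM_PER_INCH * UNITS_PER_INCH;
    }

    public static void main(String[] args) {
        // Width and height from the question: 210 mm x 297 mm (A4).
        System.out.printf(Locale.ROOT, "%.2f x %.2f%n",
                mmToUnits(210.0), mmToUnits(297.0));
        // prints "595.28 x 841.89"
    }
}
```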
Upvotes: 1