Reputation: 260
I want to find every occurrence of a keyword in a PDF file and get the position of each match on the page where it occurs.
I found some iText5 code that looks like it does what I need:
for (i = 1; i <= pageNum; i++)
{
    pdfReaderContentParser.processContent(i, new RenderListener()
    {
        @Override
        public void renderText(TextRenderInfo textRenderInfo)
        {
            String text = textRenderInfo.getText();
            if (null != text && text.contains(KEY_WORD))
            {
                // Float is java.awt.geom.Rectangle2D.Float
                Float boundingRectange = textRenderInfo
                        .getBaseline().getBoundingRectange();
                resu = new float[3];
                System.out.println("======="+text);
                System.out.println("h:"+boundingRectange.getHeight());
                System.out.println("w:"+boundingRectange.width);
                System.out.println("centerX:"+boundingRectange.getCenterX());
                System.out.println("centerY:"+boundingRectange.getCenterY());
                System.out.println("x:"+boundingRectange.getX());
                System.out.println("y:"+boundingRectange.getY());
                System.out.println("maxX:"+boundingRectange.getMaxX());
                System.out.println("maxY:"+boundingRectange.getMaxY());
                System.out.println("minX:"+boundingRectange.getMinX());
                System.out.println("minY:"+boundingRectange.getMinY());
                // Result: x, y, and the page number of the match
                resu[0] = boundingRectange.x;
                resu[1] = boundingRectange.y;
                resu[2] = i;
            }
        }
        @Override
        public void renderImage(ImageRenderInfo arg0)
        {
        }
        @Override
        public void endTextBlock()
        {
        }
        @Override
        public void beginTextBlock()
        {
        }
    });
}
But I don't know how to do the same thing in iText7.
Upvotes: 0
Views: 2718
Reputation: 12312
iText7 has a pdf2Data add-on that can easily help you achieve your goal (and help with other data extraction cases).
Let's say you want to extract the positions of the word Header. Go to the https://pdf2data.online demo application, upload your template (any file containing the words you want to extract), and open the data field editor.
Now you can add a data field with a selector that picks out the data you are interested in. In this case you can use the Regular expression selector, which is very flexible in general, but here the settings are straightforward.
The editor highlights all occurrences of the word you are searching for. Now go back to the first step (there is an icon at the top right of the editor to return to the demo) and download the template (the link is below the icon corresponding to the uploaded file).
Instructions for including pdf2Data in your project are at https://pdf2data.online/gettingStarted. Roughly, the code you need is the following:
LicenseKey.loadLicenseFile("license.xml");
Template template = Pdf2DataExtractor.parseTemplateFromPDF("Template.pdf");
Pdf2DataExtractor extractor = new Pdf2DataExtractor(template);
ParsingResult result = extractor.recognize("toParse.pdf");
for (ResultElement element : result.getResults("Headers")) {
    Rectangle bbox = element.getBbox();
    int page = element.getPage();
    System.out.println(MessageFormat.format("Coordinates on page {0}: [{1}, {2}, {3}, {4}]",
            page, bbox.getX(), bbox.getY(), bbox.getX() + bbox.getWidth(), bbox.getY() + bbox.getHeight()));
}
Example output:
Coordinates on page 1: [38.5, 788.346, 77.848, 799.446]
Coordinates on page 1: [123.05, 788.346, 162.398, 799.446]
Coordinates on page 1: [207.6, 788.346, 246.948, 799.446]
Coordinates on page 2: [38.5, 788.346, 77.848, 799.446]
Coordinates on page 2: [123.05, 788.346, 162.398, 799.446]
Coordinates on page 2: [207.6, 788.346, 246.948, 799.446]
The pdf2Data add-on is closed source and is currently available only under a commercial license.
Of course, you could also port your code directly to iText7, and that would be another solution to your task. But I must warn you that your code does not cover all scenarios. For example, text in a PDF can be written letter by letter instead of as a whole word at once (the two PDFs can look identical), and in that case the code you attached would not work, because no single text chunk would contain the whole keyword. pdf2Data handles those cases out of the box, taking that burden off your shoulders.
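To make the letter-by-letter caveat concrete, here is a minimal, library-free sketch of one way to match a keyword that is spread over several text chunks. It does not use iText or pdf2Data. `Chunk`, `Match` and `find` are hypothetical names invented for this illustration, and the sketch assumes the chunks arrive in reading order on a single line, which a real PDF does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedKeywordSearch {

    /** Hypothetical stand-in for one text-render event: the text drawn and its box. */
    static final class Chunk {
        final String text;
        final float x, y, width, height;

        Chunk(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Box covering every chunk that contributes characters to one match. */
    static final class Match {
        final float minX, minY, maxX, maxY;

        Match(float minX, float minY, float maxX, float maxY) {
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }
    }

    /** Searches the concatenated chunk text, so a keyword may span chunk boundaries. */
    static List<Match> find(List<Chunk> chunks, String keyword) {
        StringBuilder all = new StringBuilder();
        List<Integer> owner = new ArrayList<>(); // owner.get(i) = index of the chunk that drew character i
        for (int c = 0; c < chunks.size(); c++) {
            String t = chunks.get(c).text;
            all.append(t);
            for (int k = 0; k < t.length(); k++) {
                owner.add(c);
            }
        }
        List<Match> matches = new ArrayList<>();
        if (keyword.isEmpty()) {
            return matches;
        }
        int from = 0;
        int at;
        while ((at = all.indexOf(keyword, from)) >= 0) {
            int first = owner.get(at);
            int last = owner.get(at + keyword.length() - 1);
            float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
            // Union of the boxes of all chunks touched by this occurrence.
            for (int c = first; c <= last; c++) {
                Chunk ch = chunks.get(c);
                minX = Math.min(minX, ch.x);
                minY = Math.min(minY, ch.y);
                maxX = Math.max(maxX, ch.x + ch.width);
                maxY = Math.max(maxY, ch.y + ch.height);
            }
            matches.add(new Match(minX, minY, maxX, maxY));
            from = at + keyword.length();
        }
        return matches;
    }

    public static void main(String[] args) {
        // "Header" written one letter per chunk, as described above (made-up coordinates).
        String word = "Header";
        List<Chunk> letters = new ArrayList<>();
        for (int k = 0; k < word.length(); k++) {
            letters.add(new Chunk(String.valueOf(word.charAt(k)), 10f + 8f * k, 100f, 8f, 12f));
        }
        for (Match m : find(letters, "Header")) {
            System.out.println("[" + m.minX + ", " + m.minY + ", " + m.maxX + ", " + m.maxY + "]");
        }
    }
}
```

A per-chunk `text.contains(KEY_WORD)` check finds nothing in this input, while the concatenated search finds one match. Note that the reported box covers whole chunks, so if a chunk holds more text than the keyword, the box is wider than the keyword itself.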
Upvotes: 1