JDChi

Reputation: 260

How can I get the position of the specified keyword in iText7?

I want to find every occurrence of a keyword in a PDF file and get the position of each match on the page where it appears.

I only found some iText5 code that looks like it does what I need:

for (i = 1; i <= pageNum; i++)
    {
        pdfReaderContentParser.processContent(i, new RenderListener()
        {

            @Override
            public void renderText(TextRenderInfo textRenderInfo)
            {
                String text = textRenderInfo.getText();
                if (null != text && text.contains(KEY_WORD))
                {
                    Float boundingRectange = textRenderInfo
                            .getBaseline().getBoundingRectange();
                    resu = new float[3];
                    System.out.println("======="+text);
                    System.out.println("h:"+boundingRectange.getHeight());
                    System.out.println("w:"+boundingRectange.width);
                    System.out.println("centerX:"+boundingRectange.getCenterX());
                    System.out.println("centerY:"+boundingRectange.getCenterY());
                    System.out.println("x:"+boundingRectange.getX());
                    System.out.println("y:"+boundingRectange.getY());
                    System.out.println("maxX:"+boundingRectange.getMaxX());
                    System.out.println("maxY:"+boundingRectange.getMaxY());
                    System.out.println("minX:"+boundingRectange.getMinX());
                    System.out.println("minY:"+boundingRectange.getMinY());
                    resu[0] = boundingRectange.x;
                    resu[1] = boundingRectange.y;
                    resu[2] = i;
                }
            }

            @Override
            public void renderImage(ImageRenderInfo arg0)
            {
            }

            @Override
            public void endTextBlock()
            {

            }

            @Override
            public void beginTextBlock()
            {
            }
        });
    }

But I don't know how to do this in iText7.

Upvotes: 0

Views: 2718

Answers (1)

Alexey Subach

Reputation: 12312

iText7 has a pdf2Data add-on that can easily help you achieve your goal (and help with other data extraction cases).

Let's say you want to extract the positions of the word Header. Go to the https://pdf2data.online demo application, upload your template (any file containing the words you want to extract), and open the data field editor, which looks like this:

[Image: pdf2Data data field editor]

Now you can add a data field with a selector that picks out the data you are interested in. Here you can use the Regular expression selector, which is very flexible in general, but in this case the settings are straightforward:

[Image: data field configuration]

You can see that the editor highlights all occurrences of the word we are searching for. Now go back to the first step (there is an icon at the top right of the editor that returns to the demo) and download the template (the link at the bottom of the icon corresponding to the uploaded file).

You can read how to include pdf2Data in your project on this page: https://pdf2data.online/gettingStarted. Roughly, the code you need is the following:

LicenseKey.loadLicenseFile("license.xml");

Template template = Pdf2DataExtractor.parseTemplateFromPDF("Template.pdf");
Pdf2DataExtractor extractor = new Pdf2DataExtractor(template);
ParsingResult result = extractor.recognize("toParse.pdf");
for (ResultElement element : result.getResults("Headers")) {
    Rectangle bbox = element.getBbox();
    int page = element.getPage();
    System.out.println(MessageFormat.format("Coordinates on page {0}: [{1}, {2}, {3}, {4}]",
            page, bbox.getX(), bbox.getY(), bbox.getX() + bbox.getWidth(), bbox.getY() + bbox.getHeight()));
}

Example output:

Coordinates on page 1: [38.5, 788.346, 77.848, 799.446]
Coordinates on page 1: [123.05, 788.346, 162.398, 799.446]
Coordinates on page 1: [207.6, 788.346, 246.948, 799.446]
Coordinates on page 2: [38.5, 788.346, 77.848, 799.446]
Coordinates on page 2: [123.05, 788.346, 162.398, 799.446]
Coordinates on page 2: [207.6, 788.346, 246.948, 799.446]

The pdf2Data add-on is closed source and currently available only under a commercial license. Of course, you could also port your code directly to iText7, which would be another solution to your task. But I must warn you that your code is not universal: text in a PDF can be written letter by letter instead of a whole word at once (the two PDFs can look exactly the same), and in that case the code you attached would not work. pdf2Data handles those cases out of the box, taking the burden off your shoulders.
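
To make the letter-by-letter problem concrete, here is a minimal plain-Java sketch with no iText dependency. The `Chunk` and `Match` classes and both method names are hypothetical, invented for this illustration; `Chunk` simply stands in for whatever text and bounding box a render callback hands you. It assumes the chunks arrive in reading order on a single line, which a real PDF does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedKeywordSearch {

    /** Hypothetical stand-in for one text-render event: the text plus its box. */
    static final class Chunk {
        final String text;
        final float x, y, width, height;

        Chunk(String text, float x, float y, float width, float height) {
            this.text = text; this.x = x; this.y = y;
            this.width = width; this.height = height;
        }
    }

    /** Bounding box of one keyword match (lower-left corner plus size). */
    static final class Match {
        final float x, y, width, height;

        Match(float x, float y, float width, float height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    /** Per-chunk check, as in the question: misses keywords split across chunks. */
    static int countPerChunk(List<Chunk> chunks, String keyword) {
        int hits = 0;
        for (Chunk c : chunks) {
            if (c.text.contains(keyword)) hits++;
        }
        return hits;
    }

    /** Joins the chunks, searches the joined text, and maps hits back to chunk boxes. */
    static List<Match> findAcrossChunks(List<Chunk> chunks, String keyword) {
        StringBuilder joined = new StringBuilder();
        List<Integer> owner = new ArrayList<>(); // owner.get(i) = chunk that produced char i
        for (int i = 0; i < chunks.size(); i++) {
            String t = chunks.get(i).text;
            joined.append(t);
            for (int k = 0; k < t.length(); k++) owner.add(i);
        }

        List<Match> matches = new ArrayList<>();
        if (keyword.isEmpty()) return matches;

        int from = joined.indexOf(keyword);
        while (from >= 0) {
            int first = owner.get(from);
            int last = owner.get(from + keyword.length() - 1);
            float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
            // Union of the boxes of every chunk the match touches. This is only
            // approximate when the keyword covers part of a longer chunk.
            for (int i = first; i <= last; i++) {
                Chunk c = chunks.get(i);
                minX = Math.min(minX, c.x);
                minY = Math.min(minY, c.y);
                maxX = Math.max(maxX, c.x + c.width);
                maxY = Math.max(maxY, c.y + c.height);
            }
            matches.add(new Match(minX, minY, maxX - minX, maxY - minY));
            from = joined.indexOf(keyword, from + 1);
        }
        return matches;
    }

    public static void main(String[] args) {
        // "Header" written one letter per chunk, each letter 6 units wide.
        String word = "Header";
        List<Chunk> letters = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            letters.add(new Chunk(String.valueOf(word.charAt(i)), 10f + 6f * i, 100f, 6f, 12f));
        }
        System.out.println("per-chunk hits: " + countPerChunk(letters, word));
        for (Match m : findAcrossChunks(letters, word)) {
            System.out.println("match at x=" + m.x + " y=" + m.y + " w=" + m.width + " h=" + m.height);
        }
    }
}
```

The per-chunk check finds nothing here because no single chunk contains the whole word, while the joined search finds one match spanning all six letter boxes. A real port would still need to sort chunks by position and decide where spaces belong between them, which this sketch leaves out.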

Upvotes: 1
