Gigi

Reputation: 315

Get text occurrences contained in a specified area with iTextSharp

Is it possible, using iTextSharp, to get all text occurrences contained in a specified area of a PDF document?

Thanks.

Upvotes: 6

Views: 6396

Answers (2)

Ogglas

Reputation: 70194

@BrunoLowagie gives an excellent answer, but what I really struggled with was getting the actual coordinates to use. I started out using the Cursor Coordinates display in Adobe Acrobat Pro.

From here I could get the coordinates in inches and calculate the DTP points (PostScript points) by multiplying each value by 72.

However, something was still not right: the Y value seemed way off. I then noticed that Adobe Acrobat measures coordinates in this view from the top left instead of the bottom left, so the Y value has to be converted.

I solved this in code like this:

// RectangleJ takes (x, y, width, height), here converted from inches to points.
var rect = new RectangleJ(GetPostScriptPoints(4.19f),
    GetPostScriptPoints(GetInverseCoordinateInInches(pdfReader, 1, 1.42f)),
    GetPostScriptPoints(3.5f), GetPostScriptPoints(0.39f));

RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
var output = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);

// Converts inches to PostScript points (72 points per inch).
private float GetPostScriptPoints(float inch)
{
    return inch * 72;
}

// Acrobat's cursor Y is measured from the top of the page, so subtract it
// from the page height (in inches) to get the distance from the bottom.
private float GetInverseCoordinateInInches(PdfReader pdfReader, int pageIndex, float coordinateInInches)
{
    Rectangle mediabox = pdfReader.GetPageSize(pageIndex);
    return mediabox.Height / 72 - coordinateInInches;
}
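
If you measure the top edge of the box rather than the bottom edge, the height of the box has to be subtracted as well. Here is a minimal sketch of that arithmetic, with no iTextSharp dependency. The class name PdfCoordinates, the method FromTopLeftInches and the sample numbers are hypothetical, and it assumes the page's lower-left corner sits at (0, 0) and that the rectangle's (x, y) is its lower-left corner in PDF space:

```csharp
using System;

public static class PdfCoordinates
{
    private const float PointsPerInch = 72f;

    // Converts a box measured in inches from the TOP-left corner of the page
    // into (x, y, width, height) in points measured from the BOTTOM-left corner.
    // Assumption: the page origin is (0, 0) at the lower-left corner.
    public static (float X, float Y, float Width, float Height) FromTopLeftInches(
        float pageHeightInPoints, float left, float top, float width, float height)
    {
        float x = left * PointsPerInch;
        float w = width * PointsPerInch;
        float h = height * PointsPerInch;

        // Flip the Y axis: distance of the box's bottom edge from the page bottom.
        float y = pageHeightInPoints - top * PointsPerInch - h;

        return (x, y, w, h);
    }
}
```

For example, on a 792 pt high page (US Letter, assumed), PdfCoordinates.FromTopLeftInches(792f, 2.0f, 1.0f, 3.0f, 0.5f) gives x = 144, y = 684, width = 216, height = 36, which can then be passed to new RectangleJ(x, y, width, height).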

This worked, but I think it looks a little messy. I then used the Prepare Form tool in Adobe Acrobat Pro, where the Y coordinate showed up correctly under Text Field Properties. It could also show the box position and size in points right away.

This means I could write code like this instead:

var rect = new RectangleJ(301.68f, 738f, 252f, 28.08f);

RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
var output = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);

This was a lot cleaner and faster, so this is the way I chose to do it in the end.

See this answer if you would like to get a value from a specific location for every page in the document:

https://stackoverflow.com/a/20959388/3850405
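
For reference, here is a rough sketch of what reading the same region from every page could look like with the iTextSharp 5 calls used above. This is not taken from the linked answer; the class name RegionTextExtractor and the method ExtractFromEveryPage are hypothetical, and it assumes the region sits at the same coordinates on each page:

```csharp
using System.Collections.Generic;
using System.util;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class RegionTextExtractor
{
    // Returns the text found inside 'region' for each page, in page order.
    // Assumption: the same rectangle (in points, from the bottom left) applies to every page.
    public static List<string> ExtractFromEveryPage(string pdfPath, RectangleJ region)
    {
        var results = new List<string>();
        var reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Create a fresh strategy per page so text from earlier pages is not carried over.
                RenderFilter[] filter = { new RegionTextRenderFilter(region) };
                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(), filter);

                results.Add(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return results;
    }
}
```

Calling RegionTextExtractor.ExtractFromEveryPage(path, new RectangleJ(301.68f, 738f, 252f, 28.08f)) would then return one string per page; pages with no text in that area should yield an empty string.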

Upvotes: 3

Bruno Lowagie

Reputation: 77606

First you need the actual coordinates of the rectangle you marked in red. At a glance, I'd say the x value 144 (2 inches) is probably about right, but it would surprise me if the y value is 76, so you'll have to double-check.

Once you have the exact coordinates of the rectangle, you can use iText's text extraction functionality with a LocationTextExtractionStrategy, as is done in the ExtractPageContentArea example.

For the iTextSharp version of this example, see the C# port of the examples of chapter 15.

System.util.RectangleJ rect = new System.util.RectangleJ(70, 80, 420, 500);
RenderFilter[] filter = {new RegionTextRenderFilter(rect)};
ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
text = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);

Upvotes: 11