prak

Reputation: 153

How to get the coordinates of each word in a PDF?

For each word, I create an object of the LocationTextExtractionStrategy class to get its coordinates. The problem is that each time I pass in a word, it returns the coordinates of every chunk in the PDF that contains that word. How can I get the coordinates of the word present at a specific position or on a specific line?

I found this code somewhere:

using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf.parser;

namespace PDFAnnotater
{
    // Pairs a piece of matched text with its bounding rectangle on the page.
    public class RectAndText
    {
        public iTextSharp.text.Rectangle Rect;
        public string Text;

        public RectAndText(iTextSharp.text.Rectangle rect, string text)
        {
            this.Rect = rect;
            this.Text = text;
        }
    }

    public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
    {
        // One entry per chunk in which the search text was found
        public List<RectAndText> myPoints = new List<RectAndText>();

        // The string to look for in each chunk
        public string TextToSearchFor { get; set; }

        // How to compare, e.g. CompareOptions.IgnoreCase
        public System.Globalization.CompareOptions CompareOptions { get; set; }

        public MyLocationTextExtractionStrategy(string textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None)
        {
            this.TextToSearchFor = textToSearchFor;
            this.CompareOptions = compareOptions;
        }

        // Called by the parser once for every chunk of text on the page
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);

            // Nothing to search for; this also keeps First()/Last() below from throwing on an empty list
            if (string.IsNullOrEmpty(this.TextToSearchFor))
            {
                return;
            }

            // See if the current chunk contains the text.
            // Only the first match inside a chunk is recorded, and text split across chunks is not found.
            var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

            // If not found, bail
            if (startPosition < 0)
            {
                return;
            }

            // Grab the individual characters that make up the match
            var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

            // Grab the first and last character
            var firstChar = chars.First();
            var lastChar = chars.Last();

            // Get the bounding box for the matched text
            var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
            var topRight = lastChar.GetAscentLine().GetEndPoint();

            // Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]
            );

            // Record the match together with its location
            this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
        }
    }
}

I pass words from an array to look up their coordinates. The problem is that the RenderText() method is called automatically for every chunk, so the list ends up holding the coordinates of every place the word occurs in the PDF. For example, if I need the coordinates of '0', it returns 23 rectangles. What should I change in the code to get the exact coordinates of the word I want?

Upvotes: 0

Views: 1872

Answers (1)

Joris Schellekens

Reputation: 9057

Your question is a bit confusing.

How can I get coordinates of the word present at specific position

In that statement you're basically asking "How can I get the coordinates of something whose coordinates I already know?", which is redundant.

I'm going to interpret your question as "How can I get the coordinates of a word, if I know the approximate location?"

I'm not familiar with C#, but I assume there are methods similar to the ones in Java for working with Rectangle objects.

Rectangle#intersects(Rectangle other)
Determines whether or not this Rectangle and the specified Rectangle intersect.

and

Rectangle#contains(Rectangle other)
Tests if the interior of the Shape entirely contains the specified Rectangle2D.

Then the code becomes trivially easy.

  1. Use LocationTextExtractionStrategy to fetch all the iText-based rectangles.
  2. Convert them to native rectangle objects (or write your own class).
  3. For every rectangle, test whether the given search region contains it, and keep only those that lie within the search region.
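
The three steps above can be sketched in Java, the language the method references above come from. This is only an illustration under assumptions: RegionFilter and WordBox are hypothetical names (not iText API), the boxes are assumed to have already been copied out of the iText rectangles, and java.awt.geom.Rectangle2D stands in for the "native rectangle". In C#, System.Drawing.RectangleF offers comparable Contains and IntersectsWith methods.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps only the word boxes that lie inside a search region.
class RegionFilter {

    // A word and its bounding box, already converted from the iText rectangle.
    // Coordinates are in PDF user space (origin at the bottom-left of the page).
    static final class WordBox {
        final String text;
        final Rectangle2D box;

        WordBox(String text, double llx, double lly, double urx, double ury) {
            this.text = text;
            this.box = new Rectangle2D.Double(llx, lly, urx - llx, ury - lly);
        }
    }

    // Returns the boxes that are fully contained in the search region.
    static List<WordBox> within(List<WordBox> boxes, Rectangle2D region) {
        List<WordBox> hits = new ArrayList<>();
        for (WordBox w : boxes) {
            if (region.contains(w.box)) {
                hits.add(w);
            }
        }
        return hits;
    }
}
```

If partially overlapping words should also count, region.intersects(w.box) can be used in place of contains. Note that Rectangle2D treats a box with zero width or height as empty, so such a box is never reported as contained.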

If you want to implement your second use-case (getting the location of a word if you know the line) then there are two options:

  1. you know the rough coordinates of the line
  2. you want this to work given a line number

For option 1:

  • Build a search region. Use the bounds of the page to get an idea of the width (since the line could stretch over the entire width), and add some margin to the y-coordinates (to account for font differences, subscript and superscript, etc.).
  • Now that you have a search region, this reverts to my earlier answer.
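
A minimal Java sketch of such a search region follows. LineRegion is a hypothetical name, and the inputs (page width, approximate baseline y of the line, approximate font size) as well as the margin of half a font size are assumptions chosen for illustration, not values prescribed by iText.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: builds a full-width search strip around one line of text.
class LineRegion {

    // pageWidth: width of the page in points
    // lineY:     approximate baseline of the line (PDF y grows upward)
    // fontSize:  approximate font size used on that line
    static Rectangle2D around(double pageWidth, double lineY, double fontSize) {
        double margin = 0.5 * fontSize;          // assumption: half a font size of slack
        double bottom = lineY - margin;          // room for descenders and subscripts
        double top = lineY + fontSize + margin;  // room for ascenders and superscripts
        // The strip spans the whole page width because the line may do so as well.
        return new Rectangle2D.Double(0, bottom, pageWidth, top - bottom);
    }
}
```

The resulting rectangle can then be used as the search region in the containment test described earlier. The margin is a tuning knob: too small and superscripts fall outside, too large and neighbouring lines leak in.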

For option 2:

  • you already have the y coordinate of every word
  • round those (to the nearest multiple of the font size)
  • build a Map where you keep track of how many times a certain y-coordinate is used
  • remove any statistical outliers
  • put all these values in a List
  • sort the list

This should give you a rough idea of where you can expect a given line number to be.

Of course, similar to my earlier explanation, you will need to take into account some padding and some degree of flexibility to get the right answer.
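
The steps for option 2 could look roughly like this in Java. LineIndex is a hypothetical name, a single font size for the whole page is assumed, and "remove statistical outliers" is approximated here by dropping y-values shared by fewer than minWords words; a real document may need a more careful rule.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: estimates the y-coordinate of each text line on a page.
class LineIndex {

    // wordYs:   y-coordinate of every word found on the page
    // fontSize: assumed font size, used as the rounding step
    // minWords: y-values used by fewer words than this are treated as outliers
    // Returns the estimated line positions, topmost line first.
    static List<Double> lineYs(List<Double> wordYs, double fontSize, int minWords) {
        // Count how many words share each rounded y-coordinate.
        Map<Double, Integer> counts = new TreeMap<>();
        for (double y : wordYs) {
            double rounded = Math.round(y / fontSize) * fontSize;
            counts.merge(rounded, 1, Integer::sum);
        }

        // Keep only the y-values that are used often enough.
        List<Double> lines = new ArrayList<>();
        for (Map.Entry<Double, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minWords) {
                lines.add(e.getKey());
            }
        }

        // TreeMap iterates in ascending order; PDF y grows upward,
        // so reverse the list to put the first line of the page at index 0.
        Collections.reverse(lines);
        return lines;
    }
}
```

With this, lineYs(...).get(n - 1) gives an approximate y for line number n, which can be turned into a search region as in option 1. Words whose y sits close to a rounding boundary may land in a neighbouring bucket, which is one reason the padding mentioned above is needed.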

Upvotes: 3
