RAHIL KAZI

Reputation: 13

How to convert text coordinates on a PDF page from a lower-left origin to an upper-left origin

I am processing a PDF with both PDFBox and iTextSharp, and I want the coordinates of the text that lies inside a rectangle. I extract the rectangle coordinates with iTextSharp, which uses a coordinate system with its origin at the lower left. I extract the page text with PDFBox, which reports coordinates with the origin at the upper left. I need help converting coordinates from the lower-left system to the upper-left system.

Update to my question

My apologies if the question was unclear or incomplete.

Let me give more details from the start.

I am working on a tool that receives a PDF in which a rectangle has been drawn as a drawing markup (a comment annotation). I read the rectangle coordinates with iTextSharp:

PdfDictionary pageDict = pdReader.GetPageN(page_no);
PdfArray annotArray = pageDict.GetAsArray(PdfName.ANNOTS);

where pdReader is a PdfReader.

The page text and its coordinates are extracted with PDFBox. I created a class, pdfBoxTextExtraction, that processes the text and coordinates and returns the text together with llx, lly, urx, ury line by line (note: per line, not per sentence).

I want to extract the text that lies within the rectangle. I got stuck because the rectangle coordinates returned by iTextSharp (llx, lly, urx, ury) have their origin at the lower left, whereas the text coordinates returned by PDFBox have their origin at the upper left. I then realised that I need to adjust the y-axis so that the origin moves from the lower left to the upper left. For that, I get the page size (media box) and the crop box of the page:

iTextSharp.text.Rectangle mediabox = reader.GetPageSize(page_no);
iTextSharp.text.Rectangle cropbox = reader.GetCropBox(page_no);

I then made a basic adjustment:

lly = mediabox.Top - lly

ury = mediabox.Top - ury

In some cases this adjustment worked, whereas for some PDFs I needed to base the adjustment on the crop box:

lly = cropbox.Top - lly

ury = cropbox.Top - ury

For other PDFs, neither adjustment worked.

All I need is help in adjusting the rectangle coordinates so that I get the text within the rectangle.

Upvotes: 1

Views: 4293

Answers (2)

RAHIL KAZI

Reputation: 13

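    // mediabox.Top - mediabox.Height equals the bottom edge of the media box.
    if ((mediabox.Top - mediabox.Height) != 0)
    {
        // The media box does not start at y = 0.
        topY = mediabox.Top;
        heightY = mediabox.Height;
        diffY = topY - heightY;              // bottom edge of the media box
        lly_adjust = (topY - ury) + diffY;
        ury_adjust = (topY - lly) + diffY;
    }
    else if ((cropbox.Top - cropbox.Height) != 0)
    {
        // The crop box does not start at y = 0.
        topY = mediabox.Top;
        heightY = cropbox.Top;
        diffY = topY - heightY;              // gap between the two top edges
        lly_adjust = (topY - ury) - diffY;   // simplifies to cropbox.Top - ury
        ury_adjust = (topY - lly) - diffY;   // simplifies to cropbox.Top - lly
    }
    else
    {
        // Both boxes start at y = 0: flip about the top of the media box.
        lly_adjust = mediabox.Top - ury;
        ury_adjust = mediabox.Top - lly;
    }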

These are the final adjustments I made.

Upvotes: 0

Bruno Lowagie

Reputation: 77528

The coordinate system in PDF is defined in ISO-32000-1. This ISO standard explains that the X-axis is oriented towards the right, whereas the Y-axis has an upward orientation. This is the default. These are the coordinates that are returned by iText (behind the scenes, iText resolves all CTM transformations).

If you want to transform the coordinates returned by iText so that you get coordinates in a coordinate system where the Y axis has a downward orientation, you could for instance subtract the Y value returned by iText from the Y-coordinate of the top of the page.

An example: Suppose that we are dealing with an A4 page, where the Y coordinate of the bottom is 0 and the Y coordinate of the top is 842. If you have Y coordinates such as y1 = 806 and y2 = 36, then you can do this:

y = 842 - y;

Now y1 = 36 and y2 = 806. You have just reversed the orientation of the Y-axis using nothing more than simple high-school math.
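The flip above can be sketched in a few lines. This is plain Java with no PDF library involved; the names YAxisFlip and flipY are hypothetical, and the page top of 842 used in the note below is simply the A4 value from the example.

```java
final class YAxisFlip {
    /**
     * Reflects a y coordinate about the top edge of the page, so that a
     * value measured upward from the bottom becomes a value measured
     * downward from the top. pageTop is an assumption supplied by the caller.
     */
    static double flipY(double pageTop, double y) {
        return pageTop - y;
    }
}
```

With a page top of 842, flipY(842, 806) gives 36 and flipY(842, 36) gives 806. Applying the flip twice returns the original value, so the same function converts in both directions.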

Update based on an extra comment:

Each page has a media box. This defines the most important page boundaries. Other page boundaries may be present, but none of them shall exceed the media box (if they do, then your PDF is in violation of ISO-32000-1).

The crop box defines the visible area of the page. By default (for instance if a crop box entry is missing), the crop box coincides with the media box.
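As a rough sketch of that default rule, the helper below picks the box to use as the reference for a Y flip. It is plain Java with no PDF library; PageBoxes, effectiveCropBox, and top are hypothetical names, and representing a box as an array {llx, lly, urx, ury} is an assumption made only for this illustration.

```java
final class PageBoxes {
    /**
     * Returns the crop box when the page defines one, otherwise the media
     * box. Boxes are {llx, lly, urx, ury} in default PDF user space; a
     * missing crop box is represented here as null (an assumption).
     */
    static double[] effectiveCropBox(double[] mediaBox, double[] cropBox) {
        return cropBox != null ? cropBox : mediaBox;
    }

    /** Top edge of a box, tolerating corners given in either order. */
    static double top(double[] box) {
        return Math.max(box[1], box[3]);
    }
}
```

Note that the top edge alone does not describe a box whose lower-left corner is not at (0, 0); the lower-left x and y values matter as well when coordinates are shifted.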

In your comment, you say that you subtract llx from the height. This is incorrect. llx is the lower-left x coordinate, whereas the height is a property measured on the Y axis, unless the page is rotated. Did you check if the page dictionary has a /Rotate value?

You also claim that the values returned by iText do not match the values returned by PdfBox. Note that the values returned by iText conform with the coordinate system as defined by the ISO standard. If PdfBox doesn't follow this standard, you should ask the people from PdfBox why they didn't follow the standard, and what coordinate system they are using instead.

Maybe that's what mkl's comment is about. He wrote:

Y' = Ymax - Y. X' = X - Xmin.

Maybe PdfBox searches for the maximum Y value Ymax and the minimum X value Xmin and then applies the above transformation on all coordinates. This is a useful transformation if you want to render a PDF, but it's unwise to perform such an operation if you want to use the coordinates, for instance to add content at specific positions relative to text on the page (because the transformed coordinates are no longer "PDF" coordinates).
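To make that transformation concrete, here is a minimal sketch that applies Y' = Ymax - Y and X' = X - Xmin to a whole rectangle. It assumes that Xmin and Ymax are taken from a reference box such as the crop box and that the page is not rotated; whether PdfBox really uses that box is not confirmed here. TopLeftRect, toTopLeft, and contains are hypothetical names, and the code is plain Java with no PDF library.

```java
final class TopLeftRect {
    /**
     * Converts a rectangle {llx, lly, urx, ury} with a lower-left origin
     * into {left, top, right, bottom} with a top-left origin, relative to
     * the reference box {llx, lly, urx, ury}. Assumes an unrotated page.
     */
    static double[] toTopLeft(double[] rect, double[] box) {
        double xMin = box[0];
        double yMax = box[3];
        double left = rect[0] - xMin;
        double right = rect[2] - xMin;
        // After the flip the upper edge has the smaller y value,
        // so ury produces the top and lly produces the bottom.
        double top = yMax - rect[3];
        double bottom = yMax - rect[1];
        return new double[] {left, top, right, bottom};
    }

    /** True if the point (x, y), in top-left coordinates, lies inside r. */
    static boolean contains(double[] r, double x, double y) {
        return x >= r[0] && x <= r[2] && y >= r[1] && y <= r[3];
    }
}
```

For a reference box {10, 20, 605, 812} and a rectangle {100, 700, 200, 750}, the result is {90, 62, 190, 112}. The lower and upper y values trade places during the flip, which is easy to miss when only a single y coordinate is being converted.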

Remark:

You say you need PdfBox to get the text of a page. Why do you need this extra tool? iText is perfectly capable of extracting and reordering the text on a page (assuming that you use the correct extraction strategy). If not, please clarify.

Upvotes: 1
