pseudorandom

Reputation: 152

R pdftools returning different units for PDF text coordinates

In the package pdftools, there are two functions pdf_data() (which works on pre-OCR'd PDF files) and pdf_ocr_data() (which will OCR a PDF file regardless of whether it is already OCR'd or not).

Per the question "Using the pdf_data function from the pdftools package efficiently", I have confirmed that the pdf_data() x and y fields are coordinates measured from the top left corner, which is 0,0. However, I'm not sure what the units are.

The documentation for the pdf_ocr_data() function explains the function's arguments, but not its output. While word and confidence seem relatively self-explanatory, I'm stuck on figuring out what the elements of bbox are. They seem to have something to do with the word coordinates; however, the example output I provided above shows the results for the first 2 words of the same PDF page, and as you can see the results differ from those of pdf_data(). In my testing, the first two values of bbox seem to relate to x and y, respectively, with the ratios bbox[1]/x and bbox[2]/y ranging from 8.4 to just over 9.

So my questions are as follows:

  1. What are the units of x and y as provided by pdf_data()?
  2. What are the values displayed in the bbox field produced by pdf_ocr_data?
  3. How are the values related, precisely?

P.S. I have not provided reproducible code here, since these are questions about the nature of the output rather than about troubleshooting or solving an issue. There is a forum post here: https://discuss.ropensci.org/t/text-vs-word-xy-coordinate-differences-between-pdf-data-and-pdf-ocr-data/3518 asking similar questions, but there are no replies to that post.

Upvotes: 1

Views: 313

Answers (1)

K J

Reputation: 11941

If we take the example data set, we can see that the values are the top left corners of the text tiles, so here the position reported by pdf_data() is 154 x 139.

This implies the text has an em size of 8.
However, if we inspect the source PDF, we see that the text is really set at 9.9626 points, at a scale unknown to us (we cannot know what a unit is, since units are not simple constants in a PDF). We can surmise that the file is designed for a media box of [0 0 612 792], which means the origin defaults to the lower left and the page is intended to be used as US Letter size.

BT
/F8 9.9626 Tf 154.69 646.077 Td [(Mazda)-333(RX4)]TJ
ET
BT
/F8 9.9626 Tf 154.69 633.724 Td [(Mazda)-333(RX4)-334(W)84(ag)]TJ
ET
BT
/F8 9.9626 Tf 154.69 621.37 Td [(Datsun)-333(710)]TJ
ET
BT
/F8 9.9626 Tf 154.69 609.016 Td [(Hornet)-333(4)-334(Driv)28(e)]TJ
ET
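As a rough illustration of reading those operators, the sketch below pulls the font size (Tf) and text position (Td) out of the excerpt above with a regular expression. It is written in Python purely for the arithmetic, and the pattern is an assumption that only fits this excerpt: real content streams are usually compressed, and Td offsets are relative to the current text line, so this is not a general PDF parser.

```python
import re

# Content stream excerpt quoted in the answer above.
stream = """\
BT
/F8 9.9626 Tf 154.69 646.077 Td [(Mazda)-333(RX4)]TJ
ET
BT
/F8 9.9626 Tf 154.69 633.724 Td [(Mazda)-333(RX4)-334(W)84(ag)]TJ
ET
BT
/F8 9.9626 Tf 154.69 621.37 Td [(Datsun)-333(710)]TJ
ET
BT
/F8 9.9626 Tf 154.69 609.016 Td [(Hornet)-333(4)-334(Driv)28(e)]TJ
ET
"""

# Matches "/<font> <size> Tf <x> <y> Td": a font selection followed by a position.
op = re.compile(r"/(\w+)\s+([\d.]+)\s+Tf\s+([\d.]+)\s+([\d.]+)\s+Td")

rows = [(m[1], float(m[2]), float(m[3]), float(m[4])) for m in op.finditer(stream)]
for font, size, x, y in rows:
    print(font, size, x, y)

# Vertical distance between consecutive lines of text, in the same units as Td.
baselines = [y for _, _, _, y in rows]
leading = [round(a - b, 3) for a, b in zip(baselines, baselines[1:])]
print(leading)
```

Every line starts at x = 154.69, and the y values decrease down the page, which is consistent with an origin at the lower left.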

If we inspect font F8, we see it is Computer Modern 10 point, so we can say those letters are at 99.6% of their original true scale. If we multiply 154.69 by that ratio, we get about 154.1, which matches the 154 units from the left reported above.

For the height, we can subtract 139 from 792, so the top left is at 653 above the origin. If we take the lower left as being at 646.077, there is an awkward difference of 6.923, which is not near the letter height, so we can presume the scaling difference between the two unit bases is again at play.
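The arithmetic in the last two paragraphs can be checked with a short sketch. It uses only numbers quoted in this answer; the variable names are illustrative, and the page height of 792 is the surmised media box rather than a value read from the file.

```python
# Values quoted in the answer; names are illustrative only.
PAGE_HEIGHT = 792.0             # surmised media box [0 0 612 792]
pdf_data_x, pdf_data_y = 154, 139
td_x, td_y = 154.69, 646.077    # Td operands of the first text line
font_size = 9.9626              # Tf operand
design_size = 10.0              # Computer Modern 10 point

# Horizontal: scale the Td x position by the font scale ratio.
scale = font_size / design_size
scaled_x = td_x * scale
print(round(scale, 5), round(scaled_x, 2))

# Vertical: flip the top-left y reported by pdf_data() to a bottom-left origin,
# then compare with the Td y position.
top_above_origin = PAGE_HEIGHT - pdf_data_y
gap = top_above_origin - td_y
print(top_above_origin, round(gap, 3))
```

The horizontal value truncates to the reported 154, while the vertical gap of 6.923 stays below the 9.9626 font size, which is the mismatch described above.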

For OCR, the work areas for single letters, or for letters grouped as part words or full words, can vary considerably, since glyphless text will not be as consistently scaled as a font with glyphs.

The height of a theoretical bounding box may not reflect the true height; it could be 1 point high for a 10 point character. OCR coordinates should therefore always be compared to the expected source output within the resultant file.

Now if we compare the apples above with OCR pears, none of the scales will be the same, as they are approximations.
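For a nominal comparison, one could try converting the OCR values to points before comparing them. The sketch below rests on two assumptions that this answer does not confirm: that bbox is a comma-separated "x1,y1,x2,y2" string of pixel positions on the rendered page image, and that the page was rendered at 600 dpi. The bbox string and the helper name bbox_to_points are made up for illustration.

```python
POINTS_PER_INCH = 72.0

def bbox_to_points(bbox, dpi=600.0):
    """Convert an 'x1,y1,x2,y2' pixel string to points (ASSUMED bbox layout and dpi)."""
    return tuple(float(v) * POINTS_PER_INCH / dpi for v in bbox.split(","))

# Nominal pixels per point at the assumed render resolution.
pixels_per_point = 600.0 / POINTS_PER_INCH
print(round(pixels_per_point, 2))

# Hypothetical bbox string, not taken from any real output.
hypothetical_bbox = "1289,1158,1500,1240"
print(bbox_to_points(hypothetical_bbox))
```

Under those assumptions the nominal factor is about 8.33 pixels per point, which is in the neighbourhood of the 8.4 to 9 ratios reported in the question; whatever is left over would be the approximation described above, since the OCR box is fitted to the rendered pixels and not to the font metrics.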

Upvotes: 0
