R pdftools returning different units for PDF text coordinates

Question

The pdftools package has two functions: pdf_data(), which works on pre-OCR'd PDF files, and pdf_ocr_data(), which will OCR a PDF file regardless of whether it has already been OCR'd.

pdf_data() returns a list of tibbles, each with 6 fields: width, height, x, y, space, and text. Example output:

    # A tibble: 2 x 6
      width height     x     y space text
      <int>  <int> <int> <int> <lgl> <chr>
    1    51     12    15    65 TRUE  Text1
    2    59     12    70    65 FALSE Text2

pdf_ocr_data() returns a list of tibbles, each with 3 fields: word, confidence, and bbox. Example output:

    # A tibble: 2 x 3
      word  confidence bbox
      <chr>      <dbl> <chr>
    1 Text1       96.8 136,546,551,647
    2 Text2       96.7 590,545,1078,625
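
To make the bbox field easier to inspect, here is a minimal base R sketch that splits each bbox string into four numbers. It uses only the two sample rows shown above, so it runs without pdftools. The column names x1, y1, x2, y2 are hypothetical, and the left, top, right, bottom ordering is an assumption, not something the pdftools documentation states.

```r
# Sample bbox strings copied from the output above.
ocr_words <- data.frame(
  word = c("Text1", "Text2"),
  bbox = c("136,546,551,647", "590,545,1078,625"),
  stringsAsFactors = FALSE
)

# Split each string on commas and convert the pieces to numbers.
parts <- do.call(rbind, strsplit(ocr_words$bbox, ",", fixed = TRUE))
storage.mode(parts) <- "double"

# Assumed order: left, top, right, bottom (column names are hypothetical).
colnames(parts) <- c("x1", "y1", "x2", "y2")
ocr_boxes <- cbind(ocr_words["word"], as.data.frame(parts))

# Box size under that assumption.
ocr_boxes$box_width  <- ocr_boxes$x2 - ocr_boxes$x1
ocr_boxes$box_height <- ocr_boxes$y2 - ocr_boxes$y1
print(ocr_boxes)
```

If the assumed ordering holds, box_width and box_height can be compared directly with the width and height columns from pdf_data() for the same words.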

Per the question "Using the pdf_data function from the pdftools package efficiently", I have confirmed that the x and y fields from pdf_data() are coordinates measured from the top-left corner of the page, which is (0, 0). However, I'm not sure what the units are.

The documentation for pdf_ocr_data() explains the function's arguments but not its output. While word and confidence seem relatively self-explanatory, I'm stuck on what the elements of bbox are. They appear to be word coordinates; however, the two example outputs above are for the first two words of the same PDF page, and the values differ. In my testing, the first two values of bbox seem to correspond to x and y, respectively, with the ratios bbox[1]/x and bbox[2]/y ranging from 8.4 to just over 9.
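
The ratio check described above can be sketched in base R using the sample values from the two outputs. The last lines compare the ratios against one possible explanation, which is an assumption and not confirmed by the documentation: that pdf_data() reports PDF points (72 per inch) while bbox is in pixels of the image rendered for OCR. The assumed_dpi value of 600 is a placeholder for whatever dpi was actually passed to pdf_ocr_data().

```r
# Values copied from the two sample outputs in the question.
x <- c(15, 70)
y <- c(65, 65)
bbox <- c("136,546,551,647", "590,545,1078,625")

# First two numbers of each bbox string, one row per word.
first_two <- t(sapply(strsplit(bbox, ",", fixed = TRUE),
                      function(p) as.numeric(p[1:2])))

ratios <- data.frame(
  x_ratio = first_two[, 1] / x,
  y_ratio = first_two[, 2] / y
)
print(round(ratios, 2))

# Assumption: points (72 per inch) vs. pixels at the OCR render resolution.
assumed_dpi <- 600  # hypothetical; use the dpi actually passed to pdf_ocr_data()
print(assumed_dpi / 72)
```

Under that assumption the expected ratio would be dpi / 72, and small deviations could come from OCR drawing slightly different word boxes than the embedded text layer.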

So my questions are as follows:

What are the units of x and y as provided by pdf_data()?
What are the values displayed in the bbox field produced by pdf_ocr_data()?
How are the values related, precisely?

P.S. I have not provided reproducible code here since these are questions about the nature of the output, not about troubleshooting or how to solve an issue. There is a forum post asking similar questions at https://discuss.ropensci.org/t/text-vs-word-xy-coordinate-differences-between-pdf-data-and-pdf-ocr-data/3518, but it has no replies.
