Reputation: 152
In the pdftools package, there are two functions: pdf_data() (which works on PDF files that have already been OCR'd) and pdf_ocr_data() (which will OCR a PDF file regardless of whether it has already been OCR'd).
pdf_data() results in a list of tibbles, each with 6 fields: width, height, x, y, space, and text. Example output:
# A tibble: 2 x 6
  width height     x     y space text
  <int>  <int> <int> <int> <lgl> <chr>
1    51     12    15    65 TRUE  Text1
2    59     12    70    65 FALSE Text2
pdf_ocr_data() results in a list of tibbles, each with 3 fields: word, confidence, and bbox. Example output:
# A tibble: 2 x 3
  word  confidence bbox
  <chr>      <dbl> <chr>
1 Text1       96.8 136,546,551,647
2 Text2       96.7 590,545,1078,625
Per the question "Using the pdf_data function from the pdftools package efficiently", I have confirmed that the pdf_data() x and y fields are coordinates measured from the top-left corner, which is 0,0. However, I'm not sure what the units are.
The documentation for the pdf_ocr_data() function explains the function's arguments, but not its output. While word and confidence seem relatively self-explanatory, I'm stuck on figuring out what the elements of bbox are. They seem to have something to do with the word coordinates; however, the example outputs above are for the first 2 words of the same PDF page, and as you can see, the values differ between the two functions. In my testing, the first two values of bbox seem to relate to x and y, respectively, with ratios of bbox[1]/x and bbox[2]/y ranging from 8.4 to just over 9.
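For reference, the ratio check I describe can be sketched in base R using only the two sample rows shown above. The column names b1 to b4 are placeholders I made up for the four comma-separated bbox values, since what they mean is exactly what I am asking; the data frames are hand-copied from the outputs, not produced by calling pdftools.

```r
# Sample rows hand-copied from the two outputs above (not a pdftools call).
text_layer <- data.frame(
  x = c(15L, 70L),
  y = c(65L, 65L),
  text = c("Text1", "Text2"),
  stringsAsFactors = FALSE
)
ocr <- data.frame(
  word = c("Text1", "Text2"),
  bbox = c("136,546,551,647", "590,545,1078,625"),
  stringsAsFactors = FALSE
)

# Split each bbox string on commas into four numeric columns.
# b1..b4 are placeholder names; their meaning is the open question.
parts <- do.call(rbind, lapply(strsplit(ocr$bbox, ",", fixed = TRUE), as.numeric))
colnames(parts) <- paste0("b", 1:4)

# Ratio of the first two bbox values to the pdf_data() x and y values.
ratios <- data.frame(
  word = ocr$word,
  ratio_x = parts[, "b1"] / text_layer$x,
  ratio_y = parts[, "b2"] / text_layer$y,
  stringsAsFactors = FALSE
)
print(ratios)
```

For these two rows the ratios fall between roughly 8.4 and 9.1, which is the spread I mention above.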
So my questions are as follows:
1. What are the units of the x and y fields returned by pdf_data()?
2. What do the elements of bbox represent in the output of pdf_ocr_data?
P.S. I have not provided reproducible code here since these are questions regarding the nature of the output and not questions regarding troubleshooting or how to solve an issue. There is a forum post here: https://discuss.ropensci.org/t/text-vs-word-xy-coordinate-differences-between-pdf-data-and-pdf-ocr-data/3518 asking similar questions, but there are no replies to that post.
Upvotes: 1
Views: 313
Reputation: 11941
If we take the example data set, we can see that the values are the top-left corners of the text tiles; there, pdf_data() reports the first word at 154 x 139.
This implies the text has an em size of 8.
However, if we inspect the source PDF, we see that the text is actually set at 9.9626 points, at a scale that is unknown to us (we cannot know what one unit is, since units are not simple constants in a PDF). Thus we can surmise that the file is designed for a media box of [0 0 612 792], which means the origin defaults to the lower left and the page is intended to be used as American Letter size.
BT
/F8 9.9626 Tf 154.69 646.077 Td [(Mazda)-333(RX4)]TJ
ET
BT
/F8 9.9626 Tf 154.69 633.724 Td [(Mazda)-333(RX4)-334(W)84(ag)]TJ
ET
BT
/F8 9.9626 Tf 154.69 621.37 Td [(Datsun)-333(710)]TJ
ET
BT
/F8 9.9626 Tf 154.69 609.016 Td [(Hornet)-333(4)-334(Driv)28(e)]TJ
ET
If we inspect font /F8, we see it is Computer Modern 10 point, so the letters are at 99.6% of their original true scale. If we multiply 154.69 by that ratio, we get about 154.11, which matches the 154 units from the left reported above.
For the height, we can subtract 139 from 792, so the top left is at 653 above the origin. If we take the lower left as 646.077, there is an awkward difference of 6.923, which is not near the letter height, so we can presume that the scaling difference between the two unit bases is again at play.
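The arithmetic above can be reproduced with a few lines of base R. This is only a sketch of the numbers quoted in this answer: the page height of 792 is the surmised media box, the design size of 10 comes from the font inspection, and the variable names are made up for illustration. It is not a statement of how pdftools computes its values internally.

```r
# Values read from the content stream "/F8 9.9626 Tf 154.69 646.077 Td"
font_size   <- 9.9626
td_x        <- 154.69
td_y        <- 646.077

# Assumptions taken from this answer, not from pdftools itself
design_size <- 10     # Computer Modern 10 point
page_height <- 792    # surmised media box [0 0 612 792]
reported_y  <- 139    # y reported by pdf_data(), measured from the top

# Horizontal position: scale the Td x offset by font size / design size.
scale_ratio <- font_size / design_size   # 0.99626, i.e. about 99.6%
scaled_x    <- td_x * scale_ratio        # about 154.11

# Vertical position: convert the top-left based y to a bottom-left based one,
# then compare it with the baseline given by Td.
top_from_bottom <- page_height - reported_y   # 653
gap             <- top_from_bottom - td_y     # 6.923

print(c(scaled_x = scaled_x, top_from_bottom = top_from_bottom, gap = gap))
```

Truncating scaled_x gives the reported 154, while the 6.923 gap is the part that does not line up with either the 8 unit em size or the 9.9626 point font size.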
For OCR, the work areas for single letters, part words, or full words can vary considerably, since glyphless text will not be as consistently scaled as text set in a font with glyphs.
The height of a theoretical bounding box may not reflect the true height; it could be 1 point high for a 10 point character. Thus OCR coordinates should always be compared to the expected source output within the resultant file.
Now, if we compare the apples above with the OCR pears, none of the scales will be the same, as they are approximations.
Upvotes: 0