sariDon

Reputation: 7981

Embed/Insert/Add JSON OCR data generated by 'Google Cloud Vision (OCR)' inside a PDF file and make the PDF searchable

I am using the Google Cloud Vision API (OCR) to detect text in PDF files through the PHP API library. The OCR works well, and I have saved the complete set of JSON output files (e.g. output-1-to-2.json) with the full OCR data (positional details, confidence, full text, etc.).

Here is a sample JSON output (file: output-1-to-2.json) for a simple two-page PDF containing the word 'April' on page 1 and 'May' on page 2 (as images):

{
   "inputConfig":{
      "gcsSource":{
         "uri":"gs://my-ocr-bucket/php8723/sample.pdf"
      },
      "mimeType":"application/pdf"
   },
   "responses":[
      {
         "fullTextAnnotation":{
            "pages":[
               {
                  "property":{
                     "detectedLanguages":[
                        {
                           "languageCode":"en",
                           "confidence":1
                        }
                     ]
                  },
                  "width":595,
                  "height":841,
                  "blocks":[
                     {
                        "boundingBox":{
                           "normalizedVertices":[
                              {
                                 "x":0.0789916,
                                 "y":0.049940545
                              },
                              {
                                 "x":0.11596639,
                                 "y":0.049940545
                              },
                              {
                                 "x":0.11596639,
                                 "y":0.059453033
                              },
                              {
                                 "x":0.0789916,
                                 "y":0.060642093
                              }
                           ]
                        },
                        "paragraphs":[
                           {
                              "boundingBox":{
                                 "normalizedVertices":[
                                    {
                                       "x":0.0789916,
                                       "y":0.049940545
                                    },
                                    {
                                       "x":0.11596639,
                                       "y":0.049940545
                                    },
                                    {
                                       "x":0.11596639,
                                       "y":0.059453033
                                    },
                                    {
                                       "x":0.0789916,
                                       "y":0.060642093
                                    }
                                 ]
                              },
                              "words":[
                                 {
                                    "property":{
                                       "detectedLanguages":[
                                          {
                                             "languageCode":"en",
                                             "confidence":1
                                          }
                                       ]
                                    },
                                    "boundingBox":{
                                       "normalizedVertices":[
                                          {
                                             "x":0.0789916,
                                             "y":0.049940545
                                          },
                                          {
                                             "x":0.11596639,
                                             "y":0.049940545
                                          },
                                          {
                                             "x":0.11596639,
                                             "y":0.059453033
                                          },
                                          {
                                             "x":0.0789916,
                                             "y":0.060642093
                                          }
                                       ]
                                    },
                                    "symbols":[
                                       {
                                          "text":"A",
                                          "confidence":0.98833746
                                       },
                                       {
                                          "text":"p",
                                          "confidence":0.9870904
                                       },
                                       {
                                          "text":"r",
                                          "confidence":0.99477327
                                       },
                                       {
                                          "text":"i",
                                          "confidence":0.9951743
                                       },
                                       {
                                          "property":{
                                             "detectedBreak":{
                                                "type":"LINE_BREAK"
                                             }
                                          },
                                          "text":"l",
                                          "confidence":0.98942703
                                       }
                                    ],
                                    "confidence":0.9909605
                                 }
                              ],
                              "confidence":0.9909605
                           }
                        ],
                        "blockType":"TEXT",
                        "confidence":0.9909605
                     }
                  ],
                  "confidence":0.9909605
               }
            ],
            "text":"April"
         },
         "context":{
            "uri":"gs://my-ocr-bucket/php8723/sample.pdf",
            "pageNumber":1
         }
      },
      {
         "fullTextAnnotation":{
            "pages":[
               {
                  "width":595,
                  "height":841,
                  "blocks":[
                     {
                        "boundingBox":{
                           "normalizedVertices":[
                              {
                                 "x":0.0789916,
                                 "y":0.05469679
                              },
                              {
                                 "x":0.11092437,
                                 "y":0.05588585
                              },
                              {
                                 "x":0.11092437,
                                 "y":0.065398335
                              },
                              {
                                 "x":0.07731093,
                                 "y":0.064209275
                              }
                           ]
                        },
                        "paragraphs":[
                           {
                              "boundingBox":{
                                 "normalizedVertices":[
                                    {
                                       "x":0.0789916,
                                       "y":0.05469679
                                    },
                                    {
                                       "x":0.11092437,
                                       "y":0.05588585
                                    },
                                    {
                                       "x":0.11092437,
                                       "y":0.065398335
                                    },
                                    {
                                       "x":0.07731093,
                                       "y":0.064209275
                                    }
                                 ]
                              },
                              "words":[
                                 {
                                    "boundingBox":{
                                       "normalizedVertices":[
                                          {
                                             "x":0.0789916,
                                             "y":0.05469679
                                          },
                                          {
                                             "x":0.11092437,
                                             "y":0.05588585
                                          },
                                          {
                                             "x":0.11092437,
                                             "y":0.065398335
                                          },
                                          {
                                             "x":0.07731093,
                                             "y":0.064209275
                                          }
                                       ]
                                    },
                                    "symbols":[
                                       {
                                          "text":"M",
                                          "confidence":0.98251665
                                       },
                                       {
                                          "text":"a",
                                          "confidence":0.9763874
                                       },
                                       {
                                          "property":{
                                             "detectedBreak":{
                                                "type":"LINE_BREAK"
                                             }
                                          },
                                          "text":"y",
                                          "confidence":0.9850642
                                       }
                                    ],
                                    "confidence":0.98132277
                                 }
                              ],
                              "confidence":0.98132277
                           }
                        ],
                        "blockType":"TEXT",
                        "confidence":0.98132277
                     }
                  ],
                  "confidence":0.98132277
               }
            ],
            "text":"May"
         },
         "context":{
            "uri":"gs://my-ocr-bucket/php8723/sample.pdf",
            "pageNumber":2
         }
      }
   ]
}

Now I am stuck on embedding this OCR data (the JSON files) in the PDF to make it searchable. That means I need to edit the PDF and add the OCR data to it.

My question: how can I insert the JSON-formatted OCR data generated by Google Vision into a PDF file and make the PDF searchable?

Thank you for reading this far.

PS: I have generated a crude hOCR file by parsing the above JSON (using PHP json_decode), following the hOCR standard as best I could from the available examples. Can this hOCR file be embedded in the PDF?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='gcv2hocr' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
<div class="ocr_page" id="page_1" title="bbox 0 0 595 841; ppageno 0">
    <p class="ocr_par" id="par_0_0" lang="bn" title="bbox 47 41 69 51">
        <span class="ocrx_word" title="bbox 47 41 69 51; x_wconf 0">April</span> 
    </p>
</div>
<div class="ocr_page" id="page_2" title="bbox 0 0 595 841; ppageno 1">
    <p class="ocr_par" id="par_1_0" lang="bn" title="bbox 47 46 66 54">
        <span class="ocrx_word" title="bbox 47 46 66 54; x_wconf 0">May</span> 
    </p>
</div>
</body>
</html>

Upvotes: 1

Views: 161

Answers (1)

K J

Reputation: 11857

JSON is an excessively verbose way to store PDF text data. If we strip out all the easily deleted parts in a text editor, the roughly 10,600 characters, most of which carry nothing, can be reduced to the more compact 600 bytes below.

If the confidence is below 95%, the text is probably not worth including, and as you can see, at this point the whole lot could be made smaller still.

Filename /sample.pdf

TextAnnotation
Page  /MediaBox[595 841]
"Area":[ {"x":47 "y":42} {"x":69 "y":42} {"x":69 "y":50} {"x":47 "y":51} ]
"symbols":[ { "text":"A" } { "text":"p" } { "text":"r" } { "text":"i" }
"property":{ "detectedBreak":{ "type":"LINE_BREAK" } } "text":"l" }
"text":"April"
"context":{ "uri":"sample.pdf", "pageNumber":1 }

TextAnnotation
Page  /MediaBox[595 841]
"Area":[{ "x":47 "y":46} { "x":66 "y":47} { "x":66 "y":55} { "x":46 "y":54}
"symbols": { "text":"M" } { "text":"a" } 
"property":{ "detectedBreak":{ "type":"LINE_BREAK" } }    "text":"y" }
"text":"May" 
"context":{ "uri":"sample.pdf", "pageNumber":2 }

Thus your program simply needs to find "pageNumber":1, look at what precedes it, such as "text":"April", and use the boundary values, which need converting to PDF units by multiplying the normalized positions by the MediaBox width and height.

Result

Filename /sample.pdf

"pageNumber":1
Page  /MediaBox[595 841]

TextAnnotation
"text":"April"
"Height":8.5
"Position": {"x":47 "y":42} 

"pageNumber":2
Page  /MediaBox[595 841]

TextAnnotation
"text":"May" 
"Height":10
"Position":{"x":47 "y":46}
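
The reduction described above can be sketched in a few lines of Python. This is only a rough illustration, not a definitive implementation: the function name words_in_pdf_units, the dictionary keys it returns, and the 0.95 default threshold are my own choices for the example. It assumes the JSON layout shown in the question (one response per page, with normalizedVertices on each word).

```python
import json


def words_in_pdf_units(response, min_confidence=0.95):
    """Return one dict per OCR word of a single page response, in PDF points."""
    page_number = response["context"]["pageNumber"]
    words = []
    for page in response["fullTextAnnotation"]["pages"]:
        page_w, page_h = page["width"], page["height"]
        for block in page["blocks"]:
            for paragraph in block["paragraphs"]:
                for word in paragraph["words"]:
                    if word.get("confidence", 0.0) < min_confidence:
                        continue  # skip words below the chosen threshold
                    text = "".join(s["text"] for s in word["symbols"])
                    verts = word["boundingBox"]["normalizedVertices"]
                    # Normalized 0..1 values scaled by the page size.
                    xs = [v.get("x", 0.0) * page_w for v in verts]
                    ys = [v.get("y", 0.0) * page_h for v in verts]
                    words.append({
                        "page": page_number,
                        "text": text,
                        "x": min(xs),
                        "top": min(ys),                # measured from the page top
                        "height": max(ys) - min(ys),
                        "baseline": page_h - max(ys),  # measured from the page bottom
                    })
    return words


def load_words(path, min_confidence=0.95):
    """Hypothetical helper: read one Vision output file and flatten all pages."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    found = []
    for response in data["responses"]:
        found.extend(words_in_pdf_units(response, min_confidence))
    return found
```

For the sample file, load_words("output-1-to-2.json") should give one entry for "April" and one for "May". The "baseline" value is an extra step not spelled out above: the OCR positions are measured from the top of the page, while default PDF user space starts at the bottom-left corner, so a writer that places text directly in PDF coordinates would want the flipped value.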

Now any library should be able to write those 2 entries as 2 separate pages. The whole text output (as a PDF) should be only 931 bytes, and that is far more than is really needed.

%PDF-1.3
1 0 obj<</Lang (en-GB)/Pages 2 0 R/Type/Catalog/ViewerPreferences<</DisplayDocTitle true/Type/ViewerPreferences>>>>endobj
2 0 obj<</Count 2/Kids [ 3 0 R 4 0 R ]/Type/Pages>>endobj
3 0 obj<</Contents 6 0 R/MediaBox [ 0 0 595 841 ]/Parent 2 0 R/Resources<</Font<</F0 5 0 R>>>>/Type/Page>>endobj
4 0 obj<</Contents 7 0 R/MediaBox [ 0 0 595 841 ]/Parent 2 0 R/Resources<</Font<</F0 5 0 R>>>>/Type/Page>>endobj
5 0 obj<</BaseFont/CourierNew,Bold/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>endobj
6 0 obj<</Length 58>>stream
q 1 0 0 rg BT 100 Tz /F0 9 Tf 47 42 Td [(Apri)4(l)]TJ ET Q
endstream
endobj
7 0 obj<</Length 52>>stream
q 0 g BT 100 Tz /F0 9 Tf 47 46 Td [(Ma)12(y)]TJ ET Q
endstream
endobj

xref
0 8
0000000000 65535 f 
0000000009 00000 n 
0000000131 00000 n 
0000000189 00000 n 
0000000302 00000 n 
0000000415 00000 n 
0000000507 00000 n 
0000000611 00000 n 

trailer
<</Size 8/Root 1 0 R>>
startxref
710
%%EOF
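
A text-only PDF like the one above can also be generated programmatically instead of typed by hand. The following is a minimal sketch under stated assumptions, not a drop-in tool: build_text_pdf and its argument layout are hypothetical names for this example, it swaps in the standard Courier font rather than the CourierNew,Bold used above, and it sets text rendering mode 3 (invisible) so the text stays searchable without showing over a scan. It only handles text that fits a single-byte Latin encoding; other scripts would need an embedded font.

```python
def build_text_pdf(pages, font_size=9):
    """Build a text-only PDF as bytes.

    pages: list of (width, height, words), where words is a list of
    (text, x, y) and y is measured from the BOTTOM of the page.
    """
    def esc(s):
        # Backslash and parentheses must be escaped inside PDF strings.
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    objs = {1: b"<</Type/Catalog/Pages 2 0 R>>"}
    kids = " ".join(f"{4 + 2 * i} 0 R" for i in range(len(pages)))
    objs[2] = f"<</Type/Pages/Count {len(pages)}/Kids[{kids}]>>".encode()
    objs[3] = (b"<</Type/Font/Subtype/Type1/BaseFont/Courier"
               b"/Encoding/WinAnsiEncoding>>")

    for i, (width, height, words) in enumerate(pages):
        ops = ["BT", "3 Tr", f"/F0 {font_size} Tf"]  # 3 Tr = invisible text
        for text, x, y in words:
            ops.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm ({esc(text)}) Tj")
        ops.append("ET")
        stream = "\n".join(ops).encode("latin-1", "replace")
        objs[4 + 2 * i] = (
            f"<</Type/Page/Parent 2 0 R/MediaBox[0 0 {width} {height}]"
            f"/Resources<</Font<</F0 3 0 R>>>>/Contents {5 + 2 * i} 0 R>>"
        ).encode()
        objs[5 + 2 * i] = (b"<</Length %d>>stream\n" % len(stream)
                           + stream + b"\nendstream")

    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for num in sorted(objs):
        offsets.append(len(out))  # byte offset of this object for the xref
        out += b"%d 0 obj" % num + objs[num] + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objs) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<</Size %d/Root 1 0 R>>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objs) + 1, xref_pos))
    return bytes(out)
```

Calling build_text_pdf([(595, 841, [("April", 47, 790)]), (595, 841, [("May", 47, 786)])]) and writing the returned bytes to a file in binary mode gives a two-page PDF comparable to the hand-written one. Computing the offsets in code avoids having to recount the xref table whenever the text changes.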

Now that you have the textual PDF, you can use any command-line tool to overstamp the text with the source images, and the result should be better than, though not as simple as, any hOCR-based method.

However, by far the simplest approach is in-page OCR. If we take the hOCR manual, a PDF that has no text layer, and run it through a single cross-platform command, we get a good result.

Here I simply run:

mutool draw -F ocr.pdf  -o ocred.pdf "HOCR format.pdf"

And everything is done without writing multiple additional instructions. By default, that massive 9.83 MB manual is converted into a much more compact 1.83 MB file with 15 searchable pages.

Note that the method may seem slow, but it is performing an enormous number of steps: it scans each pixel band for single letters, searches a dictionary to group those letters into words, and then binds the words into lines. That takes time in any library that has to scan an image for text shapes.

There is little text, as it is fairly sparse: 34.0 KB spread over 15 pages. As with all OCR, you need to run a local-language spelling and grammar check on the result. Objects like bullet points are a common problem.

Using Ghostscript often produces better results (although both MuPDF and Ghostscript are Artifex sister products). The command is different; here is the colour variant.

gs -sDEVICE=pdfocr24 -o ocr24.pdf "HOCR Format.pdf"

The quality of output is higher and thus the file is larger (7.14 MB) but still smaller than the source (9.83 MB).

Upvotes: 0
