r_31415

Reputation: 8982

How does OCR work in Google Drive?

I have noticed that Google Drive recognizes text in PDFs (and in other files such as images and text documents). Out of curiosity, I want to know what they did to make img tags selectable and searchable. For instance, when I inspect a Google Drive document in Chrome Developer Tools, each page is an image, but it doesn't behave like an image because the text is selectable. Also, when I zoom in, it seems that another image with a higher resolution is loaded. I think that's the same trick that Scribd is using.

I also read that Google has been improving tesseract-ocr and that the Google Books team helped with the OCR implementation in Google Drive, but I'm not sure what process they use to generate img tags that behave this way.

What is going on behind the scenes?

Thanks!

Upvotes: 3

Views: 3261

Answers (2)

Whippet

Reputation: 322

There are two basic methods used for OCR: matrix matching and feature extraction. Of the two, matrix matching is the simpler and more common.

Matrix matching compares what the OCR scanner sees as a character with a library of character matrices, or templates. When an image matches one of these prescribed matrices of dots within a given level of similarity, the computer labels that image as the corresponding ASCII character.
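As a rough illustration of that idea, here is a minimal Python sketch. The 3x5 templates, the `similarity` and `match_character` names, and the 0.8 threshold are all made up for the example; a real engine would use a much larger template library and a more robust comparison.

```python
# Hypothetical 3x5 binary templates ('1' = ink, '0' = background).
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def similarity(glyph, template):
    """Fraction of pixels on which the glyph and the template agree."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    )
    return same / total


def match_character(glyph, threshold=0.8):
    """Return the best-matching label, or None if nothing is similar enough."""
    best_label, best_score = None, 0.0
    for label, template in TEMPLATES.items():
        score = similarity(glyph, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# A "T" with one noisy pixel flipped (third row) still matches the T template.
noisy_t = ["111", "010", "011", "010", "010"]
print(match_character(noisy_t))  # prints "T"
```

The threshold plays the role of the "given level of similarity": a glyph that resembles no template closely enough is left unlabeled instead of being forced into the nearest class.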

Feature extraction is OCR without strict matching to prescribed templates. Also known as Intelligent Character Recognition (ICR) or Topological Feature Analysis, this method varies by how much "computer intelligence" is applied by the manufacturer. The computer looks for general features such as open areas, closed shapes, diagonal lines, line intersections, etc. This method is much more versatile than matrix matching. Matrix matching works best when the OCR encounters a limited repertoire of type styles, with little or no variation within each style. Where the characters are less predictable, feature extraction (topological analysis) is superior.
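To make "closed shapes" concrete, here is a small sketch of a single topological feature: the number of enclosed holes in a glyph. The `count_holes` function and the 5x5 glyphs are illustrative assumptions, not how any particular OCR product does it; a real system would combine many such features.

```python
def count_holes(glyph):
    """Count background regions fully enclosed by ink ('1' pixels)."""
    rows, cols = len(glyph), len(glyph[0])
    seen = [[False] * cols for _ in range(rows)]

    def flood(r, c):
        # Iterative 4-connected flood fill over background pixels.
        stack = [(r, c)]
        while stack:
            y, x = stack.pop()
            if not (0 <= y < rows and 0 <= x < cols):
                continue
            if seen[y][x] or glyph[y][x] == "1":
                continue
            seen[y][x] = True
            stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])

    # Background reachable from the border is outside the glyph, not a hole.
    for r in range(rows):
        flood(r, 0)
        flood(r, cols - 1)
    for c in range(cols):
        flood(0, c)
        flood(rows - 1, c)

    # Every remaining unvisited background region is an enclosed hole.
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if glyph[r][c] == "0" and not seen[r][c]:
                holes += 1
                flood(r, c)
    return holes


letter_o = ["01110", "10001", "10001", "10001", "01110"]
digit_8 = ["01110", "10001", "01110", "10001", "01110"]
letter_c = ["01110", "10000", "10000", "10000", "01110"]

print(count_holes(letter_o), count_holes(digit_8), count_holes(letter_c))  # prints "1 2 0"
```

Unlike a pixel-by-pixel template comparison, the hole count stays the same when the glyph is drawn thicker, larger, or in a different typeface, which is why features like this cope better with unpredictable characters.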

If you'd like to know more, go to: http://www.dataid.com/aboutocr.htm

Upvotes: 0

user568109

Reputation: 47993

I can't be sure exactly what happens, but here are my findings. If you look at the HTML for the PDF view of a file in your Drive, you will find something like this:

<div id="page-pane" class="">
   <div id=":2h.page.0" class="page-element goog-inline-block" style="width: 820px;">
      <div>
         <div class="highlight-pane"></div>
         <div class="highlight-pane">
            <div class="highlight selection-highlight" style="left: 154px; top: 142px; width: 268px; height: 13px;"></div>
            <div class="highlight selection-highlight" style="left: 105px; top: 164px; width: 73px; height: 14px;"></div>
            <div class="highlight selection-highlight" style="left: 154px; top: 181px; width: 128px; height: 13px;"></div>
         </div>
         <div class="highlight-pane"></div>
         <div class="highlight-pane"></div>
         <img class="page-image" style="width: 800px; height: 1131px; display: none;" src="https://docs.google.com/file/d/0BzxfQAgMGNM6VGg4RFlBZkdoOWM/image?pagenumber=1&amp;w=138" />
         <img class="page-image" style="width: 800px;" src="https://docs.google.com/file/d/0BzxfQAgMGNM6VGg4RFlBZkdoOWM/image?pagenumber=1&amp;w=800" />
         <p id=":2h.a11y.0" class="accessibility-text" tabindex="-1"></p>
      </div>
   </div>

There are four highlight-pane divs and two img elements within :2h.page.0 (page 0 of the PDF). The visible img (w=800) is the page image you mention; the other one (w=138, display: none) is hidden and is presumably a low-resolution preview. These are plain images, no OCR here. The selected text you mention comes from the second highlight-pane, which has divs added to it dynamically when you drag a box over the image. The three divs inside the second highlight-pane correspond to the three lines of selected text.

As far as I can tell, the following happens when you visit the page:

  • You see an image of the page, rendered from the PDF stored in your Drive.
  • You select something on the page by dragging a box over it.
  • The selection triggers JavaScript that runs OCR on the PDF (or looks up OCR output that was already computed).
  • The output of the OCR is added as divs inside the highlight-pane div.
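A minimal sketch of the last two steps, written in Python for brevity even though the real logic would run as JavaScript in the browser. It assumes the OCR output is already available as one bounding box per text line; the `Box` class, the `highlight_divs` function, the drag-box coordinates, and the fourth line box are hypothetical, and this is a guess at the mechanism, not Google's actual code.

```python
from dataclasses import dataclass


@dataclass
class Box:
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self):
        return self.left + self.width

    @property
    def bottom(self):
        return self.top + self.height

    def intersects(self, other):
        """True if the two rectangles overlap."""
        return (
            self.left < other.right
            and other.left < self.right
            and self.top < other.bottom
            and other.top < self.bottom
        )


def highlight_divs(line_boxes, drag_box):
    """Build one highlight div per OCR text line touched by the drag box."""
    return [
        '<div class="highlight selection-highlight" '
        f'style="left: {b.left}px; top: {b.top}px; '
        f'width: {b.width}px; height: {b.height}px;"></div>'
        for b in line_boxes
        if b.intersects(drag_box)
    ]


# The first three boxes reuse the coordinates from the snippet above;
# the fourth is a made-up line that lies below the selection.
ocr_lines = [
    Box(154, 142, 268, 13),
    Box(105, 164, 73, 14),
    Box(154, 181, 128, 13),
    Box(105, 203, 300, 13),
]
drag = Box(100, 140, 350, 60)  # hypothetical drag box

for div in highlight_divs(ocr_lines, drag):
    print(div)
```

With these inputs the sketch emits three divs, matching the three selection-highlight divs in the second highlight-pane. The page image itself is never modified; the selection is just absolutely positioned divs drawn on top of it.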

Upvotes: 3
