shantanuo

Reputation: 32304

Reading image for OCR

I followed the handwriting-ocr package and it works with the default test image. But once I change the image, I get an error.

https://github.com/Breta01/handwriting-ocr/blob/master/OCR.ipynb

If I disable this line, the code executes, but the text is not read correctly, for obvious reasons.

crop = page.detection(image)

The details are:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-66-869a5b4b76fb> in <module>()
      1 # Crop image and get bounding boxes
----> 2 crop = page.detection(image)
      3 implt(image)
      4 bBoxes = words.detection(image)

~/SageMaker/handwriting-ocr/ocr/page.py in detection(image)
     17                                    np.ones((5, 11)))    
     18     # Countours
---> 19     pageContour = findPageContours(closedEdges, resize(image))
     20     # Recalculate to original scale
     21     pageContour = pageContour.dot(ratio(image))

~/SageMaker/handwriting-ocr/ocr/page.py in findPageContours(edges, img)
     94 
     95     # Sort corners and offset them
---> 96     pageContour = fourCornersSort(pageContour[:, 0])
     97     return contourOffset(pageContour, (-5, -5))
     98 

~/SageMaker/handwriting-ocr/ocr/page.py in fourCornersSort(pts)
     47 def fourCornersSort(pts):
     48     """ Sort corners: top-left, bot-left, bot-right, top-right"""
---> 49     diff = np.diff(pts, axis=1)
     50     summ = pts.sum(axis=1)
     51     return np.array([pts[np.argmin(summ)],

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/numpy/lib/function_base.py in diff(a, n, axis)
   1922     slice1 = [slice(None)]*nd
   1923     slice2 = [slice(None)]*nd
-> 1924     slice1[axis] = slice(1, None)
   1925     slice2[axis] = slice(None, -1)
   1926     slice1 = tuple(slice1)

IndexError: list assignment index out of range
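The failing call can be reproduced in isolation. A likely cause (an assumption, not confirmed from the repository) is that page detection finds no proper four-corner contour in the new image, so `fourCornersSort` receives a one-dimensional array, and `np.diff(pts, axis=1)` has no second axis to work on:

```python
import numpy as np

# Stand-in for a degenerate contour: pageContour[:, 0] on an (N, 2) array
# yields a 1-D result, and np.diff(..., axis=1) then raises this IndexError
# (newer NumPy raises AxisError, which is a subclass of IndexError).
pts = np.array([10, 10, 200, 200])
try:
    np.diff(pts, axis=1)
except IndexError as e:
    print("reproduced:", e)
```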

I am expecting this to work because I have handwritten documents to import, and most (non-ML) software is not able to read them correctly.


Update:

Let's assume there are 100 employees in a company who will submit handwritten documents. Does that mean I need to collect handwriting samples from all 100 individuals to train the model?


Update 1:

Maybe I have not explained my problem correctly. I have an image:

https://s3.amazonaws.com/todel162/harshad_college_card.jpg

Tesseract OCR fails to read it correctly. As seen in this text file, the name, Standard, and date of birth are missing (and those are the most important fields):

https://s3.amazonaws.com/todel162/college_card_reading.txt

Is there any package (with or without ML) that can read printed and handwritten text from a single document that may be scanned at different resolutions / sizes (by the end users)?

Upvotes: 3

Views: 730

Answers (3)

Him

Reputation: 5549

This is a hard problem, so set your expectations for the effort-to-accuracy ratio accordingly.

That said, though the process has a number of challenges, it is not impossible. Here is one possible solution pipeline:

Challenge 1) Figure out where the DOB, name, etc. fields are in the image.

-- Since the image is taken by users, it may be of different resolution, at a variety of angles, and with a variety of lighting. However, this relationship is more or less captured by an affine transformation and an appropriate color space. What we want is to figure out an affine transformation that maps the user image onto some standard id card image that we have... then we can just use an x,y box to find the locations of the relevant fields.

---- Process: Map the images into a lighting-robust color space. Find the affine transformation that, when applied to the user-taken image, minimizes the distance between the transformed user-image and the standard image. Apply that affine transformation to the user image. This is now in standard format.
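The fitting step above can be sketched with plain NumPy, assuming you already have a few matched landmarks (e.g. the card corners) between the user photo and the standard reference image — in practice you would get the correspondences from feature matching and warp the full image with something like OpenCV. The coordinates below are made up for illustration:

```python
import numpy as np

# Hypothetical matched landmarks: card corners as found in the user photo,
# and where those same corners sit in the standard reference image.
user_pts = np.array([[40.0, 30.0], [480.0, 50.0], [472.5, 275.0], [32.5, 255.0]])
std_pts = np.array([[0.0, 0.0], [400.0, 0.0], [400.0, 250.0], [0.0, 250.0]])

# Least-squares fit of the 3x2 affine map: [x', y'] = [x, y, 1] @ A.
X = np.hstack([user_pts, np.ones((len(user_pts), 1))])
A, *_ = np.linalg.lstsq(X, std_pts, rcond=None)

mapped = X @ A  # user landmarks expressed in the standard frame
```

Once `A` is known, every pixel coordinate of the user image can be mapped into the standard frame, and the hard-coded field boxes apply directly.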

Challenge 2) Apply OCR, but don't take OCR's word for it

-- Machine reading of human handwriting is not a solved problem, i.e. your software WILL have problems, and you will need to account for them. That said, if your software is giving you good enough results, then great, but I suspect you'll want to do some work to check the results.

---- Process: Create a human-labeled validation set, so that you can determine the accuracy, precision, and recall of the whole pipeline. You'll want to know how well your process is performing. Also, you should have a series of sanity checks on the results. For example, a DOB must take the form of a date: if the machine-read version is not in the form of a date, it is wrong and should be added to a queue for human review. Names should match a dictionary of names, etc. The point is that the OCR process WILL NOT be perfect, and you'll need to figure out how to account for that.
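A minimal sketch of the DOB sanity check — the date formats and the OCR strings are assumptions for illustration:

```python
from datetime import datetime

def validate_dob(text, formats=("%d/%m/%Y", "%d-%m-%Y")):
    """Return a parsed date if the OCR output looks like a DOB, else None.
    A None result means the record goes to the human-review queue."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

review_queue = []
# Second string shows typical OCR confusions: l -> 1, O -> 0.
for ocr_text in ["14/06/2004", "l4/O6/2OO4"]:
    if validate_dob(ocr_text) is None:
        review_queue.append(ocr_text)
```

The same pattern extends to the other fields: a name check against a dictionary, a format check for the Standard, and so on.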

Upvotes: 2

Anand C U

Reputation: 915

Since the application you are developing runs OCR on highly structured documents like ID cards, you can read the required fields in two steps.

1) Crop out the regions of the image that matter to you, e.g. DOB, Name, etc. (regions are hard-coded for a given type of document).

2) Use the cropped image to detect the handwritten text.
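Step 1 can be sketched as plain array slicing, assuming the scan has already been normalized to a fixed layout — the box coordinates below are hypothetical:

```python
import numpy as np

# Hypothetical field boxes for one card layout, hard-coded as
# (top, bottom, left, right) pixel coordinates of the standardized scan.
FIELD_BOXES = {
    "name": (60, 100, 120, 400),
    "dob": (140, 180, 120, 300),
}

def crop_fields(image, boxes=FIELD_BOXES):
    """Slice each hard-coded field region out of the card image."""
    return {name: image[t:b, l:r] for name, (t, b, l, r) in boxes.items()}

card = np.zeros((300, 500), dtype=np.uint8)  # stand-in for a scanned card
fields = crop_fields(card)
```

Each crop in `fields` can then be passed to the handwriting recognizer on its own.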

Upvotes: 2

A. STEFANI

Reputation: 6736

I expect the line printed under the handwritten text is the source of the error: it can make the text look like a captcha, and Tesseract will end up detecting letters instead of numbers.

In my opinion, there are two possibilities:

  • Try to pre-process the image (with a color filter) to remove the document's printed underlines from your picture.

  • Crop the image to get the birthday block only, then specify that you are looking only for digits with the tessedit_char_whitelist=0123456789 argument, which gives you a command like this: tesseract birthday_only.png stdout -c tessedit_char_whitelist=0123456789
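The first option (removing the underline) can be approximated without any OCR library. A crude NumPy sketch, assuming the underline is a near-horizontal dark line spanning most of the cropped field:

```python
import numpy as np

def remove_underlines(gray, dark=128, row_frac=0.6):
    """Blank rows that are mostly dark pixels - a crude way to strip a
    printed underline from a cropped field before OCR. `gray` is a 2-D
    uint8 array; assumes the underline spans most of the crop width."""
    out = gray.copy()
    mostly_dark = (gray < dark).mean(axis=1) > row_frac
    out[mostly_dark] = 255
    return out

# Demo: a white crop with one full-width underline and a small ink stroke.
crop = np.full((20, 50), 255, dtype=np.uint8)
crop[15, :] = 0        # printed underline -> removed
crop[5, 10:13] = 0     # handwritten stroke -> kept
cleaned = remove_underlines(crop)
```

For real scans the underline is rarely perfectly horizontal, so morphological operations (e.g. OpenCV's morphologyEx with a wide horizontal kernel) would be more robust.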

Upvotes: 2
