Thijser

Reputation: 2633

How do I improve the number detection for blueprints (OCR)?

I have a number of blueprints on which I would like to detect the numbers so that I can turn them into proper models. For example, I have the following blueprint image and would like to extract all the numbers on it, so I ran the following code:

import pytesseract
from pytesseract import Output
import cv2
import numpy as np

img = cv2.imread('vdb7C.jpg')

# OCR engine mode 2, page segmentation mode 10
custom_config = r'--oem 2 --psm 10'
d = pytesseract.image_to_data(img, config=custom_config, lang='eng', output_type=Output.DICT)

# draw a green box around every detection that is purely numeric
n_boxes = len(d['level'])
for i in range(n_boxes):
    text = d['text'][i]
    print(text + str(text.isdigit()))
    if text.isdigit():
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("output.jpg", img)

This gave me the following result (see the output image). As you can see, it correctly identifies a number of the numbers on the blueprint, but it misses quite a few others and falsely detects a few that aren't really there. I care more about getting all the numbers than about a few false positives, but I would still like to keep those to a minimum, so any suggestions there?

I have already tried thinning operations, re-scaling the images, rotating the images, and smoothing the images, but none of those appear to make much difference. Extreme rescaling (×0.1 or ×10) does change a few things, but any gains made in one part of the image are undone by faults appearing in other parts.

Especially difficult are situations such as on the left building, where numbers lie close to or even overlap lines of the design.
Here are two examples of such situations (see the two problem images).

Also note that font usage is not consistent between images.

It's worth noting that the drawing lines are almost always noticeably thinner than the font used for the numbers, so perhaps something could be done with that?
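For instance, I imagine something along these lines might suppress the thin drawing lines while keeping the thicker digit strokes (the kernel size here is just a guess and would need tuning per drawing):

import cv2

img = cv2.imread('vdb7C.jpg', cv2.IMREAD_GRAYSCALE)

# binarize with the drawing as white-on-black so the morphology acts on the strokes
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# opening erases strokes thinner than the kernel, so thin drawing lines vanish
# while the thicker digit strokes survive; the 3x3 size is an assumption
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
digits_only = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# back to black-on-white before feeding it to tesseract
cleaned = cv2.bitwise_not(digits_only)
cv2.imwrite('cleaned.jpg', cleaned)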

I have also tried using the EAST text detector with the following code:

import cv2
import numpy as np
from imutils.object_detection import non_max_suppression

img = cv2.imread('vdb7C.jpg')

# EAST expects input dimensions that are multiples of 32
W = 5664
H = 4000
dim = (W, H)
img = cv2.resize(img, dim, interpolation=cv2.INTER_AREA)

# minimum probability required to keep a detection
# (the value is assumed; it was not shown in the original snippet)
confidence = 0.5

net = cv2.dnn.readNet("frozen_east_text_detection.pb")
blob = cv2.dnn.blobFromImage(img, 1.0, (W, H),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
(scores, geometry) = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                  "feature_fusion/concat_3"])
(numRows, numCols) = scores.shape[2:4]
rects = []
confidences = []
# loop over the number of rows
for y in range(0, numRows):
    # extract the scores (probabilities), followed by the geometrical
    # data used to derive potential bounding box coordinates that
    # surround text
    scoresData = scores[0, 0, y]

    xData0 = geometry[0, 0, y]
    xData1 = geometry[0, 1, y]
    xData2 = geometry[0, 2, y]
    xData3 = geometry[0, 3, y]
    anglesData = geometry[0, 4, y]
    for x in range(0, numCols):

        if scoresData[x] < confidence:
            continue

        (offsetX, offsetY) = (x * 4.0, y * 4.0)

        angle = anglesData[x]
        cos = np.cos(angle)
        sin = np.sin(angle)

        h = xData0[x] + xData2[x]
        w = xData1[x] + xData3[x]

        endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
        endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
        startX = int(endX - w)
        startY = int(endY - h)

        rects.append((startX, startY, endX, endY))
        confidences.append(scoresData[x])
    boxes = non_max_suppression(np.array(rects), probs=confidences)
    for box in boxes:
        (y,h,x,w) = box
        print(box)
        print(np.shape(img))
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("output.jpg" , img)

However, this causes quite a number of the bounding boxes to fall outside of the image, and in general the bounding boxes seem unrelated to the content, so does anyone know what's going on there? Any suggestions? I have 8,000 images right now and eventually need to process a total of about 400k images.

Upvotes: 0

Views: 1504

Answers (2)

RJ Adriaansen

Reputation: 9639

I suggest using a neural-network-based solution like keras-ocr, which applies CRAFT for text detection and a CRNN for recognition. It does a better job of detecting text that overlaps with the design. This is what I got using it out of the box:

import matplotlib.pyplot as plt
import keras_ocr

# CRAFT-based text detector
detector = keras_ocr.detection.Detector()

image = keras_ocr.tools.read('vdb7C.jpg')
boxes = detector.detect(images=[image])[0]

# draw the detected boxes on the image and show it
canvas = keras_ocr.tools.drawBoxes(image, boxes)
plt.imshow(canvas)

Result: (see the attached output image)
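If you also want the recognized strings and not just the boxes, keras-ocr ships a combined pipeline that chains the CRAFT detector with the CRNN recognizer. This is only a minimal sketch; the digit filter at the end is my own addition for this use case, not part of the out-of-the-box output above.

import keras_ocr

# bundles the CRAFT detector and the CRNN recognizer
pipeline = keras_ocr.pipeline.Pipeline()

image = keras_ocr.tools.read('vdb7C.jpg')
# recognize() returns, per image, a list of (text, box) pairs
predictions = pipeline.recognize([image])[0]

# keep only purely numeric detections
numbers = [(text, box) for text, box in predictions if text.isdigit()]
print(numbers)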

Upvotes: 3

igrinis

Reputation: 13666

Run your tesseract piece of code, but only use results with 3 or more digits. This should provide you with enough good examples of digits. Extract each digit to a separate file and save its position. Now you can go two ways.
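A minimal sketch of that extraction step, built on the question's own pytesseract call (the crop file names are just illustrative):

import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread('vdb7C.jpg')
d = pytesseract.image_to_data(img, lang='eng', output_type=Output.DICT)

positions = []  # (filename, x, y, w, h) for every extracted digit group
for i, text in enumerate(d['text']):
    # keep only confident multi-digit hits, as suggested above
    if text.isdigit() and len(text) >= 3:
        x, y, w, h = d['left'][i], d['top'][i], d['width'][i], d['height'][i]
        crop = img[y:y + h, x:x + w]
        fname = f'digits_{i}.png'
        cv2.imwrite(fname, crop)
        positions.append((fname, x, y, w, h))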

You can go the simple way if you see that the digit fonts are quite similar across images. Then you can create a set of templates for the digits (say 15-30). Remember that you can get the size of the digits for a specific image? Resize your digit templates to the right size and run the most trivial template matching. This will definitely create some false detections (especially for "1"s), and you will have to find a way to reduce their number to an acceptable level.
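A bare-bones template-matching sketch (the template file name, digit height, and matching threshold are placeholders to fill in from your own data):

import cv2
import numpy as np

img = cv2.imread('vdb7C.jpg', cv2.IMREAD_GRAYSCALE)
# one of the digit templates extracted in the first stage (hypothetical file name)
template = cv2.imread('digit_template_7.png', cv2.IMREAD_GRAYSCALE)

# resize the template to the digit height measured for this particular drawing
digit_height = 40  # placeholder value
scale = digit_height / template.shape[0]
template = cv2.resize(template, None, fx=scale, fy=scale)

# normalized cross-correlation; lowering the threshold finds more digits
# at the cost of more false positives
result = cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
threshold = 0.8  # assumed value
ys, xs = np.where(result >= threshold)

h, w = template.shape
for x, y in zip(xs, ys):
    cv2.rectangle(img, (int(x), int(y)), (int(x) + w, int(y) + h), 0, 2)
cv2.imwrite('matches.jpg', img)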

The more complex way is to build a custom CNN detector and train it on your data. From the first stage you will get several hundred examples of digits (and their positions) that you want to detect. You can look at this project or this one as references. Also, this article can provide you with some guidance.
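Purely as an illustration (the input size and architecture below are my own assumptions, not taken from the linked references), a small digit classifier that you could run over candidate crops might look like this:

import tensorflow as tf
from tensorflow.keras import layers, models

# tiny CNN for fixed-size 32x32 grayscale digit crops,
# with an extra "background" class for non-digit crops
model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(11, activation='softmax'),  # 10 digits + background
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_crops, train_labels, epochs=10, validation_split=0.1)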

One more thing that can be useful: your images have lots of long perpendicular lines. If you align them to the axes, you can remove the lines very easily by binarizing the original, shifting the result (right or down) by several pixels, and ANDing the two. This will leave only the long lines. Find their length, and you will be able to remove lines above a certain length from the original image.
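A sketch of that shift-and-AND idea for the horizontal direction (the shift amount and length threshold are assumed values; repeat with a downward shift to catch the vertical lines):

import cv2
import numpy as np

img = cv2.imread('vdb7C.jpg', cv2.IMREAD_GRAYSCALE)
# binarize with the strokes as white so the bitwise operations act on them
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# shift the binary image a few pixels to the right and AND it with itself:
# only strokes longer than the shift in that direction survive,
# which isolates the long horizontal lines
shift = 5  # assumed value
shifted = np.zeros_like(binary)
shifted[:, shift:] = binary[:, :-shift]
long_horizontal = cv2.bitwise_and(binary, shifted)

# keep only components above a certain length and erase them from the original
n, labels, stats, _ = cv2.connectedComponentsWithStats(long_horizontal)
mask = np.zeros_like(binary)
for i in range(1, n):
    if stats[i, cv2.CC_STAT_WIDTH] > 100:  # length threshold, assumed
        mask[labels == i] = 255
cleaned = cv2.bitwise_and(binary, cv2.bitwise_not(mask))
cv2.imwrite('no_lines.jpg', cv2.bitwise_not(cleaned))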

Upvotes: 1
