Reputation: 719
I'm using the Google Vision API to extract the text from some pictures, however, I have been trying to improve the accuracy (confidence) of the results with no luck.
every time I change the image from the original I lose accuracy in detecting some characters.
I have isolated the issue to have multiple colors for different words with can be seen that words in red for example have incorrect results more often than the other words.
Example:
some variations on the image from gray scale or b&w
What ideas can I try to make this work better, specifically changing the colors of text to a uniform color or just black on a white background since most algorithms expect that?
some ideas I already tried, also some thresholding.
dimg = ImageOps.grayscale(im)
cimg = ImageOps.invert(dimg)
contrast = ImageEnhance.Contrast(dimg)
eimg = contrast.enhance(1)
sharp = ImageEnhance.Sharpness(dimg)
eimg = sharp.enhance(1)
Upvotes: 12
Views: 14204
Reputation: 258
import numpy as np
from skimage.morphology import selem
from skimage.filters import rank, threshold_otsu
from skimage.util import img_as_float
from PIL import ImageGrab
import matplotlib.pyplot as plt
def preprocessing(image, strelem, s0=30, s1=30, p0=.3, p1=1.):
image = rank.mean_bilateral(image, strelem, s0=s0, s1=s1)
condition = (lambda x: x>threshold_otsu(x))(rank.maximum(image, strelem))
normalize_image = rank.autolevel_percentile(image, strelem, p0=p0, p1=p1)
return np.where(condition, normalize_image, 0)
#Grab image from clipboard
image = np.array(ImageGrab.grabclipboard())
sel = selem.disk(4)
a = sum([img_as_float(preprocessing(image[:, :, x], sel, p0=0.3)) for x in range(3)])/3
fig, ax = plt.subplots(1, 2, sharey=True, sharex=True)
ax[0].imshow(image)
ax[1].imshow(rank.autolevel_percentile(a, sel, p0=.4))
This is my code for clearing text from noise and creating uniform brightness for characters. With minor modifications, I used it to solve your problem.
Upvotes: 0
Reputation: 1039
This is not a full solution but it may drive to something better.
By converting your data from BGR (or RGB) to CIE-Lab you can process a grayscale image as the weighted sum of the colour channels a* and b*. This grayscale image will enhance colour regions of the text. But adapting the threshold you can from this grayscale image segment the coloured word in your original image and get the other words from the a L channel thresholding. A bitwise and operator should be enough to merge to two segmentation image.
If you can have an image with a better contrast a very last step could be a filling based on the contours.
For that take a look to RETR_FLOODFILL
of the function 'cv2.findContours'.
Any other hole filing function from other package may also fit for that purpose.
Here is a code that show the first part of my idea.
import cv2
import numpy as np
from matplotlib import pyplot as plt
I = cv2.UMat(cv2.imread('/home/smile/QSKN.png',cv2.IMREAD_ANYCOLOR))
Lab = cv2.cvtColor(I,cv2.COLOR_BGR2Lab)
L,a,b = cv2.split(Lab)
Ig = cv2.addWeighted(cv2.UMat(a),0.5,cv2.UMat(b),0.5,0,dtype=cv2.CV_32F)
Ig = cv2.normalize(Ig,None,0.,255.,cv2.NORM_MINMAX,cv2.CV_8U)
#k = np.ones((3,3),np.float32)
#k[2,2] = 0
#k*=-1
#
#Ig = cv2.filter2D(Ig,cv2.CV_32F,k)
#Ig = cv2.absdiff(Ig,0)
#Ig = cv2.normalize(Ig,None,0.,255.,cv2.NORM_MINMAX,cv2.CV_8U)
_, Ib = cv2.threshold(Ig,0.,255.,cv2.THRESH_OTSU)
_, Lb = cv2.threshold(cv2.UMat(L),0.,255.,cv2.THRESH_OTSU)
_, ax = plt.subplots(2,2)
ax[0,0].imshow(Ig.get(),cmap='gray')
ax[0,1].imshow(L,cmap='gray')
ax[1,0].imshow(Ib.get(),cmap='gray')
ax[1,1].imshow(Lb.get(),cmap='gray')
Upvotes: 0
Reputation: 4236
You tried almost every standard step. I would advise you to try some PIL built-in filters like sharpness filter. Apply sharpness and contrast on the RGB image, then binarise it. Perhaps use Image.split() and Image.merge() to binarise each colour separately and then bring them back together. Or convert your image to YUV and then use just Y channel for further processing. Also, if you do not have a monochrome background consider performing some background substraction.
What tesseract likes when detecting scanned text is removed frames, so you can try to destroy as much of non character space from the image. (You might need to keep the picture size though, so you should replace it with white colour). Tesseract also likes straight lines. So some deskewing might be in order if your text is recorded at an angle. Tesseract also sometimes gives better results if you resize the image to twice its original size.
I suspect that Google Vision uses tesseract, or portions of it, but what other preprocessing it does for you I have no idea. So some of my advices here might actually be implemented already and doing them would be unnecessary and repetitive.
Upvotes: 1
Reputation: 57388
I can only offer a butcher's solution, potentially a nightmare to maintain.
In my own, very limited scenario, it worked like a charm where several other OCR engines either failed or had unacceptable running times.
My prerequisites:
What I did: - I measured the kerning width of each character. I only had A-Za-z0-9 and a bunch of punctuation characters to worry about. - The program would start at position (0,0), measure the average color to determine the color, then access the whole set of bitmaps generated from characters in all available fonts in that color. Then it would determine which rectangle was closest to the corresponding rectangle on the screen, and advance to the next one.
(Months later, requiring more performances, I added a varying probability matrix to test first the most likely characters).
In the end, the resulting C program was able to read the subtitles out of the video stream with 100% accuracy in real time.
Upvotes: 1
Reputation: 116
I need a little more context on this.
The image is far too blurry because of video compression, so even preprocessing the image to improve quality may not get the image quality high enough for accurate OCR. If you are set on OCR, one approach you could try:
Binarize the image to get the non-red text in white and background black as in your binarized image:
from PIL import Image
def binarize_image(im, threshold):
"""Binarize an image."""
image = im.convert('L') # convert image to monochrome
bin_im = image.point(lambda p: p > threshold and 255)
return bin_im
im = Image.open("game_text.JPG")
binarized = binarize_image(im, 100)
Extract only the red text values with a filter, then binarize it:
import cv2
from matplotlib import pyplot as plt
lower = [15, 15, 100]
upper = [50, 60, 200]
lower = np.array(lower, dtype = "uint8")
upper = np.array(upper, dtype = "uint8")
mask = cv2.inRange(im, lower, upper)
red_binarized = cv2.bitwise_and(im, im, mask = mask)
plt.imshow(cv2.cvtColor(red_binarized, cv2.COLOR_BGR2RGB))
plt.show()
However, even with this filtering, it still doesn't extract red well.
Add images obtained in (1.) and (2.).
combined_image = binarized + red_binarized
Upvotes: 0
Reputation: 393
You will need to pre-process the image more than once, and use a bitwise_or
operation to combine the results. To extract the colors, you could use
import cv2
boundaries = [ #BGR colorspace for opencv, *not* RGB
([15, 15, 100], [50, 60, 200]), #red
([85, 30, 2], [220, 90, 50]), #blue
([25, 145, 190], [65, 175, 250]), #yellow
]
for (low, high) in boundaries:
low = np.array(low, dtype = "uint8")
high = np.array(high, dtype = "uint8")
# find the colors within the specified boundaries and apply
# the mask
mask = cv2.inRange(image, low, high)
bitWise = cv2.bitwise_and(image, image, mask=mask)
#now here is the image masked with the specific color boundary...
Once you have the masked image, you can do another bitwise_or
operation on your to-be "final" image, essentially adding this mask to it.
but this specific implementation requires opencv, however the same principle applies for other image packages.
Upvotes: 0