BHARATH

Reputation: 71

Remove color from an image

I want to remove the color from the image below. Because of this color, I am unable to extract the text clearly from the image.

enter image description here

I am using the code below, but I am not getting clear text:

import numpy as np
from PIL import Image

im = Image.open('my_file.tif')
im = im.convert('RGBA')
data = np.array(im)
# just use the rgb values for comparison
rgb = data[:,:,:3]
color = [246, 213, 139]   # Original value
black = [0,0,0, 255]
white = [255,255,255,255]
mask = np.all(rgb == color, axis = -1)
# change all pixels that match color to white
data[mask] = white

# change all pixels that don't match color to black
##data[np.logical_not(mask)] = black
new_im = Image.fromarray(data)
new_im.save('new_file.tif')

and

def black_and_white(input_image_path, output_image_path):
    color_image = Image.open(input_image_path)
    bw = color_image.convert('L')
    bw.save(output_image_path)

Please help me with this...

Image 2:

enter image description here

Upvotes: 2

Views: 4520

Answers (2)

nathancy

Reputation: 46600

I'm assuming you want to extract the quote. To do this, you can threshold the image, dilate it so the characters on each line merge into one wide blob, and then erase every contour whose aspect ratio is too small to be a line of text. Once you have the processed result, you can use an OCR tool such as Pytesseract for text extraction.

enter image description here

Result from OCR

On behalf of the hundreds of ACLU activists who
called on Governor Walker to veto House Bill
156, we are disappointed that he did not put
students or the Constitution first today.”
—Joshua A. Decker
Executive Director

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image and threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Connect text with a horizontal shaped kernel
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (10,3))
dilate = cv2.dilate(thresh, kernel, iterations=3)

# Remove non-text contours using aspect ratio filtering
cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    aspect = w/h
    if aspect < 3:
        cv2.drawContours(thresh, [c], -1, (0,0,0), -1)

# Invert image and OCR
result = 255 - thresh
data = pytesseract.image_to_string(result, lang='eng',config='--psm 6')
print(data)

cv2.imshow('result', result)
cv2.waitKey()
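
To make the aspect-ratio test in the loop above easier to reason about, here is a minimal sketch that applies the same `w/h < 3` rule to plain bounding boxes, without OpenCV. The helper `keep_text_like`, its `min_aspect` parameter, and the sample boxes are hypothetical and only for illustration; the cutoff of 3 is taken from the code above and will likely need tuning for other images.

```python
def keep_text_like(boxes, min_aspect=3.0):
    """Split (x, y, w, h) boxes into (kept, removed) by width/height ratio."""
    kept, removed = [], []
    for x, y, w, h in boxes:
        aspect = w / h  # same ratio the contour loop computes
        if aspect < min_aspect:
            removed.append((x, y, w, h))  # too square or tall: treated as non-text
        else:
            kept.append((x, y, w, h))     # wide blob: treated as a line of text
    return kept, removed


# Hypothetical boxes: a dilated line of text and a roughly square logo
boxes = [(10, 10, 200, 20), (10, 50, 40, 40)]
kept, removed = keep_text_like(boxes)
print(kept)     # [(10, 10, 200, 20)]
print(removed)  # [(10, 50, 40, 40)]
```

A short line of text (one or two words) can fall below the cutoff and be erased along with the graphics, so it is worth printing the ratios for your own image before settling on a value.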

Upvotes: 3

Lukashou-AGH

Reputation: 123

Try the OpenCV conversion, but remember to pass it a 3-channel image; otherwise you'll get an error:

gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
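
As a rough illustration of what this conversion computes and why the channel count matters, here is a NumPy-only sketch. `to_gray` is a hypothetical helper, not an OpenCV function. It assumes OpenCV's BGR channel order and uses the standard luma weights (0.299 R, 0.587 G, 0.114 B), so its rounding may differ slightly from what `cv2.cvtColor` returns. It handles only 2-D and 3-channel arrays.

```python
import numpy as np


def to_gray(image):
    """Return a 2-D uint8 grayscale array from a BGR or already-gray array."""
    if image.ndim == 2:
        return image  # already single-channel; nothing to convert
    if image.ndim == 3 and image.shape[2] == 3:
        b, g, r = image[..., 0], image[..., 1], image[..., 2]
        gray = 0.114 * b + 0.587 * g + 0.299 * r  # weighted sum per pixel
        return np.clip(np.rint(gray), 0, 255).astype(np.uint8)
    raise ValueError(f"expected a 2-D or 3-channel array, got shape {image.shape}")


# Hypothetical 1x2 BGR image: one pure-blue pixel and one white pixel
bgr = np.array([[[255, 0, 0], [255, 255, 255]]], dtype=np.uint8)
print(to_gray(bgr).shape)  # (1, 2)
```

Checking `image.ndim` and `image.shape` before calling `cv2.cvtColor` is a cheap way to find out whether the file was already loaded as grayscale.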

Upvotes: 0
