Reputation: 33
original image
import cv2
import pytesseract
from pytesseract import Output
import matplotlib.pyplot as plt

img = cv2.imread('eng2.png')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['level'])
# draw one rectangle per entry returned by image_to_data
for i in range(n_boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
plt.figure(figsize=(10,10))
plt.imshow(img)
The code above produces this image. The image contains two kinds of boxes: one for each word and one for the whole text. I would like to get the coordinates for the whole text (the sentence on each line, or the whole paragraph).
This is what I have tried:
import numpy as np
import pandas as pd

box = pd.DataFrame(d)  # dict to dataframe
box['text'] = box['text'].replace('', np.nan)  # replace empty values by NaN
box = box.dropna(subset=['text'])  # delete rows with NaN
print(box)

def lineup(boxes):
    linebox = None
    for _, box in boxes.iterrows():
        if linebox is None:
            linebox = box  # first line begins
        elif box.top <= linebox.top + linebox.height:  # box in same line
            linebox.top = min(linebox.top, box.top)
            linebox.width = box.left + box.width - linebox.left
            linebox.height = max(linebox.top + linebox.height, box.top + box.height) - linebox.top
            linebox.text += ' ' + box.text
        else:  # box in new line
            yield linebox
            linebox = box  # new line begins
    yield linebox  # return last line

lineboxes = pd.DataFrame.from_records(lineup(box))
Output dataframe
n_boxes = len(lineboxes['level'])
for i in range(n_boxes):
    (x, y, w, h) = (lineboxes['left'][i], lineboxes['top'][i], lineboxes['width'][i], lineboxes['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
plt.figure(figsize=(10,10))
plt.imshow(img)
There seems to be no difference between the original boxes and the boxes drawn after joining the coordinates.
How can I get the coordinates of the whole text (the sentence on each line, or the whole paragraph) using the pytesseract library?
Upvotes: 1
Views: 3442
Reputation: 21203
You faced a similar issue in one of your previous questions linked here. I failed to elaborate what I meant in the comments. Here is a more visual explanation.
By horizontal kernel I meant an array with a single row, such as [1, 1, 1, 1, 1]. The number of columns can be chosen based on the font size and the spacing between characters and words. Applying a morphological dilation with this kernel connects entities that lie next to each other horizontally into a single entity.
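To make the effect of the kernel width concrete, here is a toy, pure-Python sketch of dilating a single binary row. It is only an illustration of the idea, not how OpenCV implements cv2.dilate; the helper names `dilate_row` and `runs` and the sample row are made up for this example.

```python
def dilate_row(row, kernel_length):
    """Dilate one binary row with a 1 x kernel_length kernel of ones.

    A pixel becomes 1 if any pixel inside the window centred on it is 1.
    Toy illustration only (hypothetical helper, not an OpenCV function).
    """
    anchor = kernel_length // 2
    n = len(row)
    out = []
    for i in range(n):
        lo = max(0, i - anchor)
        hi = min(n, i - anchor + kernel_length)
        out.append(1 if any(row[lo:hi]) else 0)
    return out


def runs(row):
    """Return (start, end) spans of consecutive 1s, end exclusive."""
    spans, start = [], None
    for i, v in enumerate(row):
        if v and start is None:
            start = i
        elif not v and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(row)))
    return spans


# Three "characters": two close together, one further away (made-up data).
row = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
dilated = dilate_row(row, kernel_length=5)

print(runs(row))      # [(2, 4), (6, 8), (15, 17)]
print(runs(dilated))  # [(0, 10), (13, 19)]
```

The two-pixel gap is bridged by the 5-wide kernel, while the seven-pixel gap is not. In the same way, a kernel about as wide as the largest gap between words on a line merges a whole line into one blob while leaving separate columns apart.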
In your case, we would like to extract each line as an individual entity. Let's go through the code:
Code:
import cv2
import numpy as np

img = cv2.imread('letter.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# inverse binary image, to ensure text region is in white
# because contours are found for objects in white
th = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
The original image is surrounded by a black border, which becomes a white border in th. Since it is unwanted, we remove it using cv2.floodFill():
black = np.zeros([img.shape[0] + 2, img.shape[1] + 2], np.uint8)
mask = cv2.floodFill(th.copy(), black, (0,0), 0, 0, 0, flags=8)[1]
# dilation using horizontal kernel
kernel_length = 30
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1))
dilate = cv2.dilate(mask, horizontal_kernel, iterations=1)
img2 = img.copy()
contours = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# findContours returns 2 values in OpenCV 4 and 3 values in OpenCV 3
contours = contours[0] if len(contours) == 2 else contours[1]
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # draw on the copy so the original image stays untouched
    cv2.rectangle(img2, (x, y), (x + w, y + h), (0, 255, 0), 2)
You can get the coordinates of each line from cv2.boundingRect(), as can be seen in the image above. Using those coordinates, you can crop each line from the document and feed it to the pytesseract library.
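Since the question also asked for a route that stays within pytesseract, here is a hedged sketch of an alternative. The dictionary returned by image_to_data carries a level field (1 page, 2 block, 3 paragraph, 4 line, 5 word) along with block_num, par_num and line_num, so line or paragraph boxes can be picked out directly instead of merged by hand. The helper names `boxes_at_level` and `line_texts` and the sample dictionary are made up for illustration, and the sketch assumes a single-page image.

```python
def boxes_at_level(data, level=4):
    """Return (left, top, width, height) for every entry at `level`.

    `data` is assumed to be the dict from
    pytesseract.image_to_data(img, output_type=Output.DICT).
    Levels: 1 page, 2 block, 3 paragraph, 4 line, 5 word.
    """
    return [
        (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
        for i, lvl in enumerate(data['level'])
        if lvl == level
    ]


def line_texts(data):
    """Join the words of each line, keyed by (block, paragraph, line).

    Assumes a single page, so page_num is ignored.
    """
    lines = {}
    for i, word in enumerate(data['text']):
        if data['level'][i] == 5 and word.strip():
            key = (data['block_num'][i], data['par_num'][i], data['line_num'][i])
            lines.setdefault(key, []).append(word)
    return [' '.join(words) for _, words in sorted(lines.items())]


# Hand-written sample in the shape of image_to_data output (not real OCR output).
sample = {
    'level':     [1, 2, 3, 4, 5, 5, 4, 5],
    'block_num': [0, 1, 1, 1, 1, 1, 1, 1],
    'par_num':   [0, 0, 1, 1, 1, 1, 1, 1],
    'line_num':  [0, 0, 0, 1, 1, 1, 2, 2],
    'left':      [0, 10, 10, 10, 10, 60, 10, 10],
    'top':       [0, 10, 10, 10, 10, 10, 40, 40],
    'width':     [200, 120, 120, 100, 40, 50, 70, 70],
    'height':    [100, 50, 50, 20, 20, 20, 20, 20],
    'text':      ['', '', '', '', 'Hello', 'world', '', 'again'],
}

print(boxes_at_level(sample, level=4))  # line boxes
print(boxes_at_level(sample, level=3))  # paragraph boxes
print(line_texts(sample))
```

With real data, `boxes_at_level(d, 4)` would replace the loop over every entry in the question's code, so only one rectangle per line is drawn. Note that the rectangles should be drawn on a fresh copy of the image, otherwise the earlier word boxes remain visible.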
Upvotes: 2