Reputation: 5204
I have an image of a table (originally a pandas DataFrame or an Excel sheet).
I just started using Tesseract, and I'm having trouble converting the image back into a table.
I'm using the following code:
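import cv2
import pytesseract

img_cv = cv2.imread(imagepath)  # imagepath: path to the table image
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; pytesseract expects RGB
print(pytesseract.image_to_string(img_rgb))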
Words and letters are recognized, but the formatting is lost: the text comes out as one jumbled chunk.
'IN ETaat=) Count... Tkr & Exch Market Sales %ReventRelationshi Account %Cost Source As Of Date\n\nCap Surprise Value (Q) As Type\n\n21) Facebook Inc LUIS} las) LOS 516.19B) 0.93%\n\n39) Applied Optoelectro...|US AAOI US 177.83M 1.77% 10.90% 5.20M|\\CAPEX 0.14%|*2019A CF 02/28/2020\n40) Activision Blizzard ...|US ATVI US 46.13B 0.89%, 0.31%) 4.02M|COGS 0.13%|Estimate 12/03/2019\n41) Quanta Computer I... |TW 2382 an 7.93B| -2.73% 0.04% 3.02M/COGS 0.11%|Estimate 07/04/2019\n42) Modern Avenue Gro...|CN 002656 CH) 263.51M| -2.87%| 4.44% 2.60M|\\COGS 0.10%|*2018A CF 04/26/2019\n43) Mellanox Technolog...|IL MLNX US 6.51B| 13.57%| 0.74%) 2.80M|\\COGS (OM O}=1<1 tim [nate] k=) 03/03/2020\n44) O-Net Technologies...|CN 877 ale 463.33M aad 3.11%) 2.49M|CAPEX 0.07%|Estimate 10/30/2019\n45) Adobe Inc US ADBE US 162.75B 0.63%, 0.08% 2.02M|\\SG&A 0.07%|Estimate 06/12/2019\n46) British Land Co PLC...\\|GB BLND LN 5.74B| 10.97% 1.05% 2.12M\\SG&A (OM Oley atin [nat] k=) 11/19/2019\n47) Bel Fuse Inc US BELFA US | 123.22M) -3.66% 1.13% 1.40M/COGS (omer tl at-im [gate] k=) 11/19/2019\n48) Keysight Technolog...|US Nees US 17.99B 3.37%, 0.08% 880.90k/\\COGS (OM Oey a-imeat- 1K) 01/03/2020\n49) BT Group PLC GB BT/A LN 17.00B|} -0.01% 0.01% 631.65k/COGS (om OP2-1) at-1 8 [gate] K=) 01/16/2020\n50) KT Corp KR 030200 KS 5.21B 0.32%, 0.02% 1.07M|SG&A (om OP2-1) at-1 8 [gate] K=) 05/10/2019\n5D Sunny Optical Tech... |CN 2382 ale 18.16B aad 0.04% 425.69k/ COGS (om eM Rati m [nat] -) 08/27/2019\n52) Belden Inc US 131 D1@% US 1.95B 5.68%, 0.04%) 255.50k|COGS (om eM Rati m [nat] -) 11/04/2019\n53) Lattice Semiconduc... |US LSCC US 2.51B 0.24%, 0.18%) 174.54k COGS (om eM Rati m [nat] -) 05/08/2019\n54 Zhen Ding Technolo.../TW 4958 an 3.55B| -0.77%| 0.02%) 184.75k/COGS (om eM Rati m [nat] -) 01/17/2020\n55) Emnet Inc KR 123570 KS 66.79M aid Pa hei) 214.59k|SG&A *2019C3 CF 11/14/2019\n56) Zebra Technologies...|US ZBRA US 10.95B| -0.32% 57.18k\\COGS stim [eat] k=) 02/21/2020'
Is there a way to get it into a proper table format?
Upvotes: 3
Views: 14261
Reputation: 302
The only way to do this properly is to detect the vertical lines in the image and use their coordinates to infer the columns. Parsing the OCR output is a road to nowhere, especially if you are hoping the lines will always be recognized as pipe characters: they won't!
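As a rough illustration of that idea, here is a minimal sketch in plain Python. It assumes a grayscale image given as rows of pixel values (a NumPy array from `cv2.imread(..., cv2.IMREAD_GRAYSCALE)` indexes the same way) and word boxes given as `(left, width, text)` tuples, such as those built from `pytesseract.image_to_data`. The function names and thresholds are hypothetical, and a real pipeline would more likely find the lines with morphological operations or a Hough transform.

```python
from bisect import bisect_right


def find_vertical_lines(gray, dark_below=128, min_fraction=0.8):
    """Return the x-centre of each run of pixel columns that is mostly dark."""
    height, width = len(gray), len(gray[0])
    centres, run_start = [], None
    for x in range(width + 1):  # one extra step closes a run touching the right edge
        is_line = x < width and (
            sum(1 for y in range(height) if gray[y][x] < dark_below) / height
            >= min_fraction
        )
        if is_line and run_start is None:
            run_start = x
        elif not is_line and run_start is not None:
            centres.append((run_start + x - 1) // 2)
            run_start = None
    return centres


def split_row(words, line_xs):
    """Place each (left, width, text) word box into the cell its centre falls in."""
    cells = [[] for _ in range(len(line_xs) + 1)]
    for left, width, text in sorted(words):
        cells[bisect_right(line_xs, left + width / 2)].append(text)
    return [" ".join(cell) for cell in cells]


# Synthetic 10x20 white image with dark vertical lines at x=5 and x=12..13.
gray = [[255] * 20 for _ in range(10)]
for row in gray:
    row[5] = row[12] = row[13] = 0

line_xs = find_vertical_lines(gray)
words = [(0, 4, "39"), (6, 3, "Applied"), (9, 2, "Opto"), (14, 4, "US")]
print(split_row(words, line_xs))
```

Because each word is assigned by the position of its box rather than by any recognized separator character, the result does not depend on whether Tesseract reads a table line as a pipe, a slash, or nothing at all.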
Upvotes: 1
Reputation: 643
In addition to mechanical_meat's answer, you can format the output using the code below.
import cv2
import pytesseract
from pytesseract import Output
import pandas as pd

img = cv2.imread("HZ29h.png")
# enlarge the image: +10% width, +25% height
img = cv2.resize(img, (int(img.shape[1] + (img.shape[1] * .1)),
                       int(img.shape[0] + (img.shape[0] * .25))),
                 interpolation=cv2.INTER_AREA)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

custom_config = r'-l eng --oem 3 --psm 6 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-:.$%./@& *"'
d = pytesseract.image_to_data(img_rgb, config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# clean up blanks
# (conf may come back as a string or a number depending on the pytesseract version)
df1 = df[(pd.to_numeric(df.conf, errors='coerce') != -1) & (df.text != ' ') & (df.text != '')]
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num'] == block]
    sel = curr[curr.text.str.len() > 3]
    # sel = curr
    # average character width in pixels, estimated from the longer words
    char_w = (sel.width / sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left'] / char_w > prev_left + 1:
            added = int((ln['left']) / char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)
Output
IN vaate3 Count... Tkr & Exch Market Sales %ReventRelationshiAccount %Cost Source As Of Date
Cap Surprise Value Q As Type
21 Facebook Inc US FB US 516.19B 0.93%
39 Applied Optoelectro.../US AAOI US 177.83M 1.77% 10.90% 5.20MCAPEX om EE len key el 02/28/2020
40 Activision Blizzard ...US ATVI US 46.13B 0.89% 0.31% 4.02M/COGS 0.13% Estimate 12/03/2019
41 Quanta Computer I... TW 2382 TT 7.93B -2.73% 0.04% 3.02M COGS 0.11% Estimate 07/04/2019
42 Modern Avenue Gro...CN 002656 CH 263.51M -2.87% 4.44% 2.60MCOGS 0.10%*2018A CF 04/26/2019
43 Mellanox Technolog...JIL MLNX US 6.51B 13.57% 0.74% 2.80MCOGS 0.08%/Estimate 03/03/2020
44 O-Net Technologies...CN 877 HK 463.33M -- 3.11% 2.49MCAPEX 0.07%/Estimate 10/30/2019
45 Adobe Inc US ADBE US 162.75B 0.63% 0.08% 2.02M SG&A 0.07%/Estimate 06/12/2019
46 British Land Co PLC...GB BLND- LN 5.74B 10.97% 1.05% 2.12M SG&A 0.06%Estimate 11/19/2019
47 Bel Fuse Inc US BELFA US 123.22M -3.66% 1.13% 1.40MCOGS 0.04%Estimate 11/19/2019
48 Keysight Technolog...US 14s A Obed 17.99B 3.37% 0.08% 880.90k/COGS 0.03%Estimate 01/03/2020
49 BT Group PLC e 33 BT/A LN 17.00B -0.01% 0.01% 631.65k/COGS 0.02% Estimate 01/16/2020
50 KT Corp KR 030200 KS 5.21B 0.32% 0.02% 1.07M/SG&A 0.02% Estimate 05/10/2019
51 Sunny Optical Tech... CN 2382 HK 18.16B -- 0.04% 425.69k/COGS 0.01% Estimate 08/27/2019
52 Belden Inc US BDC US 1.95B 5.68% 0.04% 255.50k/COGS 0.01%/Estimate 11/04/2019
53 Lattice Semiconduc... US LscC US 2.51B 0.24% 0.18% 174.54k/COGS 0.01%/Estimate 05/08/2019
54. Zhen Ding Technolo.... TW 4958 TT 3.55B -0.77% 0.02% 184.75k/COGS 0.01%/Estimate 01/17/2020
55. Emnet Inc KR 123570 KS 66.79M -- 2.78% 214.59k/SG&A *2019C3 CF Wary esenke
56 Zebra Technologies.../US VAs 0a O hs 10.95B -0.32% 57.18k/COGS Estimate 02/21/2020
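If a list of rows is more convenient than a padded string, the same word-level fields can be grouped directly. The sketch below is only an illustration on hand-made data: it assumes a dict of equal-length lists with the `block_num`, `par_num`, `line_num`, `left` and `text` keys that `image_to_data` returns with `Output.DICT`, and the helper name `words_to_rows` is made up. As a simplification, it orders lines by their block/paragraph/line numbers rather than by their `top` coordinate.

```python
from collections import defaultdict


def words_to_rows(d):
    """Group non-blank words by (block, paragraph, line), ordered left to right."""
    rows = defaultdict(list)
    for i, text in enumerate(d["text"]):
        if not text.strip():
            continue  # skip the empty entries Tesseract emits for layout levels
        key = (d["block_num"][i], d["par_num"][i], d["line_num"][i])
        rows[key].append((d["left"][i], text))
    return [[t for _, t in sorted(words)] for _, words in sorted(rows.items())]


# Hand-made sample in the same shape as the image_to_data dict.
sample = {
    "block_num": [1, 1, 1, 1, 1],
    "par_num":   [1, 1, 1, 1, 1],
    "line_num":  [1, 1, 2, 2, 2],
    "left":      [120, 10, 0, 10, 60],
    "text":      ["Inc", "Adobe", "", "KT", "Corp"],
}
for row in words_to_rows(sample):
    print(row)
```

Each inner list is one text line of the image; splitting those lines into table columns still needs position information such as the `left` values.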
Upvotes: 4
Reputation: 169514
The image is horizontally compressed, so resizing it mostly fixes the recognition: I increased the vertical dimension by ~25% and the horizontal dimension by ~10%.
img_resized = cv2.resize(img_cv,
                         (int(img_cv.shape[1] + (img_cv.shape[1] * .1)),
                          int(img_cv.shape[0] + (img_cv.shape[0] * .25))),
                         interpolation=cv2.INTER_AREA)
img_rgb = cv2.cvtColor(img_resized,cv2.COLOR_BGR2RGB)
Result:
In [42]: print(pytesseract.image_to_string(img_rgb))
vente) Count... Tkr & Exch Market Sales %ReventRelationshiAccount %Cost Source As Of Date
Cap Surprise Value (Q) As Type
21) Facebook Inc US FB US 516.19B) 0.93%
39) Applied Optoelectro...|US AAOI US | 177.83M| 1.77%| 10.90% 5.20M|\CAPEX 0.14%|*2019A CF 02/28/2020
40) Activision Blizzard ...|US ATVI US 46.13B) 0.89% 0.31% 4.02M|\COGS 0.13%|/Estimate 12/03/2019
41) Quanta Computer I... |TW 2382 TT 7.93B| -2.73%| 0.04% 3.02M COGS 0.11%|/Estimate 07/04/2019
42) Modern Avenue Gro... |CN 002656 CH! 263.51M -2.87%| 4.44% 2.60M|\COGS 0.10%|*2018A CF 04/26/2019
43) Mellanox Technolog...|IL MLNX US 6.51B) 13.57%, 0.74% 2.80M|COGS 0.08%|/Estimate 03/03/2020
44) O-Net Technologies...|CN 877 HK | 463.33M --| 3.11% 2.49M\CAPEX 0.07%|Estimate 10/30/2019
45) Adobe Inc US ADBE US| 162.75B) 0.63%, 0.08% 2.02M SG&A 0.07%|Estimate 06/12/2019
46) British Land Co PLC...|GB BLND- LN 5.74B) 10.97%, 1.05% 2.12M SG&A 0.06%|Estimate 11/19/2019
47) Bel Fuse Inc US BELFA US | 123.22M -3.66%| 1.13% 1.40M|\COGS 0.04%|Estimate 11/19/2019
48) Keysight Technolog...|US KEYS US 17.99B| 3.37% 0.08% 880.90k|COGS 0.03%|Estimate 01/03/2020
49) BT Group PLC GB BT/A LN 17.00B| -0.01%| 0.01% 631.65k/COGS 0.02%|/Estimate 01/16/2020
50) KT Corp aoe 030200 KS 5.21B) 0.32% 0.02% 1.07M|SG&A 0.02%|/Estimate 05/10/2019
51) Sunny Optical Tech... |CN 2382 HK 18.16B --| 0.04% 425.69k/COGS 0.01%|/Estimate 08/27/2019
52) Belden Inc US BDC US 1.95B) 5.68% 0.04% 255.50k/|COGS 0.01%|/Estimate 11/04/2019
53) Lattice Semiconduc...|US Lscc US 2.51B) 0.24% 0.18% 174.54k|COGS 0.01%|/Estimate 05/08/2019
54) Zhen Ding Technolo..., TW 4958 TT 3.55B) -0.77%| 0.02% 184.75k/COGS 0.01%|/Estimate 01/17/2020
55) Emnet Inc KR 123570 KS| 66.79M --| 2.78% 214.59k/SG&A *2019C3 CF Wary esenke,
56) Zebra Technologies...|US ZBRA US 10.95B) -0.32% 57.18k|COGS Estimate 02/21/2020
To write this to an output file do:
output = pytesseract.image_to_string(img_rgb)
with open('test.csv', 'w') as f:
    f.write(output)
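Note that writing the raw string this way produces a text file with a `.csv` name, not comma-separated columns. If the pipe characters in the output are good enough for your purposes, a minimal sketch like the one below splits each line on them before writing. The helper names are hypothetical, and since the pipes are not recognized consistently (see the rows above where `)` or `,` appears instead), the resulting rows will have uneven numbers of cells.

```python
import csv
import io
import re


def ocr_line_to_cells(line):
    """Split one OCR'd line on pipes, dropping a stray slash right after a pipe."""
    cells = re.split(r"\s*\|[\\/]?\s*", line.strip())
    return [c for c in cells if c]


def ocr_text_to_csv(text):
    """Convert multi-line OCR text to CSV text, one row per non-empty line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for line in text.splitlines():
        if line.strip():
            writer.writerow(ocr_line_to_cells(line))
    return buf.getvalue()


sample = r"44) O-Net Technologies...|CN 877 HK | 463.33M --| 3.11% 2.49M\CAPEX 0.07%|Estimate 10/30/2019"
print(ocr_line_to_cells(sample))
```

To save the result, open the file with `newline=''` and write `ocr_text_to_csv(output)` instead of `output`.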
Upvotes: 3