Reputation: 13170
I have a bunch of pre-processed tables that look similar to this one:
After playing for a while with the parameters, I have found that this command gives me decent results:
tesseract my_img.png out -c tessedit_char_whitelist="0123456789.E%-" --psm 6
Unfortunately, this is not good enough for my needs. Note how some columns are not separated in the output, and some minus signs are missing.
What can I do to improve the results?
0.015 1.0010.623 0.09911.850.0272 0.1% 4.0 0.03%
0.020 1.0030.304 0.3211404-0.2144 0.0% 4.0 0.02%
0.030 1.0080.370 0.26214.040.1718 0.1% 3.0 0.06%
0.040 1.0170.393 0.23814.150.1412 0.2% 0.5 0.10%
0.050 1.0300.408 0.22813.76-0.1346 0.4% 0.5 0.17%
0.060 1.9031.408 0.08518.32-0.0988 15.2% 40. 7.47%
0.080 1.7390.609 0.23516.120.2033 2.2% 35. 1.23%
0.1001.6480.242-0.00619.35 0.0590 0.4% 0.5 0.17%
0.150 1.4330.076 0.62913.32-0.3336 1.5% 2 0.75%
0.2001.4880.148 0.47913.91-0.2602 2.2% 0.5 0.96%
0.3001.664-0.303 0.31614.000.2044 2.8% 0.5 1.25%
0.400-1.883.-0.408 0.24213.70-0.1576 3.0% 0.5 1.40%
0.5002.022-0.516 0.18613.77.-0.1282 3.6% 0.5 1.60%
0.6001.9750.625 0.13413.80-0.0948 3.0% 0.5 1.38%
0.8002.0540.709 0.10113.64-0.0763 2.8% 0.5 1.34%
1.00 2.0250.790 0.07414.55-0.0629 2.6% 0.5 1.28%
1.50 1.8990.912 0.03313.360.0360 1.2% 5 0.72%
2.00 1.7950.889 0.049-13.34-0.0585 2.5% 0.5 1.35%
3.00 1.6250.866 0.06813.44-0.0887 6.3% 0.5 2.67%
4.00 1.4900.854 0.08113.71-0.1057 8.0% 0.5 3.34%
5.00 1.6160.713 0.14514.15--0.1708 7.7% 0.5 4.29%
6.00 1.4820.828 0.10014.26-0.1177 11.6% 0.5 4.23%
8.00 1.4660.820 0.11614.21-0.1362 9.0% 0.5 3.85%
10.00 1.433-0.938 0.08714.14-01117 8.2% 0.5 3.54%
15.00 14151.120 0.06013.92-0.0949 7.0 0.5 3.26%
Upvotes: 3
Views: 2771
Reputation: 13170
I solved the problem by using opencv and pytesseract. My solution was inspired by this answer.
Key to this procedure is to have well pre-processed images. In the image of the original question, you can see a small spot of black pixels just before the last column on the right. Those spots must be cleared out! Since I only had a few tables with that problem, I used GIMP.
Detect the columns in the table, by applying a dilation step to the inverted gray image. By choosing an appropriate number of iterations, the columns take shape, and it is also possible to spot the minus signs.
With cv2.boundingRect(cnt) it is possible to crop out the single columns and save them to disk.
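The column detection could be sketched roughly as follows. This is not my exact script: the Otsu threshold, the 3x3 kernel, the default of 10 iterations and the minimum-height filter (used to drop small blobs such as isolated minus signs) are assumptions that need tuning per image, and the function names and the "col" file prefix are made up for illustration. The imports of cv2 and numpy are done inside the functions only so that the pure helper can be used without OpenCV installed.

```python
def keep_tall_boxes(boxes, image_height, min_ratio=0.5):
    """Keep boxes that span most of the image height, sorted left to right.

    Each box is an (x, y, w, h) tuple as returned by cv2.boundingRect.
    min_ratio is an assumed cut-off, not a value from the original procedure.
    """
    tall = [b for b in boxes if b[3] >= min_ratio * image_height]
    return sorted(tall, key=lambda b: b[0])


def detect_column_boxes(gray, iterations=10):
    """Dilate the inverted gray image until each column becomes one blob."""
    import cv2
    import numpy as np

    inverted = cv2.bitwise_not(gray)
    # Assumption: binarize with Otsu so the background is exactly zero.
    _, binary = cv2.threshold(inverted, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((3, 3), np.uint8)
    dilated = cv2.dilate(binary, kernel, iterations=iterations)
    # [-2] picks the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    boxes = [cv2.boundingRect(cnt) for cnt in contours]
    return keep_tall_boxes(boxes, gray.shape[0])


def save_column_crops(image_path, out_prefix="col"):
    """Crop every detected column and write it to disk as a PNG."""
    import cv2

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    paths = []
    for i, (x, y, w, h) in enumerate(detect_column_boxes(gray)):
        path = f"{out_prefix}_{i}.png"
        cv2.imwrite(path, gray[y:y + h, x:x + w])
        paths.append(path)
    return paths
```

Too few iterations leave a column split into several blobs, too many merge neighbouring columns, so it is worth looking at the dilated image while tuning.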
Apply pytesseract to the different columns, with the same options as in the original question.
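A minimal sketch of this step, assuming the column crops were saved as image files. The helper names are hypothetical; the options are the ones from the command in the question, passed through the config argument of pytesseract.image_to_string. pytesseract and Pillow are imported inside the function so the string helpers work on their own.

```python
WHITELIST = "0123456789.E%-"  # same characters as in the question's command


def build_config(psm=6, whitelist=WHITELIST):
    """Build the Tesseract option string used in the question."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def clean_lines(text):
    """Split OCR output into non-empty, stripped lines (one per table row)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def ocr_column(image_path):
    """Run Tesseract on a single column crop and return its rows as strings."""
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(Image.open(image_path),
                                       config=build_config())
    return clean_lines(text)
```

Because each crop contains a single column, the columns can no longer run into each other in the output the way they did when the whole table was recognized at once.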
To detect the minus signs: I was lucky enough that only the 4th and 6th columns contained minus signs, and my tables had exactly 25 rows (therefore, height of each row = height of the image / 25). So, take those columns and crop each one to a width of, say, 40 px (this is a guess, based on trial and error). The crop should now have a few white rectangles where the minus signs are located. Detect the contours of these rectangles. Compute the centroid of each contour. The y-coordinate of the centroid is used to find the number of the row in which the minus sign is located. Apply corrections (where needed) to the OCR results.
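The row lookup and the correction can be sketched as below. The function names are hypothetical, and the sketch assumes the centroid y-coordinates have already been computed (for example from cv2.moments on each contour) and that the crop spans the full height of the table. It also assumes the contour detection is trusted over the OCR, so a minus sign is removed from rows where no rectangle was found; drop that part if you only want to add missing signs.

```python
def rows_with_minus(centroid_ys, image_height, n_rows=25):
    """Map centroid y-coordinates to 0-based row indices.

    Row height is image_height / n_rows, as described above.
    """
    row_height = image_height / n_rows
    rows = set()
    for y in centroid_ys:
        row = int(y // row_height)
        # Clamp in case a centroid sits exactly on the bottom edge.
        rows.add(min(max(row, 0), n_rows - 1))
    return rows


def apply_minus_signs(values, minus_rows):
    """Give each OCR value exactly one leading '-' iff its row is in minus_rows."""
    corrected = []
    for i, text in enumerate(values):
        bare = text.lstrip("-")
        corrected.append("-" + bare if i in minus_rows else bare)
    return corrected


# Example with made-up numbers: a 1000 px tall crop of a 25-row table.
rows = rows_with_minus([60.0, 999.0], image_height=1000)
fixed = apply_minus_signs(["0.0272", "0.2144", "--0.1708"], {1})
```

Stripping all leading minus signs before re-adding one also cleans up doubled signs such as "--0.1708".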
Combine the different columns into a CSV file.
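For the final step, something along these lines should do, using only the standard csv module. It assumes each column is a list of strings with one entry per table row; the function name is made up for illustration.

```python
import csv
import io


def columns_to_csv(columns, out):
    """Write a list of columns (each a list of strings) to `out` as CSV rows."""
    if not columns:
        return
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        # A length mismatch usually means OCR dropped or split a row.
        raise ValueError("all columns must have the same number of rows")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(zip(*columns))  # transpose columns into rows


# Example: two short columns written to an in-memory buffer.
buf = io.StringIO()
columns_to_csv([["0.015", "0.020"], ["1.001", "1.003"]], buf)
```

The length check is a cheap sanity test: if a column does not have the expected number of rows, it is better to inspect that crop than to write a misaligned file.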
EDIT: with this procedure I got about 98.5% accuracy.
Upvotes: 3