I am using Java with Tess4J (Tess4j-5.13.0.jar) to read a PDF containing a table-like image. It's my first time using Tess4J/Tesseract.
Tess4J is located here: https://github.com/nguyenq/tess4j
The PDF I am trying to convert: https://drive.google.com/file/d/1sd64gFL0A4nHAJmiekkmEwvpC2tCsLNT/view?usp=sharing
The PDF contains one image that looks like a table with a heading. When the PDF is processed, only the heading line is returned and the rest of the table is ignored.
One extra string is also returned, "-ma_———", and I do not know where it comes from.
This is the code I used:
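import java.io.File;
import java.io.IOException;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// Class name is a placeholder; the class declaration was not part of the pasted snippet.
public class OcrTest {
    public static void main(String[] args) throws IOException, TesseractException {
        // The PDF is passed to Tesseract directly
        File imageFile = new File("C:/Users/DFDS_Y1_2025.pdf");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setDatapath("C:/Users/Tess4J/tessdata"); // folder containing eng.traineddata
        instance.setLanguage("eng");
        //List<RenderedFormat> renderFormats = new ArrayList<RenderedFormat>();
        //renderFormats.add(RenderedFormat.PDF);
        //instance.createDocumentsWithResults(imageFile,null,"C:/Users/DFDS_Y1_2025_out2", renderFormats, TessPageIteratorLevel.RIL_BLOCK);
        try {
            // Run OCR and print the recognised text
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.out.println("ERROR");
            System.err.println(e.getMessage());
        }
    }
}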
The result that gets printed to the console is:
Destination Rate O-1OT Rate 10.01-17T Full rate
-ma_———
So it's the heading plus, for some reason, the string -ma_———
I was expecting all the other rows of data to be returned.
I have also tried extracting the image from the PDF, converting it to grayscale, and using the image file as input instead of the PDF, but I got the same result. I went through the online examples and their code is similar to mine, so I can't see what I have to do to get the rest of the data.
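A grayscale conversion of this kind can be sketched with the standard java.awt classes. This is only a minimal sketch, not the exact code I ran: the class name GrayscalePrep, the helper toGrayscale, and the scale parameter are illustrative, and enlarging the image is an assumption about what might help rather than something I have confirmed.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class GrayscalePrep {

    // Redraws the source image into an 8-bit grayscale buffer.
    // scale = 1 keeps the original size; larger values enlarge the image
    // (hypothetical parameter, added only for experimenting).
    static BufferedImage toGrayscale(BufferedImage src, int scale) {
        int w = src.getWidth() * scale;
        int h = src.getHeight() * scale;
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        // Bicubic interpolation keeps text edges smoother when enlarging
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return gray;
    }

    public static void main(String[] args) {
        // Blank stand-in image; in practice this would be the image extracted from the PDF
        BufferedImage sample = new BufferedImage(40, 20, BufferedImage.TYPE_INT_RGB);
        BufferedImage out = toGrayscale(sample, 2);
        System.out.println(out.getWidth() + "x" + out.getHeight()); // prints 80x40
    }
}
```

The resulting BufferedImage can be passed to instance.doOCR(BufferedImage), which ITesseract provides, instead of writing it back to a file first.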
I am using Eclipse, and this is the console output when I run the code:
I know this can be done using Tesseract, as I tested it with the Scribe UI, which is based on Tesseract and which I found here: https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html. The Scribe site is https://scribeocr.com/
When the PDF is uploaded to Scribe, it extracts all the text data in the image.
I am not sure what I am doing wrong; the PDF is clear and should work. Should the image or PDF be preprocessed first, or is something missing from my code?
Please let me know if you need more info.
Any help would be appreciated.