Renier

Reputation: 1830

When using Tess4j to read a PDF image, only the first heading line is returned as a string; the rest of the image is ignored

I am using Java with Tess4j-5.13.0.jar to read a PDF containing a table-like image. It's my first time using Tess4j/Tesseract.

Tess4j is located here : https://github.com/nguyenq/tess4j

The pdf I am trying to convert : https://drive.google.com/file/d/1sd64gFL0A4nHAJmiekkmEwvpC2tCsLNT/view?usp=sharing

The problem is when the pdf image is processed it only returns the first heading line and the rest is ignored.

The PDF contains one image that looks like a table with a heading. The heading is returned, but the rest of the table is ignored. One extra string is also returned, and I do not know where it comes from: "-ma_———"

This is my code that I used.

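import java.io.File;
import java.io.IOException;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// Class name is a placeholder; the class declaration was not included in the original snippet.
public class Tess4jPdfTest {

    public static void main(String[] args) throws IOException, TesseractException {
        File imageFile = new File("C:/Users/DFDS_Y1_2025.pdf");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setDatapath("C:/Users/Tess4J/tessdata");
        instance.setLanguage("eng");

        // Earlier attempt, currently disabled: write a PDF rendering of the results.
        //List<RenderedFormat> renderFormats = new ArrayList<RenderedFormat>();
        //renderFormats.add(RenderedFormat.PDF);
        //instance.createDocumentsWithResults(imageFile,null,"C:/Users/DFDS_Y1_2025_out2", renderFormats, TessPageIteratorLevel.RIL_BLOCK);

        try {
            // Run OCR on the PDF and print the recognized text.
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.out.println("ERROR");
            System.err.println(e.getMessage());
        }
    }
}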

The result that gets printed to the console is:

Destination Rate O-1OT Rate 10.01-17T Full rate

-ma_———

So it's the heading plus, for some reason, the string -ma_——— as well.

I was expecting all the other rows of data to be returned.

I have tried extracting the image from the PDF first, converting it to grayscale, and then passing the image file as input instead of the PDF, but I got the same result. I went through the online examples and their code is similar to mine, so I can't see what I have to do to get the rest of the data.
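For reference, a minimal sketch of that kind of grayscale conversion using only the standard `java.awt` and `javax.imageio` classes. The class name `GrayscaleSketch` and the command-line file arguments are assumptions for illustration, not the exact code that was run:

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper class; not part of Tess4j.
public class GrayscaleSketch {

    /** Returns a grayscale copy of the given image; the source image is not modified. */
    public static BufferedImage toGrayscale(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into a TYPE_BYTE_GRAY image converts the colors to gray levels.
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: GrayscaleSketch <input-image> <output.png>");
            return;
        }
        BufferedImage source = ImageIO.read(new File(args[0]));
        if (source == null) {
            // ImageIO.read returns null when no registered reader understands the file.
            throw new IOException("Unsupported image format: " + args[0]);
        }
        ImageIO.write(toGrayscale(source), "png", new File(args[1]));
    }
}
```

The resulting PNG file (or the `BufferedImage` itself) can then be handed to `doOCR` in place of the PDF.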

I am using Eclipse; the console output from running the code was attached as a screenshot.

I know this can be done with Tesseract because I tested it with Scribe (https://scribeocr.com/), a Tesseract-based UI listed here: https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html

When the pdf is uploaded to scribe it gets all the text data in the image.

I am not sure what I am doing wrong; the PDF is clear and should work. Should the image or PDF be preprocessed, or am I doing something else wrong?
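One preprocessing step that is often suggested for Tesseract input is enlarging the image so that small text covers more pixels. Below is a minimal sketch of such an upscaling step using only standard Java classes. The class name `UpscaleSketch` and the choice of scale factor are assumptions, and whether this changes the result for this particular PDF is untested:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper class; not part of Tess4j.
public class UpscaleSketch {

    /** Returns a copy of the image enlarged by an integer factor using bicubic interpolation. */
    public static BufferedImage upscale(BufferedImage source, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int width = source.getWidth() * factor;
        int height = source.getHeight() * factor;
        BufferedImage scaled = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            // Bicubic interpolation gives smoother glyph edges than nearest-neighbor scaling.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }
}
```

A call such as `UpscaleSketch.upscale(image, 2)` would double both dimensions before the image is passed to OCR; the factor of 2 is only an example value.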

Please let me know if you need more info.

Any help would be appreciated.

Upvotes: 1

Views: 52

Answers (0)