Adriano_jvma

Reputation: 475

Make Tess4J get image from PDF file

How to make Tess4J get image from PDF file?

I've started converting image files to text using OCR (Tess4J). It works fine: I have tested it on an image and the result is great.

File imageFile = new File("D:\\HEAD2.png");
Tesseract instance = Tesseract.getInstance();  // JNA Interface Mapping
// Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping

try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}

But now I'm facing a problem: I need to parse a PDF file that contains images. I don't know how to do this, and I have not found any Tess4J example that works with PDFs.

I tested the following example with Asprise, but I can't find an equivalent example for Tess4J:

import java.awt.image.BufferedImage;
import java.io.File;

import com.asprise.util.pdf.PDFReader;
import com.asprise.util.ocr.OCR;

PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file. 
int pages = reader.getNumberOfPages();

for(int i=0; i < pages; i++) {
   BufferedImage img = reader.getPageAsImage(i);

   // recognizes both characters and barcodes
   String text = new OCR().recognizeAll(img);
   System.out.println("Page " + i + ": " + text); 
}

reader.close(); // finally, close the file.

Upvotes: 0

Views: 6209

Answers (2)

Pegasus

Reputation: 849

Tess4J has a dependency on PDFBox, so you can use that library to render each PDF page to an image and pass it to Tesseract. It could look something like this:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.logging.Level;
import java.util.logging.Logger;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

PDDocument document = PDDocument.load(new File("YOUR_PDF_FILE_PATH"));
PDFRenderer pdfRenderer = new PDFRenderer(document);

ITesseract tesseract = new Tesseract();

tesseract.setDatapath("tessdata");
tesseract.setLanguage("spa");

for (int page = 0; page < document.getNumberOfPages(); page++) {
    BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

    try {
        String str = tesseract.doOCR(bufferedImage);
        System.out.println(str);
    } catch (TesseractException ex) {
        Logger.getLogger(OCR.class.getName()).log(Level.SEVERE, null, ex);
    }
}
document.close();

I'm using Tess4J 4.5 and PDFBox 2.0 here. You can also check https://colwil.com/how-to-extract-text-from-a-scanned-pdf-using-ocr-in-java/.

Upvotes: 1

sschrass

Reputation: 7166

Use PdfUtilities.convertPdf2Png to convert the PDF to PNG images, then run OCR on them as you did before with images.
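
A minimal sketch of that flow, in case it helps: convert the PDF to page images, run OCR on each one in order, and collect the text per page. To keep it runnable without Tess4J on the classpath, the OCR step is passed in as a function and the page files are placeholders. The Tess4J calls appear only in comments and are assumptions about how they would be wired in; `PdfPageOcr`, `ocrPages` and the stand-in OCR function are hypothetical names used for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PdfPageOcr {

    /**
     * Applies the given OCR function to each page image, in order,
     * and returns one text entry per page.
     */
    public static List<String> ocrPages(File[] pageImages, Function<File, String> ocr) {
        List<String> texts = new ArrayList<>();
        for (File page : pageImages) {
            texts.add(ocr.apply(page));
        }
        return texts;
    }

    public static void main(String[] args) {
        // With Tess4J available, the inputs would come from something like
        // (assumed usage, not verified against a specific Tess4J version):
        //   File[] pages = PdfUtilities.convertPdf2Png(new File("my.pdf"));
        //   and an OCR function that wraps instance.doOCR(file),
        //   handling TesseractException inside the lambda.
        File[] pages = { new File("page-1.png"), new File("page-2.png") }; // placeholder files
        Function<File, String> fakeOcr = f -> "text of " + f.getName();   // stand-in for doOCR

        List<String> texts = ocrPages(pages, fakeOcr);
        for (int i = 0; i < texts.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + texts.get(i));
        }
    }
}
```

Passing the OCR step in as a function keeps the page loop separate from the engine, so the same loop should work whether the images come from PdfUtilities or from another renderer.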

Upvotes: 2
