Reasonable Tesseract OCR settings using Apache Tika…?

Question

I'm using Apache Tika for text extraction and need to handle scanned PDFs, i.e. pages that are just images. I'm trying Tesseract for the OCR, but I can't find any good resource on reasonable default settings.

I'm also seeing what look like odd post-processing artifacts:

I get this:

"och ptensionskos nader"

from this image:

The expected text is presumably "och pensionskostnader". It looks as if some post-processing step has moved the "t" to the start of the word and left a blank in its place. I can't see why it would do this unless some post-processing settings are badly off.

These are my basic settings from Apache Tika:

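    import org.apache.tika.parser.ocr.TesseractOCRConfig
    import org.apache.tika.parser.pdf.PDFParserConfig

    val pdfConfig: PDFParserConfig = {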
      val pdfConf = new PDFParserConfig()
      pdfConf.setOcrDPI(150)
      pdfConf.setDetectAngles(false)
      pdfConf.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY)
      pdfConf
    }

    val tesseractOCRConfig: TesseractOCRConfig = {
      val tessConf = new TesseractOCRConfig()
      tessConf.setLanguage("eng+swe")
      tessConf.setEnableImageProcessing(1)
      tessConf.setResize(100) // valid range is 100-900; lower is faster
      // tessConf.setApplyRotation(true)
      tessConf
    }

Any help highly appreciated!
