Viktor Hedefalk

Reputation: 3914

Reasonable Tesseract OCR settings using Apache Tika…?

I'm using Apache Tika for text extraction and need to handle scanned PDFs. I'm trying Tesseract for the OCR, but I'm having trouble finding a good resource on sensible default settings.

I'm also experiencing what seems like weird post-processing artifacts:

I get this:

"och ptensionskos nader"

from this image:

[image: input]

It really looks as if some post-processing step has moved the t to the beginning of the word and left a blank in its place. I can't see why it would do this unless some post-processing settings are badly off.

These are my basic settings from Apache Tika:

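    import org.apache.tika.parser.ocr.TesseractOCRConfig
    import org.apache.tika.parser.pdf.PDFParserConfig

    val pdfConfig: PDFParserConfig = {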
      val pdfConf = new PDFParserConfig()
      pdfConf.setOcrDPI(150)
      pdfConf.setDetectAngles(false)
      pdfConf.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY)
      pdfConf
    }

    val tesseractOCRConfig: TesseractOCRConfig = {
      val tessConf = new TesseractOCRConfig()
      tessConf.setLanguage("eng+swe")
      tessConf.setEnableImageProcessing(1)
      tessConf.setResize(100) // 100-900 - lower faster.
      // tessConf.setApplyRotation(true)
      tessConf
    }

Any help highly appreciated!

Upvotes: 0

Views: 978

Answers (1)

marek.kapowicki

Reputation: 732

Another important property in the PDF config controls whether images embedded in the PDF are processed or skipped:

    pdfConf.setExtractInlineImages(true) // for scanned PDFs, setting this to false makes no sense

In TesseractOCRConfig, setTimeout() is also useful.
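
As a rough sketch of how these two settings might fit together with the configs from the question: the example below assumes the Tika 1.x API that the question appears to use (method names may differ in other versions). The `OcrSketch` object, the `extractText` helper, and the 120-second timeout are illustrative assumptions, not recommended values.

```scala
import java.io.InputStream

import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext, Parser}
import org.apache.tika.parser.ocr.TesseractOCRConfig
import org.apache.tika.parser.pdf.PDFParserConfig
import org.apache.tika.sax.BodyContentHandler

// Hypothetical helper object, for illustration only.
object OcrSketch {

  def extractText(in: InputStream): String = {
    val pdfConf = new PDFParserConfig()
    // Hand the images embedded in the PDF to the OCR parser.
    pdfConf.setExtractInlineImages(true)
    pdfConf.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY)

    val tessConf = new TesseractOCRConfig()
    tessConf.setLanguage("eng+swe")
    // Maximum time in seconds to wait for Tesseract; 120 is an arbitrary choice.
    tessConf.setTimeout(120)

    val parser = new AutoDetectParser()
    val context = new ParseContext()
    context.set(classOf[PDFParserConfig], pdfConf)
    context.set(classOf[TesseractOCRConfig], tessConf)
    // Register the parser itself so embedded images are parsed recursively.
    context.set(classOf[Parser], parser)

    // A write limit of -1 disables the default character limit on extracted text.
    val handler = new BodyContentHandler(-1)
    parser.parse(in, handler, new Metadata(), context)
    handler.toString
  }
}
```

Calling `OcrSketch.extractText` with an open stream on a scanned PDF should return the OCR text, provided Tesseract is installed and on the path.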

Upvotes: 1
