Reputation: 3914
I'm using Apache Tika for text extraction and need to handle scanned PDFs (pages that are just images). I'm trying Tesseract for the OCR, but I'm having trouble finding a good resource on sensible default settings.
I'm also seeing what look like weird post-processing artifacts.
I get this:
"och ptensionskos nader"
from this image:
It really looks as if some post-processing step moved the t toward the beginning of the word and left a blank in its place (the expected text is "och pensionskostnader"). I can't see why it would do this unless some post-processing setting is badly off.
These are my basic settings from Apache Tika:
import org.apache.tika.parser.ocr.TesseractOCRConfig
import org.apache.tika.parser.pdf.PDFParserConfig

val pdfConfig: PDFParserConfig = {
  val pdfConf = new PDFParserConfig()
  pdfConf.setOcrDPI(150)
  pdfConf.setDetectAngles(false)
  pdfConf.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY)
  pdfConf
}

val tesseractOCRConfig: TesseractOCRConfig = {
  val tessConf = new TesseractOCRConfig()
  tessConf.setLanguage("eng+swe")
  tessConf.setEnableImageProcessing(1)
  tessConf.setResize(100) // 100-900 - lower faster.
  // tessConf.setApplyRotation(true)
  tessConf
}
Any help highly appreciated!
Upvotes: 0
Views: 978
Reputation: 732
Another important property in the PDF config controls whether images embedded in the PDF are processed or skipped:
pdfConf.setExtractInlineImages(true) // for a scanned PDF, setting this to false makes no sense
In TesseractOCRConfig, setTimeout() is also useful.
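For completeness, here is a rough sketch of how the two settings above could be wired into a parse call. It assumes a Tika 1.x release (where setTimeout takes seconds) and a local Tesseract install; the object name OcrSketch, the method extractText, and the timeout value of 300 are made up for illustration and are not recommended defaults.

```scala
import java.nio.file.{Files, Paths}

import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext, Parser}
import org.apache.tika.parser.ocr.TesseractOCRConfig
import org.apache.tika.parser.pdf.PDFParserConfig
import org.apache.tika.sax.BodyContentHandler

// Hypothetical helper, not part of Tika.
object OcrSketch {

  def extractText(path: String): String = {
    val pdfConf = new PDFParserConfig()
    // Hand the images embedded in the PDF to the image/OCR parser.
    pdfConf.setExtractInlineImages(true)

    val tessConf = new TesseractOCRConfig()
    tessConf.setLanguage("eng+swe")
    // Illustrative value: seconds to wait for the Tesseract process.
    tessConf.setTimeout(300)

    val parser = new AutoDetectParser()
    val context = new ParseContext()
    // Both configs are picked up from the ParseContext by their class.
    context.set(classOf[PDFParserConfig], pdfConf)
    context.set(classOf[TesseractOCRConfig], tessConf)
    // Register the parser itself so embedded images are parsed recursively.
    context.set(classOf[Parser], parser)

    // -1 disables the default write limit on the extracted text.
    val handler = new BodyContentHandler(-1)
    val stream = Files.newInputStream(Paths.get(path))
    try {
      parser.parse(stream, handler, new Metadata(), context)
      handler.toString
    } finally {
      stream.close()
    }
  }
}
```

Calling OcrSketch.extractText with the path to a scanned PDF should then return whatever text Tesseract recognises; without the two context.set calls for the configs, the settings are never seen by the parser.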
Upvotes: 1