Freddy

Reputation: 521

Tesseract OCR loading a language - Japanese

I just installed Tesseract OCR. Running tesseract --list-langs shows only two languages, eng and osd. How do I load another language, specifically Japanese?

Upvotes: 9

Views: 18148

Answers (4)

Ariq Athallah

Reputation: 66

On a Mac, if you installed Tesseract with Homebrew:

  1. Go to https://github.com/tesseract-ocr/tessdata and download Japanese.traineddata from https://github.com/tesseract-ocr/tessdata/blob/main/script/Japanese.traineddata
  2. Put the file in /opt/homebrew/Cellar/tesseract/share/tessdata/.
  3. The language code is the file name without the .traineddata extension, so this file is available as "Japanese".
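
The steps above can be sketched in Python. One detail that may trip people up: the /blob/ link in step 1 opens GitHub's HTML preview page rather than the file itself, so this sketch rewrites it to the /raw/ form before downloading. The helper names (blob_to_raw, language_code, install_traineddata) are made up for illustration, and tessdata_dir is whatever tessdata folder your install actually uses.

```python
import os
import urllib.request

BLOB_URL = "https://github.com/tesseract-ocr/tessdata/blob/main/script/Japanese.traineddata"


def blob_to_raw(url):
    """Turn a GitHub 'blob' page URL into a direct-download 'raw' URL."""
    return url.replace("/blob/", "/raw/", 1)


def language_code(url):
    """Tesseract uses the file name without '.traineddata' as the language code."""
    name = url.rsplit("/", 1)[-1]
    suffix = ".traineddata"
    return name[: -len(suffix)] if name.endswith(suffix) else name


def install_traineddata(url, tessdata_dir):
    """Download the trained data into tessdata_dir and return the saved path."""
    dest = os.path.join(tessdata_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(blob_to_raw(url), dest)
    return dest
```

Calling install_traineddata(BLOB_URL, tessdata_dir) with the directory from step 2 should then make the language selectable as -l Japanese, assuming that directory is the one your tesseract binary reads.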

Upvotes: 1

Amir

Reputation: 141

1. pip install pytesseract

2. On Windows, install tesseract-ocr from
https://digi.bib.uni-mannheim.de/tesseract
and select all language options during installation.

3. Set the tesseract-ocr path in anaconda/lib/site-packages/pytesseract/pytesseract.py:

tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

4. from pytesseract import image_to_string
# test_file is the image to read (a file path or a PIL image)
print(image_to_string(test_file, lang='jpn'))  # Japanese text extraction
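
Step 3 edits a file inside site-packages, which a package upgrade would overwrite. As an alternative, pytesseract exposes the same setting as pytesseract.pytesseract.tesseract_cmd, so it can be set from your own script. A minimal sketch, assuming the install path from step 3; find_tesseract and ocr_japanese are hypothetical helper names:

```python
import os
import shutil

# Assumed install location, taken from step 3; adjust to your machine.
DEFAULT_CANDIDATES = [r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"]


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return the first candidate that exists, else whatever is on PATH (or None)."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("tesseract")


def ocr_japanese(image_path, tesseract_path):
    """Run OCR with the jpn model; requires pytesseract to be installed."""
    import pytesseract  # imported lazily so find_tesseract works without it

    # Point pytesseract at the executable without editing the library file.
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    return pytesseract.image_to_string(image_path, lang="jpn")
```

Usage would look like print(ocr_japanese(test_file, find_tesseract())), with test_file being the same image as in step 4.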

Upvotes: 2

Harald

Reputation: 31

This works for me:

sudo apt-get install tesseract-ocr-jpn

Hope this helps.

Upvotes: 3

Freddy

Reputation: 521

I learned that you can grab the trained data from https://github.com/tesseract-ocr/tessdata and place it in the same directory as the other trained data, i.e., next to eng.traineddata. Then pass the language flag -l LANG, and tesseract should be able to read the language you specified. For example, for Japanese: tesseract -l jpn sample-jpn.png output-jpn
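
To confirm the new language was picked up before running OCR, the same two commands can be driven from Python. This is only a sketch: it assumes the first line of the --list-langs output is a header (as in the eng/osd listing from the question), and parse_langs, installed_langs and run_ocr are hypothetical names.

```python
import subprocess


def parse_langs(output):
    """Parse `tesseract --list-langs` output, assuming the first line is a header."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


def installed_langs():
    """Ask the tesseract binary on PATH which languages it can load."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some builds print the list to stderr, so fall back to it if stdout is empty.
    return parse_langs(result.stdout or result.stderr)


def run_ocr(image, output_base, lang):
    """Same as the command above, with the -l option placed after the file names."""
    subprocess.run(["tesseract", image, output_base, "-l", lang], check=True)
```

For instance, checking "jpn" in installed_langs() before calling run_ocr("sample-jpn.png", "output-jpn", "jpn") gives a clearer error than letting tesseract fail on a missing traineddata file.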

Upvotes: 5
