Gokul NC

Reputation: 1241

How to detect language or script from an input image using Python or Tesseract OCR?

Given an input image which can be in any language or writing system, how do I detect what script the text in the picture uses?

Any Python-based or Tesseract-OCR based solution would be appreciated.


Note that "script" here means a writing system such as Latin, Cyrillic, or Devanagari, used for languages such as English, Russian, and Hindi respectively.

Upvotes: 3

Views: 8665

Answers (1)

Gokul NC

Reputation: 1241

Pre-requisites:

  • Install Tesseract: sudo apt install tesseract-ocr tesseract-ocr-all
  • Install pytesseract: pip install pytesseract

Script-Detection:

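import re

import pytesseract

def detect_image_lang(img_path):
    """Return (script_name, confidence) from Tesseract's OSD output."""
    try:
        # OSD = orientation and script detection; the result is plain text
        osd = pytesseract.image_to_osd(img_path)
        script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
        conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)
        return script, float(conf)
    except Exception:
        # Tesseract failed on the image, or the output did not match the patterns
        return None, 0.0

script_name, confidence = detect_image_lang("image.png")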
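
To make the parsing step easier to follow, here is a minimal sketch that works on the OSD text alone, without calling Tesseract. The sample text only illustrates the "Key: value" layout that the regular expressions above expect, and its values are made up. The helper names parse_osd and script_from_osd and the min_confidence threshold are assumptions for illustration, not part of pytesseract.

```python
# Illustrative OSD text; the layout mirrors what the regexes above expect,
# but the numbers are invented for this example.
SAMPLE_OSD = (
    "Page number: 0\n"
    "Orientation in degrees: 0\n"
    "Rotate: 0\n"
    "Orientation confidence: 15.33\n"
    "Script: Latin\n"
    "Script confidence: 2.00\n"
)

def parse_osd(osd_text):
    """Split 'Key: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields

def script_from_osd(osd_text, min_confidence=1.0):
    """Return (script, confidence); script is None below the threshold.

    min_confidence is an assumed cut-off; tune it for your own images.
    """
    fields = parse_osd(osd_text)
    script = fields.get("Script")
    try:
        conf = float(fields.get("Script confidence", "0"))
    except ValueError:
        conf = 0.0
    if script is None or conf < min_confidence:
        return None, conf
    return script, conf

script_name, confidence = script_from_osd(SAMPLE_OSD)
```

Parsing every field into a dict also gives access to the orientation values, which may help if the image needs rotating before OCR.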

Language-Detection:

After performing OCR (using Tesseract), pass the extracted text through the langdetect library (or any other language-identification library).
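
A rough sketch of that pipeline, assuming pytesseract and langdetect are both installed (pip install langdetect). The function names ocr_text, guess_language and detect_image_language are made up for this example. The OCR and detection steps are passed in as parameters so that either one can be swapped for another library.

```python
def ocr_text(img_path, lang=None):
    """Run Tesseract OCR on an image and return the recognised text."""
    import pytesseract  # imported lazily, only needed when OCR actually runs
    return pytesseract.image_to_string(img_path, lang=lang)

def guess_language(text):
    """Return langdetect's language code for the text, or None on failure."""
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. only digits)
        return None

def detect_image_language(img_path, ocr=ocr_text, guess=guess_language):
    """OCR the image, then guess the language of the extracted text."""
    text = ocr(img_path).strip()
    if not text:
        return None  # nothing was recognised, so there is nothing to classify
    return guess(text)
```

Calling detect_image_language("image.png") should then give a language code such as "en", or None if no text was recognised. OCR quality depends on the traineddata Tesseract uses, so it may help to detect the script first and pass a matching lang value to ocr_text.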

Upvotes: 3
