Felix Yan

Reputation: 15259

Python OCR Module in Linux?

I want to find an easy-to-use OCR Python module for Linux. I found pytesser (http://code.google.com/p/pytesser/), but it contains a .exe executable file.

I tried changing the code to use Wine, and it does work, but it's too slow and really not a good idea.

Are there any Linux alternatives that are as easy to use?

Upvotes: 21

Views: 17019

Answers (5)

Vajk Hermecz

Reputation: 5702

You have a bunch of options here.

One way, as others pointed out, is to use Tesseract. There are a bunch of wrappers for it by now, so the best approach is a quick PyPI search. The most used ones these days are:

Another useful site for finding similar engines is alternative.to. According to them, a few Linux-based systems are:

  • ABBYY
  • Tesseract
  • CuneiForm
  • Ocropus
  • GOCR

Upvotes: 0

Blender

Reputation: 298096

You can just wrap tesseract in a function:

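import os
import subprocess
import tempfile

def ocr(path):
    # Reserve a unique base name; tesseract appends '.txt' to it for its output.
    temp = tempfile.NamedTemporaryFile(delete=False)
    temp.close()

    # Run tesseract and wait for it to finish, swallowing its console output.
    process = subprocess.Popen(['tesseract', path, temp.name], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    process.communicate()

    with open(temp.name + '.txt', 'r') as handle:
        contents = handle.read()

    # Clean up both the output file and the placeholder temp file.
    os.remove(temp.name + '.txt')
    os.remove(temp.name)

    return contents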

If you want document segmentation and more advanced features, try out OCRopus.
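
As a variation on the wrapper above, here is a minimal sketch that skips the temporary file. It assumes your tesseract build accepts `stdout` as the output base (the documented way to print the recognized text instead of writing a `.txt` file) and that the language data named by `-l` is installed. The names `build_tesseract_command` and `ocr_to_string` are made up for this example.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Return the argv list for a tesseract run that prints text to stdout."""
    # "stdout" as the output base asks tesseract to print the recognized text
    # instead of writing <base>.txt; "-l" selects the language data.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_to_string(image_path, lang="eng"):
    """Run tesseract on image_path and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
        check=True,               # raise CalledProcessError on failure
    )
    return result.stdout
```

Calling `ocr_to_string("scan.png")` should then return the text directly. Unlike the version above, a failed tesseract run raises an exception instead of surfacing later as a missing `.txt` file.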

Upvotes: 17

FreeToGo

Reputation: 428

python tesseract

http://code.google.com/p/python-tesseract

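import cv2.cv as cv
import tesseract

# Initialise the engine for English, using "." as the data path.
api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

# Load the image as grayscale and hand it to tesseract.
image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image, api)
text = api.GetUTF8Text()   # recognised text
conf = api.MeanTextConf()  # mean confidence of the recognised text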

Upvotes: 6

Jaime Ivan Cervantes

Reputation: 3697

You should try the excellent scikits.learn libraries for machine learning. You can find two ready-to-run code examples here and here.

Upvotes: 1

Tomato

Reputation: 2177

In addition to Blender's answer, which just runs the Tesseract executable, I would like to add that there are other OCR alternatives that can also be called as an external process.

ABBYY command-line OCR utility: http://ocr4linux.com/en:start

It is not free, so it is worth considering only if Tesseract's accuracy is not good enough for your task, if you need more sophisticated layout analysis, or if you need to export PDF, Word and other file formats.

Update: here's a comparison of ABBYY and Tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

Disclaimer: I work for ABBYY

Upvotes: 11
