Reputation: 1391
I have a problem with running tesseract-ocr engine on linux. I've downloaded RUS language data and put it to tessdata directory (/usr/local/share/tessdata). When I'm trying to run tesseract with command tesseract blob.jpg out -l rus
, it displays an error:
Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language eng
Tesseract couldn't load any languages!
Could not initialize tesseract.
According to compiling guide, I used export TESSDATA_PREFIX='/usr/local/share/'
to point my tessdata directory.
Maybe I should edit any config files? Tesseract try to load 'eng' data files instead of 'rus'.
Screenshot: https://i.sstatic.net/I0Guc.png
Upvotes: 138
Views: 238521
Reputation: 1
Just adding some options to the list -- on RHEL and related systems, sudo dnf [or yum] install tesseract
installs tesseract-langpack-eng
by default. It looks like others may have to be downloaded manually.
Upvotes: 0
Reputation: 1675
On macOS, installed via macports:
After installing tesseract run
sudo port install tesseract-osd
to enable the 'osd'. This solved the error for me.
Upvotes: 0
Reputation: 1
For windows , all you need to do is
C:\Userseclipse-workspace\tessdata
)
variable name : TESSDATA_PREFIX
variable value: (eg:C:\Userseclipse-workspace\tessdata
)Upvotes: 0
Reputation: 2360
You can grab eng.traineddata
Github:
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
Check https://github.com/tesseract-ocr/tessdata for a full list of trained language data.
When you grab the file(s), move them to the /usr/local/share/tessdata
folder. Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata
instead.
# If you got the data from Google, unzip it first!
gunzip eng.traineddata.gz
# Move the data
sudo mv -v eng.traineddata /usr/local/share/tessdata/
Upvotes: 140
Reputation: 53
In Google Colab I resolved the issue in this way:
!sudo apt-get install tesseract-ocr-*
Because if you use this command !sudo apt install tesseract-ocr
then it imports 2 languages but when you intend to work on non-English languages then the former command works.
Afterwards, use this command !pip install pytesseract
You can also check languages in this way !tesseract --list-langs
Upvotes: 5
Reputation: 1
**IF you have windows OS then please add your TesseractOCR to system variable. Eg..
thats all...
Upvotes: 0
Reputation: 51
I had the same problem with DEU language on macOS. I could solve it by installing all additional languages like so:
brew install tesseract-lang
as suggested on https://formulae.brew.sh/formula/tesseract
Upvotes: 3
Reputation: 307
For me the problem was in how I downloaded the train data files. Make sure you get the raw link.
Initially I was using:
wget https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata
When I changed it to:
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
It worked
Upvotes: 9
Reputation: 389
As of 2021, My solution for Ubuntu is to download the zip files from https://github.com/tesseract-ocr/tessdata_best/releases/tag/4.1.0
, extract and copy the neccessary .traineddata
files into /usr/local/share/tessdata
. This is the default folder for tesseract 4.1.1 to search for trained data.
Upvotes: 0
Reputation: 2646
For Ubuntu just run the below command and the Environment variable error will disappear.
command:
export TESSDATA_PREFIX=Path_of_your_tessdata_folder
Command Example:
export TESSDATA_PREFIX=/home/amar/Desktop/OCR/tesseract-4.1.1/tessdata
This command will set the tessdata folder's path to the environment variable with name TESSDATA_PREFIX and the above error will be resolved.
Upvotes: 5
Reputation: 1
How I solved the problem in my Manjaro Xfce:
Message “TesseractError: (1, 'Error opening data file /home/julio/snap/tesseract/common/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')”
Then, in my Manjaro, I typed: sudo pacman -S tesseract Then the system installed both the “tesseract” and also a package name “leptonica”
After this step, I thought everything was ok, and tried to run my simple script. However, the error message changed to something like this (it changed the previous “/home” location to other “/usr”-like location): “"Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')"”
Then I realized that there had appeared this message when I installed “tesseract” with pacman: “You must install one of tesseract-data-* packages or whole tesseract-data group”
So, I tried the command: “sudo pacman -S tesseract-data”, and the system presented lots of language options to me. So I’ve chosen some languages, installed as follows, and the module started to work like a charm:
sudo pacman -S tesseract-data-eng
sudo pacman -S tesseract-data-por
sudo pacman -S tesseract-data-fra
sudo pacman -S tesseract-data-spa
I tried some portuguese special characters (like "ão"), that only worked when I used the argument "lang='por'" in the pytesseract.image_to_string(img,lang='por')
Upvotes: 0
Reputation: 11
Add this to your code :
instance.setDatapath("C:\\somepath\\tessdata");
instance.setLanguage("eng");
Upvotes: 1
Reputation: 1
tessdata_dir_config = r'--tessdata-dir "/usr/local/Cellar/tesseract/4.1.1/share/tessdata"'
pytesseract.image_to_string(imgCrop,lang='eng',config=tessdata_dir_config)
Upvotes: 0
Reputation: 952
tesseract --tessdata-dir <tessdata-folder> <image-path> stdout --oem 2 -l <lng>
In my case, the mistakes that I've made or attempts that wasn't a success.
TESSDATA_PREFIX
with above pathsFirst 2 attempts did not worked because, the files from git clone
did not worked for the reasons that I do not know. I am not sure why #3 attempt worked for me.
Finally,
wget
--tessdata-dir
with directory nameTake away for me is to learn the tool well & make use of it, rather than relying on package manager installation & directories
Upvotes: 5
Reputation: 101
C# developer working on Windows here. What works for me is simply download the file eng.traineddata from the following URL:
https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata
and copy it to the following directory in my Console Application project:
[Project Directory]\bin\Debug\tessdata
I did manually create the tessdata folder above.
Upvotes: 1
Reputation: 41
For Windows Users:
In Environment Variables, add a new variable in system variable with name "TESSDATA_PREFIX" and value is "C:\Program Files (x86)\Tesseract-OCR\tessdata"
Upvotes: 4
Reputation: 6106
I'm using windows OS, I tried all solutions above and none of them work.
Finally, I install Tesseract-OCR on D drive(Where I run my python script from) instead of C drive and it works.
So, if you are using windows, run your python script in the same drive as your Tesseract-OCR.
Upvotes: 2
Reputation: 1498
The simpliest way is to install the needed package:
sudo apt-get install tesseract-ocr-eng #for english
sudo apt-get install tesseract-ocr-tam #for tamil
sudo apt-get install tesseract-ocr-deu #for deutsch (German)
As you can notice, it opens the road to others languages (i.e. tesseract-ocr-fra).
Upvotes: 110
Reputation: 1494
You can call tesseract API function from C code:
#include <tesseract/baseapi.h>
#include <tesseract/ocrclass.h>; // ETEXT_DESC
using namespace tesseract;
class TessAPI : public TessBaseAPI {
public:
void PrintRects(int len);
};
...
TessAPI *api = new TessAPI();
int res = api->Init(NULL, "rus");
api->SetAccuracyVSpeed(AVS_MOST_ACCURATE);
api->SetImage(data, w0, h0, bpp, stride);
api->SetRectangle(x0,y0,w0,h0);
char *text;
ETEXT_DESC monitor;
api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
printf("m.count: %s\n", monitor.count);
printf("m.progress: %s\n", monitor.progress);
api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
...
api->End();
And build this code:
g++ -g -I. -I/usr/local/include -o _test test.cpp -ltesseract_api -lfreeimageplus
(i need FreeImage for picture loading)
Upvotes: 2
Reputation: 79
I'm using Visual Studio 2017 Community Edition.
I solved this problem by making a directory called tessdata in the Debug directory of my project. Then I put the eng.traineddata file into said directory.
Upvotes: 1
Reputation: 13083
I had this error too on the Windows machine.
My solution.
1) Download your language files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00
For example, for eng, I downloaded all files with eng prefix.
2) Put them into tessdata directory inside of some folder. Add this folder into System Path variables as TESSDATA_PREFIX.
Result will be System env var: TESSDATA_PREFIX=D:/Java/OCR And OCR folder has tessdata with languages files.
This is a screenshot of the directory:
Upvotes: 48
Reputation: 2017
No previous solution worked for me.
I've installed both by apt-get
and manually downloading the tessdata, moved around /usr
and so on and no one worked even if i exported the variable thousand times.
Finally, on a last try before start to cry i've tried to pass the path directly to the instance of Tesseract().
In Python: tr = Tesseract("/usr/local/share/tesseract-ocr/")
and now it works. To clarify, im using tesserwrap
module.
Upvotes: 5