Reputation: 1979
How can I restrict the results of tess-two (the Tesseract and Leptonica library)?
I want Tesseract to limit its results.
For example:
If the recognition result is "asn*&bhDK 1234 UDaks&%^jdg", I want to keep only "DK1234UD".
So lowercase letters, line breaks, and spaces should be dropped; only uppercase letters and digits should be kept.
I am using Java.
This is the recognition code:
TessBaseAPI baseApi = new TessBaseAPI();
baseApi.setDebug(true);
// Initialize first: settings applied before init() can be lost
baseApi.init(DATA_PATH, lang);
// OEM_TESSERACT_CUBE_COMBINED is an OCR engine mode (an init() argument), not a page
// segmentation mode, and only the last setPageSegMode() call takes effect
baseApi.setPageSegMode(PageSegMode.PSM_SINGLE_LINE);
//setImage
baseApi.setImage(bmpOtsu);
//set whitelist
String whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whitelist);
//variable for recognizing
String recognizedText = baseApi.getUTF8Text();
String resultTxt = recognizedText;
baseApi.end();
if ( lang.equalsIgnoreCase("eng") ) {
recognizedText = recognizedText.replaceAll("[^A-Z0-9]", " ");
}
Can somebody tell me how I can do that? What should be added here?
Upvotes: 1
Views: 4134
Reputation: 1979
Thanks to @Yazan for the answer; it works.
I have also improved on it.
This is my code:
TessBaseAPI baseApi = new TessBaseAPI();
baseApi.setDebug(true);
// Initialize first: settings applied before init() can be lost
baseApi.init(DATA_PATH, lang);
// OEM_TESSERACT_CUBE_COMBINED is an OCR engine mode (an init() argument), not a page
// segmentation mode, and only the last setPageSegMode() call takes effect
baseApi.setPageSegMode(PageSegMode.PSM_SINGLE_LINE);
//set variable
String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
// Note: Tesseract reads the blacklist as literal characters, not as a regex,
// so "\\s" only acts as whitespace in the replaceAll() call below
String blackList = "\\s";
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, blackList);
//setImage
//baseApi.setImage(bmpOtsu, w, h, 8, (Integer) null);
baseApi.setImage(bmpOtsu);
//variable for recognizing
String recognizedText = baseApi.getUTF8Text();
recognizedText = recognizedText.replaceAll(blackList, ""); //remove whitespace
String resultTxt = recognizedText;
baseApi.end();
Log.v(TAG, "OCRED TEXT: " + recognizedText);
if ( lang.equalsIgnoreCase("eng") ) {
    int get8digits = recognizedText.indexOf("D");
    if (get8digits < 0) {
        // No "D" found: indexOf() returned -1, so substring() must not be called
        Log.w(TAG, "OPSI 3"+"\n"+"Length: "+recognizedText.length()+"\n"+"Values: "+recognizedText);
        // Blank out the text, as in the original fallback branch
        recognizedText = recognizedText.replaceAll("[A-Z0-9]"," ");
    } else {
        String loop = recognizedText.substring(get8digits);
        if (loop.length() >= 8) {
            Log.w(TAG, "OPSI 1"+"\n"+"Length: "+loop.length()+"\n"+"Values: "+loop);
            recognizedText = loop.substring(0, 8);
        } else {
            Log.w(TAG, "OPSI 2"+"\n"+"Length: "+loop.length()+"\n"+"Values: "+loop);
            recognizedText = loop;
        }
    }
}
I hope this helps someone.
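As a possible alternative to the index arithmetic above, the same "first D plus up to seven more characters" rule can be sketched with a regular expression. The class name `PlateMatcher`, the method `firstCode`, and the choice to return an empty string when no "D" is present are assumptions made for illustration; the input is assumed to already have whitespace removed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of tess-two.
final class PlateMatcher {
    // A "D" followed by at most 7 more uppercase letters or digits (8 characters in total at most).
    private static final Pattern CODE = Pattern.compile("D[A-Z0-9]{0,7}");

    /** Returns the first match, or "" when the text contains no "D" (assumed fallback). */
    static String firstCode(String text) {
        if (text == null) {
            return "";
        }
        Matcher m = CODE.matcher(text);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstCode("XXDK1234UDZZ")); // prints DK1234UD
        System.out.println(firstCode("ABD12"));        // prints D12
    }
}
```

Because the quantifier is greedy but bounded, a short tail after the "D" is returned as-is, which mirrors the "OPSI 2" branch.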
Upvotes: 2
Reputation: 6082
If you use an instance of TessBaseAPI, you can call setVariable() with the constant VAR_CHAR_WHITELIST:
String whiteList = "ABCD...XYZ1234567890";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST,whiteList);
You can tune the whitelist to your needs. For example, if you want to ignore all letters except D and K, set it to:
String whiteList = "DK1234567890";
You may still need further string manipulation on the result, such as removing extra characters from the end, to arrive at the string from your example:
DK1234UD
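One way to do that post-processing, sketched here in plain Java, is to drop every character that is not an uppercase letter or digit. The class name `OcrFilter` and the method `keepUpperAndDigits` are hypothetical names chosen for illustration and are not part of tess-two; the input string is the one from the question.

```java
// Hypothetical helper, not part of tess-two.
final class OcrFilter {
    /** Keeps only A-Z and 0-9; lowercase letters, spaces, line breaks, and symbols are dropped. */
    static String keepUpperAndDigits(String raw) {
        if (raw == null) {
            return "";
        }
        // Replace with "" (not " ") so that nothing is left in place of the removed characters.
        return raw.replaceAll("[^A-Z0-9]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepUpperAndDigits("asn*&bhDK 1234 UDaks&%^jdg")); // prints DK1234UD
    }
}
```

This only cleans up the returned string. It does not change what Tesseract recognizes, so it is probably best used together with the whitelist rather than instead of it.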
EDIT:
To get the 8-character result from DK123455UD, you can use substring():
String result = "DK123455UD";
int pos = result.indexOf("DK");
String finalResult = result.substring(pos,pos+8);
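Note that indexOf() returns -1 when the marker is missing, and substring() throws if the end index runs past the end of the string. A minimal guarded sketch follows; `CodeExtractor` and `extractFrom` are hypothetical names, and returning an empty string when the prefix is not found is an assumption.

```java
// Hypothetical helper, not part of tess-two.
final class CodeExtractor {
    /**
     * Returns up to `length` characters starting at the first occurrence of `prefix`,
     * or "" when `prefix` does not occur (assumed fallback).
     */
    static String extractFrom(String text, String prefix, int length) {
        if (text == null) {
            return "";
        }
        int pos = text.indexOf(prefix);
        if (pos < 0) {
            return ""; // prefix not found
        }
        // Clamp the end index so short results do not throw.
        int end = Math.min(pos + length, text.length());
        return text.substring(pos, end);
    }

    public static void main(String[] args) {
        System.out.println(extractFrom("DK123455UD", "DK", 8)); // prints DK123455
        System.out.println(extractFrom("XXDK12", "DK", 8));     // prints DK12
    }
}
```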
EDIT:
Like this?
String whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whitelist);
//setImage
baseApi.setImage(bmpOtsu);
//variable for recognizing
String recognizedText = baseApi.getUTF8Text();
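// indexOf() returns -1 when no "D" is found, and the text may be shorter than 8 characters
int get8digits = recognizedText.indexOf("D");
// Assumption: fall back to an empty string when no "D" is present
String resultTxt = get8digits < 0 ? "" : recognizedText.substring(get8digits, Math.min(get8digits + 8, recognizedText.length()));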
Upvotes: 3