Ariasa

Reputation: 1979

How to limit the results of recognition?

How can I restrict the results of tess-two (the Tesseract and Leptonica library)?
I want Tesseract to limit its results as follows:

  1. Take only 8 characters, counted from the letter D
  2. Ignore lowercase letters, line breaks, spaces, and symbols
  3. Keep only uppercase letters and numbers

For example:
If the recognition result is "asn*&bhDK 1234 UDaks&%^jdg", the text to keep is "DK1234UD".
In other words, drop lowercase letters, line breaks, spaces, and symbols, and keep only uppercase letters and numbers.

I am using Java.

This is the recognition code:

    TessBaseAPI baseApi = new TessBaseAPI();
    baseApi.setPageSegMode(TessBaseAPI.OEM_TESSERACT_CUBE_COMBINED);
    baseApi.setPageSegMode(PageSegMode.PSM_AUTO_OSD);
    baseApi.setPageSegMode(PageSegMode.PSM_SINGLE_LINE);
    baseApi.setDebug(true);
    baseApi.init(DATA_PATH, lang);
    //setImage
    baseApi.setImage(bmpOtsu);
    //set whitelist
    String whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whitelist);
    //variable for recognizing      
    String recognizedText = baseApi.getUTF8Text();
    String resultTxt = recognizedText;
    baseApi.end();

    if ( lang.equalsIgnoreCase("eng") ) {
        recognizedText = recognizedText.replaceAll("[^A-Z0-9]", " ");
    }

Can somebody tell me how I can do that? What should be added here?

Upvotes: 1

Views: 4134

Answers (2)

Ariasa

Reputation: 1979

Thanks to @Yazan for the answer; it works.
I have extended it a little.
This is my code:

    TessBaseAPI baseApi = new TessBaseAPI();
    // Only the last setPageSegMode() call takes effect, so one call is enough.
    // (The OEM_* constants are engine modes, not page segmentation modes.)
    baseApi.setPageSegMode(PageSegMode.PSM_SINGLE_LINE);
    baseApi.setDebug(true);
    baseApi.init(DATA_PATH, lang);
    // Restrict recognition to uppercase letters and digits
    String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);
    // setImage
    baseApi.setImage(bmpOtsu);
    // Recognize, then remove whitespace (spaces and line breaks) with a regex.
    // The blacklist variable takes literal characters, not a regex, so it is not used here.
    String recognizedText = baseApi.getUTF8Text();
    recognizedText = recognizedText.replaceAll("\\s", "");
    String resultTxt = recognizedText;
    baseApi.end();

    Log.v(TAG, "OCRED TEXT: " + recognizedText);
    if (lang.equalsIgnoreCase("eng")) {
        int get8digits = recognizedText.indexOf("D");
        if (get8digits < 0) {
            // OPSI 3: no "D" found, so keep only uppercase letters and digits
            Log.w(TAG, "OPSI 3" + "\n" + "Values: " + recognizedText);
            recognizedText = recognizedText.replaceAll("[^A-Z0-9]", "");
        } else {
            String loop = recognizedText.substring(get8digits);
            Log.w(TAG, "Length: " + loop.length() + "\n" + "Values: " + loop);
            // OPSI 1: at least 8 characters from "D", take the first 8
            // OPSI 2: fewer than 8 characters, take what is left
            recognizedText = loop.length() >= 8 ? loop.substring(0, 8) : loop;
        }
    }
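
As an alternative to the indexOf/substring branches, the same "8 characters starting at D" rule can be written as a regular expression. This is only a sketch: the class and method names (`CodePattern`, `firstCode`) are made up for illustration, and it assumes the code is always exactly 8 characters long. Unlike the branches above, it returns `null` when fewer than 8 characters follow every "D" instead of returning a shorter string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of tess-two.
final class CodePattern {
    // "D" followed by exactly 7 more uppercase letters or digits (8 characters in total)
    private static final Pattern CODE = Pattern.compile("D[A-Z0-9]{7}");

    private CodePattern() {}

    /** Returns the first 8-character code starting with "D", or null if there is none. */
    static String firstCode(String ocrText) {
        // Remove everything except uppercase letters and digits before matching
        String cleaned = ocrText.replaceAll("[^A-Z0-9]", "");
        Matcher m = CODE.matcher(cleaned);
        return m.find() ? m.group() : null;
    }
}
```

It could be called on the string returned by `getUTF8Text()`, for example `String code = CodePattern.firstCode(recognizedText);`.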

I hope this helps someone.

Upvotes: 2

Yazan

Reputation: 6082

If you use an instance of TessBaseAPI, you can call setVariable() with the constant VAR_CHAR_WHITELIST:

    String whiteList = "ABCD...XYZ1234567890";
    tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);

You can tune the whitelist to your needs. For example, to ignore all letters except D and K, set it to:

    String whiteList = "DK1234567890";
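
If the whitelist changes often, it may be easier to build it from character ranges than to type it out. A minimal sketch in plain Java; `WhitelistBuilder` and `range` are hypothetical names, not part of tess-two:

```java
// Hypothetical helper for assembling a whitelist string from character ranges.
final class WhitelistBuilder {
    private WhitelistBuilder() {}

    /** Returns every character from 'from' to 'to' (inclusive) as one string. */
    static String range(char from, char to) {
        StringBuilder sb = new StringBuilder();
        for (char c = from; c <= to; c++) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

For example, `WhitelistBuilder.range('A', 'Z') + WhitelistBuilder.range('0', '9')` produces the full uppercase-plus-digits whitelist, which can then be passed to setVariable() as above.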

You might still need to do more string manipulation on the result, such as removing letters from the end. Based on your example, you may get a result like this (using the second whiteList):

    DK1234UD

EDIT:

To get the first 8 characters from a result such as "DK123455UD", you can use substring():

    String result = "DK123455UD";
    int pos = result.indexOf("DK");
    String finalResult = result.substring(pos, pos + 8); // "DK123455"
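
Note that substring() throws a StringIndexOutOfBoundsException if indexOf() returns -1 or if fewer than 8 characters follow the match. A guarded version might look like the sketch below; `OcrCodeFilter` and `extractCode` are hypothetical names, and returning an empty string when no "D" is found is an assumption about what you want in that case.

```java
// Hypothetical helper, not part of tess-two.
final class OcrCodeFilter {
    private OcrCodeFilter() {}

    /**
     * Keeps only uppercase letters and digits, then returns up to 8 characters
     * starting at the first "D". Returns "" when there is no "D".
     */
    static String extractCode(String raw) {
        // Drop lowercase letters, spaces, line breaks, and symbols
        String cleaned = raw.replaceAll("[^A-Z0-9]", "");
        int start = cleaned.indexOf('D');
        if (start < 0) {
            return "";
        }
        // Clamp the end index so short results do not throw
        int end = Math.min(start + 8, cleaned.length());
        return cleaned.substring(start, end);
    }
}
```

With the string from the question, `OcrCodeFilter.extractCode("asn*&bhDK 1234 UDaks&%^jdg")` returns "DK1234UD".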

EDIT:
Like this?

    String whitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whitelist);
    // setImage
    baseApi.setImage(bmpOtsu);
    // Run recognition
    String recognizedText = baseApi.getUTF8Text();
    // Take up to 8 characters starting at the first "D" (empty if there is none)
    int get8digits = recognizedText.indexOf("D");
    String resultTxt = "";
    if (get8digits >= 0) {
        int end = Math.min(get8digits + 8, recognizedText.length());
        resultTxt = recognizedText.substring(get8digits, end);
    }

Upvotes: 3
