mahen

Reputation: 175

How to encode Japanese text on a Japanese Windows OS?

I am using Tesseract to read Japanese text. I am getting the text below from OCR.

日付 請求書

C++ code

 extern "C" _declspec(dllexport) char* _cdecl Test(char* imagePath)
    {
        char *outText;

        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        // Initialize tesseract-ocr with Japanese ("jpn"), using D:\tessdata as the tessdata path
        if (api->Init("D:\\tessdata", "jpn", tesseract::OcrEngineMode::OEM_TESSERACT_ONLY))
        {
            fprintf(stderr, "Could not initialize tesseract.\n");           
        }

        api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);      
        outText = api->GetUTF8Text();

        return outText;
    }

C# code

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
        public static extern string Test(string imagePath);

        void Tessrect()
        {
            string result = Test("D:\\japan4.png");
            byte[] bytes = System.Text.Encoding.Default.GetBytes(result);
            MessageBox.Show(System.Text.Encoding.UTF8.GetString(bytes));
        }

Input file: (image not included)

The above code works fine on English Windows, but it does not work on Japanese Windows: there it gives the wrong output.

Can anyone guide me on how to get correct output on Japanese Windows?

Upvotes: 0

Views: 431

Answers (3)

tommybee

Reputation: 2579

You have to create an image object from the imagePath first.

In my case, this is done with a well-known library such as OpenCV. Then pass the image to Tesseract with the SetImage function.

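#include <iostream>
#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>

void detectJpn(cv::Mat& img)
{
    // Create Tesseract object
    tesseract::TessBaseAPI *ocr = new tesseract::TessBaseAPI();

    // Init returns non-zero on failure (for example, jpn.traineddata not found)
    if (ocr->Init(NULL, "jpn", tesseract::OEM_TESSERACT_ONLY))
    {
        std::cerr << "Could not initialize tesseract." << std::endl;
        delete ocr;
        return;
    }

    // Set Page segmentation mode to PSM_AUTO (3)
    ocr->SetPageSegMode(tesseract::PSM_AUTO);

    // Pass the raw pixel buffer: width, height, bytes per pixel, bytes per line
    // (channels() and step1() match those units only for 8-bit images)
    ocr->SetImage((uchar*)img.data, img.size().width, img.size().height, img.channels(), img.step1());

    // Run Tesseract OCR on image
    char *outText = ocr->GetUTF8Text();

    // Print recognized text (UTF-8 encoded)
    std::cout << outText << std::endl;

    // Destroy used object and release memory
    delete[] outText;
    ocr->End();
    delete ocr;
}


int main(int argc, char *argv[])
{
    if (argc < 2)
    {
        std::cerr << "Usage: " << argv[0] << " <image path>" << std::endl;
        return 1;
    }

    cv::Mat img = cv::imread(argv[1], cv::IMREAD_UNCHANGED);
    if (img.empty())
    {
        std::cerr << "Could not read image." << std::endl;
        return 1;
    }

    detectJpn(img);

    return 0;
}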


Upvotes: 0

xanatos

Reputation: 111940

The outText seems to be already in UTF-8 format

outText = api->GetUTF8Text();

Now... Returning a byte[] (or similar) from C++ is a pain... Change to:

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern IntPtr Test(string imagePath);

Then take the StringFromNativeUtf8 helper from here (because even converting an IntPtr that points to a UTF-8 C string is a pain... .NET doesn't natively have anything like that):

void Tessrect()
{
    IntPtr result = IntPtr.Zero;
    string result2;

    try
    {
        result = Test("D:\\japan4.png");
        result2 = StringFromNativeUtf8(result);
    }
    finally
    {
        Free(result);
    }

    MessageBox.Show(result2);
}

Then you'll have to free the IntPtr... another pain.

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern void Free(IntPtr ptr);

and

extern "C" _declspec(dllexport) void _cdecl Free(char* ptr)
{
    delete[] ptr;
}
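
The reason for exporting a Free function is ownership: the Tesseract documentation says the buffer returned by GetUTF8Text must be released with delete[], and the managed side cannot do that itself. Below is a minimal sketch of the same allocate-in-native, free-in-native pairing. MakeUtf8Text and FreeUtf8Text are hypothetical names, and MakeUtf8Text is only a stand-in for the Tesseract call so that the example compiles without the library.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for api->GetUTF8Text(): returns a new[]-allocated,
// null-terminated UTF-8 buffer that the caller owns.
char* MakeUtf8Text()
{
    // UTF-8 bytes of "日付" (U+65E5 U+4ED8), plus the terminating '\0'
    static const char kText[] = "\xE6\x97\xA5\xE4\xBB\x98";
    char* copy = new char[sizeof(kText)];
    std::memcpy(copy, kText, sizeof(kText));
    return copy;
}

// Hypothetical counterpart to the exported Free function: the module that
// allocated the buffer with new[] also releases it with delete[].
void FreeUtf8Text(char* ptr)
{
    delete[] ptr; // delete[] on a null pointer is a no-op
}
```

Because delete[] accepts a null pointer, calling the free function from a finally block is safe even when the native call never returned a buffer.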

Upvotes: 1

DannyZB

Reputation: 543

You are passing UTF-8 text through a Windows encoding that is not UTF-8. You need to convert it before displaying it.

This is the line that likely causes the issue, because it uses the default system encoding, which is outside your control:

byte[] bytes = System.Text.Encoding.Default.GetBytes(result);

Did you try using Encoding.UTF8 there instead?

If that alone doesn't work, try changing Encoding.UTF8 to Encoding.Default in the line following as well.
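
To make the encoding problem concrete: GetUTF8Text yields bytes such as E6 97 A5 for 日, and those bytes only become the right character when they are decoded as UTF-8. On a Japanese system the default ANSI code page is typically 932 (Shift-JIS), in which many UTF-8 byte sequences are not valid, so a round trip through Encoding.Default is likely to lose data there. The following is a minimal, portable sketch of UTF-8 to UTF-16 decoding. Utf8ToUtf16 is a hypothetical helper, not part of Tesseract or the code in the question, and it assumes well-formed input with no validation. On Windows, MultiByteToWideChar with CP_UTF8 does the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16.
// Assumption: the input is well-formed UTF-8; malformed bytes are not detected.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 1;

        // The lead byte determines the sequence length and the first payload bits
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }

        // Each continuation byte contributes its low 6 bits
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += len;

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

One possible use, again as an assumption rather than a tested recipe: the DLL could return such a UTF-16 buffer as an IntPtr, and the C# side could read it with Marshal.PtrToStringUni, which avoids the ANSI code page entirely.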

Upvotes: 0
