
Reputation: 175

How to encode Japanese text on a Japanese Windows OS?

I am using Tesseract to read Japanese text. I am getting the text below from OCR.

æ—¥ä»˜ è«‹æ±‚æ›¸

C++ code

extern "C" _declspec(dllexport) char* _cdecl Test(char* imagePath)
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with Japanese, using the tessdata path D:\tessdata
    if (api->Init("D:\\tessdata", "jpn", tesseract::OcrEngineMode::OEM_TESSERACT_ONLY))
    {
        fprintf(stderr, "Could not initialize tesseract.\n");
    }

    api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);
    outText = api->GetUTF8Text();

    return outText;
}

C# code

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern string Test(string imagePath);

void Tessrect()
{
    string result = Test("D:\\japan4.png");
    byte[] bytes = System.Text.Encoding.Default.GetBytes(result);
    MessageBox.Show(System.Text.Encoding.UTF8.GetString(bytes));
}

Input File:

The above code works fine on English Windows, but it does not work on Japanese Windows: it gives the wrong output there.

Can anyone guide me on how to get the correct output on Japanese Windows?

Upvotes: 0

Answers (3)

tommybee

Reputation: 2579

You have to create an image object from imagePath first.

In my case, I do this with a well-known library such as OpenCV. Then pass the image to Tesseract with the SetImage function.

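#include <iostream>
#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>

void detectJpn(cv::Mat& img)
{
    // Create Tesseract object
    tesseract::TessBaseAPI *ocr = new tesseract::TessBaseAPI();

    // Load the Japanese language data; Init returns non-zero on failure
    if (ocr->Init(NULL, "jpn", tesseract::OEM_TESSERACT_ONLY))
    {
        std::cerr << "Could not initialize tesseract." << std::endl;
        delete ocr;
        return;
    }

    // Set Page segmentation mode to PSM_AUTO (3)
    ocr->SetPageSegMode(tesseract::PSM_AUTO);

    // Hand the raw pixels to Tesseract (step1() equals bytes per line for 8-bit images)
    ocr->SetImage((uchar*)img.data, img.size().width, img.size().height, img.channels(), img.step1());

    // Run Tesseract OCR on image
    char *outText = ocr->GetUTF8Text();

    // Print recognized text (UTF-8 bytes)
    std::cout << outText << std::endl;

    // Destroy used object and release memory
    delete[] outText;
    ocr->End();
    delete ocr;
}


int main(int argc, char *argv[])
{
    if (argc < 2)
    {
        std::cerr << "Usage: " << argv[0] << " <image>" << std::endl;
        return 1;
    }

    cv::Mat img = cv::imread(argv[1], cv::IMREAD_UNCHANGED);
    if (img.empty())
    {
        std::cerr << "Could not read image: " << argv[1] << std::endl;
        return 1;
    }

    detectJpn(img);

    return 0;
}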

Upvotes: 0

xanatos

Reputation: 111940

The outText already seems to be UTF-8 encoded:

outText = api->GetUTF8Text();

Now... Returning a byte[] (or similar) from C++ is a pain... Change to:

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern IntPtr Test(string imagePath);

Then take the StringFromNativeUtf8 from here (because even converting an IntPtr that is a UTF-8 C string is a pain... .NET doesn't natively have anything like that):

void Tessrect()
{
    IntPtr result = IntPtr.Zero;
    string result2;

    try
    {
        result = Test("D:\\japan4.png");
        result2 = StringFromNativeUtf8(result);
    }
    finally
    {
        Free(result);
    }

    MessageBox.Show(result2);
}
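
For readers wondering what a helper like StringFromNativeUtf8 has to do, here is a rough C++ sketch of the same idea: walk the NUL-terminated buffer and decode the UTF-8 bytes into UTF-16 code units, which is what a .NET string stores. The name Utf8ToUtf16 is made up for this illustration, and the decoder assumes well-formed input. (Newer .NET versions also ship Marshal.PtrToStringUTF8, which may make a hand-written helper unnecessary.)

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: decode a NUL-terminated UTF-8 buffer into UTF-16.
// Assumes well-formed input; a production decoder must also reject
// truncated, overlong, or otherwise invalid sequences.
std::u16string Utf8ToUtf16(const char* utf8)
{
    assert(utf8 != nullptr);
    std::u16string out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(utf8);
    while (*p != 0)
    {
        std::uint32_t cp;
        int extra; // number of continuation bytes that follow the lead byte
        if (*p < 0x80)      { cp = *p;        extra = 0; } // ASCII
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; } // 2-byte sequence
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; } // 3-byte sequence
        else                { cp = *p & 0x07; extra = 3; } // 4-byte sequence
        ++p;

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (int i = 0; i < extra && *p != 0; ++i, ++p)
            cp = (cp << 6) | (*p & 0x3F);

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Passing the six bytes E6 97 A5 E4 BB 98 (the UTF-8 encoding of 日付) gives the two code units U+65E5 and U+4ED8. No code page is involved at any point, which is why this route behaves the same on English and Japanese Windows.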

Then you'll have to free the IntPtr... another pain.

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern void Free(IntPtr ptr);

and

extern "C" _declspec(dllexport) void _cdecl Free(char* ptr)
{
    delete[] ptr;
}
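
The reason for exporting Free from the DLL is that GetUTF8Text returns a buffer allocated with new[], so it has to be released with delete[] by the same module. A small self-contained sketch of that ownership pattern, without Tesseract, could look like this; MakeUtf8Text and FreeUtf8Text are hypothetical names, with MakeUtf8Text standing in for api->GetUTF8Text():

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for api->GetUTF8Text(): returns a copy of `text`
// in a buffer allocated with new[], which the caller must release.
char* MakeUtf8Text(const char* text)
{
    assert(text != nullptr);
    const std::size_t len = std::strlen(text);
    char* buffer = new char[len + 1];
    std::memcpy(buffer, text, len + 1); // copy including the terminating NUL
    return buffer;
}

// Counterpart of the exported Free: the buffer came from new[],
// so it is released with delete[] on the native side.
void FreeUtf8Text(char* ptr)
{
    delete[] ptr; // delete[] on a null pointer is a no-op
}
```

Because delete[] accepts a null pointer, calling the release function from a finally block is safe even when the native call failed before a pointer was returned.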

Upvotes: 1

DannyZB

Reputation: 543

You are sending UTF-8 text to a Windows system whose default encoding is not UTF-8. You need to convert it before displaying it.

This is the line that likely causes the issue (it uses the default system encoding, which is out of your control): byte[] bytes = System.Text.Encoding.Default.GetBytes(result);

Did you try using Encoding.UTF8 there instead?

If that alone doesn't work, try changing Encoding.UTF8 to Encoding.Default in the following line as well.
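
To illustrate the mismatch: the garbled string in the question is what you get when the UTF-8 bytes of 日付 請求書 are read one byte at a time as Windows-1252, which suggests the OCR output itself is fine and only the decoding step is wrong. The sketch below reproduces that misreading in C++. MisreadAsCp1252 and the other names are made up for this example, and unassigned Windows-1252 slots are simply passed through as their byte value, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Windows-1252 code points for bytes 0x80..0x9F; every other byte maps to
// the code point with the same value. Unassigned slots keep the byte value.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Append one code point (below U+10000) to `out` as UTF-8.
static void AppendUtf8(std::string& out, std::uint32_t cp)
{
    assert(cp < 0x10000);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical helper: treat every byte of `bytes` as one Windows-1252
// character and return the resulting text, re-encoded as UTF-8.
std::string MisreadAsCp1252(const std::string& bytes)
{
    std::string out;
    for (unsigned char b : bytes) {
        const std::uint32_t cp =
            (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
        AppendUtf8(out, cp);
    }
    return out;
}
```

Feeding it the bytes E6 97 A5 E4 BB 98 (日付 in UTF-8) yields æ—¥ä»˜, the first word of the output shown in the question. On a Japanese system the default ANSI code page is typically 932 (Shift-JIS) rather than 1252, so a round trip through Encoding.Default is unlikely to preserve those bytes.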

Upvotes: 0
