Reputation: 175
I am using Tesseract to read Japanese text. I am getting the text below from the OCR.
日付 請求書
C++ code
extern "C" _declspec(dllexport) char* _cdecl Test(char* imagePath)
{
char *outText;
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
// Initialize tesseract-ocr with Japanese ("jpn"), using the tessdata path D:\tessdata
if (api->Init("D:\\tessdata", "jpn", tesseract::OcrEngineMode::OEM_TESSERACT_ONLY))
{
fprintf(stderr, "Could not initialize tesseract.\n");
}
api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);
outText = api->GetUTF8Text();
return outText;
}
C# code
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern string Test(string imagePath);
void Tessrect()
{
string result = Test("D:\\japan4.png");
byte[] bytes = System.Text.Encoding.Default.GetBytes(result);
MessageBox.Show(System.Text.Encoding.UTF8.GetString(bytes));
}
The above code works fine on English Windows, but it does not work on Japanese Windows: it gives the wrong output on the Japanese OS.
Can anyone guide me on how to get the correct output on Japanese Windows?
Upvotes: 0
Views: 431
Reputation: 2579
You have to create an image object from imagePath first.
In my case, this is done with a well-known library such as OpenCV. Then call the SetImage function.
void detectJpn(cv::Mat& img)
{
char *outText;
// Create Tesseract object
tesseract::TessBaseAPI *ocr = new tesseract::TessBaseAPI();
ocr->Init(NULL, "jpn", tesseract::OEM_TESSERACT_ONLY);
// Set Page segmentation mode to PSM_AUTO (3)
ocr->SetPageSegMode(tesseract::PSM_AUTO);
ocr->SetImage((uchar*)img.data, img.size().width, img.size().height, img.channels(), img.step1());
// Run Tesseract OCR on image
outText = ocr->GetUTF8Text();
// print recognized text
std::cout << outText << std::endl;
// GetUTF8Text() returns a buffer the caller must release with delete[]
delete[] outText;
// Destroy used object and release memory
//ocr->Clear();
//ocr->End();
delete ocr;
ocr = nullptr;
}
int main(int argc, char *argv[])
{
cv::Mat img = cv::imread(argv[1], cv::IMREAD_UNCHANGED);
detectJpn(img);
return 0;
}
Upvotes: 0
Reputation: 111940
The outText seems to be already in UTF-8 format:
outText = api->GetUTF8Text();
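One way to check that assumption on the native side is a small structural UTF-8 test on the returned buffer. The sketch below is only an illustration: `LooksLikeUtf8` is a hypothetical helper, not part of Tesseract, and it checks byte patterns only (it does not reject overlong encodings or surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Tesseract. Structural check only:
// it verifies lead/continuation byte patterns and does not reject
// overlong encodings or surrogate code points.
bool LooksLikeUtf8(const std::string& bytes, std::size_t* codePoints = nullptr)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;                      // continuation bytes expected
        if (lead < 0x80)                extra = 0;  // 1-byte sequence (ASCII)
        else if ((lead & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;                          // stray continuation or invalid lead

        // Reject a sequence that is cut off at the end of the buffer.
        if (extra > bytes.size() - i - 1) return false;

        // Every following byte must have the form 10xxxxxx.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
        ++count;
    }
    if (codePoints != nullptr) *codePoints = count;
    return true;
}
```

Text in a legacy code page such as Shift-JIS will typically fail this check, while the UTF-8 bytes for 日付 (E6 97 A5 E4 BB 98) pass it and count as two code points.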
Now... Returning a byte[] (or similar) from C++ is a pain... Change to:
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern IntPtr Test(string imagePath);
Then take the StringFromNativeUtf8 from here (because even converting an IntPtr that is a UTF-8 C string is a pain... .NET doesn't natively have anything like that):
void Tessrect()
{
IntPtr result = IntPtr.Zero;
string result2;
try
{
result = Test("D:\\japan4.png");
result2 = StringFromNativeUtf8(result);
}
finally
{
Free(result);
}
MessageBox.Show(result2);
}
Then you'll have to free the IntPtr... another pain.
[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern void Free(IntPtr ptr);
and
extern "C" _declspec(dllexport) void _cdecl Free(char* ptr)
{
delete[] ptr;
}
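The ownership contract behind that Free function can be sketched without Tesseract. GetUTF8Text() is documented as returning a buffer that must be released with delete[], so the release has to happen inside the same native module that allocated it. In the sketch below, `MakeUtf8Buffer` and `FreeUtf8Buffer` are hypothetical stand-ins for the exported Test and Free; they only illustrate the allocate/release pairing.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the exported Test(): returns a NUL-terminated
// UTF-8 buffer allocated with new[], so the caller owns it afterwards.
extern "C" char* MakeUtf8Buffer(const char* text)
{
    const std::size_t length = std::strlen(text);
    char* buffer = new char[length + 1];
    std::memcpy(buffer, text, length + 1);  // copy including the terminator
    return buffer;
}

// Hypothetical stand-in for the exported Free(): releases the buffer with
// the matching delete[] inside the module that allocated it.
extern "C" void FreeUtf8Buffer(char* buffer)
{
    delete[] buffer;  // delete[] on a null pointer is a no-op
}
```

Because delete[] accepts a null pointer, calling the release function from a finally block is safe even when the native call never returned a buffer.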
Upvotes: 1
Reputation: 543
You are sending UTF-8 text to a Windows system whose default encoding is not UTF-8. You need to do a conversion before displaying it.
This is the code that likely causes the issue (it uses the default system encoding, which is out of your control): byte[] bytes = System.Text.Encoding.Default.GetBytes(result);
Did you try using Encoding.UTF8 there instead?
If that alone doesn't work, try changing Encoding.UTF8 to Encoding.Default in the following line as well.
Upvotes: 0