Reputation: 2606
I am new to C++, so apologies if this is a basic question.
I have a piece of text: Павло
I get it from console output somewhere in the code I am working on. I know there is a Cyrillic word hidden behind it. Its real value is "Петро".
Using an online encoding detector, I found that to read this text properly I have to convert it from UTF-8 to Windows-1252.
How can I do this in code?
I have tried the following. It produces some output, but it prints 5 question marks (at least the length is as expected):
wchar_t *CodePageToUnicode(int codePage, const char *src)
{
    if (!src) return 0;
    int srcLen = strlen(src);
    if (!srcLen)
    {
        wchar_t *w = new wchar_t[1];
        w[0] = 0;
        return w;
    }
    int requiredSize = MultiByteToWideChar(codePage,
        0,
        src, srcLen, 0, 0);
    if (!requiredSize)
    {
        return 0;
    }
    wchar_t *w = new wchar_t[requiredSize + 1];
    w[requiredSize] = 0;
    int retval = MultiByteToWideChar(codePage,
        0,
        src, srcLen, w, requiredSize);
    if (!retval)
    {
        delete[] w;
        return 0;
    }
    return w;
}
char *UnicodeToCodePage(int codePage, const wchar_t *src)
{
    if (!src) return 0;
    int srcLen = wcslen(src);
    if (!srcLen)
    {
        char *x = new char[1];
        x[0] = '\0';
        return x;
    }
    int requiredSize = WideCharToMultiByte(codePage,
        0,
        src, srcLen, 0, 0, 0, 0);
    if (!requiredSize)
    {
        return 0;
    }
    char *x = new char[requiredSize + 1];
    x[requiredSize] = 0;
    int retval = WideCharToMultiByte(codePage,
        0,
        src, srcLen, x, requiredSize, 0, 0);
    if (!retval)
    {
        delete[] x;
        return 0;
    }
    return x;
}
int main()
{
    const char *text = "Павло";
    // Now convert utf-8 back to ANSI:
    wchar_t *wText2 = CodePageToUnicode(65001, text);
    char *ansiText = UnicodeToCodePage(1252, wText2);
    cout << ansiText;
    _getch();
}
I also tried this, but it's not working properly:
int main()
{
    const char *orig = "Павло";
    size_t origsize = strlen(orig) + 1;
    const size_t newsize = 100;
    size_t convertedChars = 0;
    wchar_t wcstring[newsize];
    mbstowcs_s(&convertedChars, wcstring, origsize, orig, _TRUNCATE);
    wcscat_s(wcstring, L" (wchar_t *)");
    std::wstring strUTF(wcstring);
    const wchar_t* szWCHAR = strUTF.c_str();
    cout << szWCHAR << '\n';
    char *buffer = new char[origsize / 2 + 1];
    WideCharToMultiByte(CP_ACP, 0, szWCHAR, -1, buffer, 256, NULL, NULL);
    cout << buffer;
    _getch();
}
Upvotes: 0
Views: 3622
Reputation: 31599
This is a printing issue. Your first function is correct; you can test it with MessageBoxW:
wchar_t *wbuf = CodePageToUnicode(CP_UTF8, "Павло");
if (wbuf)
{
    MessageBoxW(0, wbuf, 0, 0);
    delete[] wbuf; // free the buffer returned by CodePageToUnicode
}
Output: "Павло" (not the same as what you said!)
You can print wide characters with std::wcout, or simplify the function to print using code page 1251 as follows:
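#include <iostream>
#include <string>
#include <Windows.h>

int main()
{
    // Assumes the literal is stored as the UTF-8 bytes of the Cyrillic text.
    const char *buf = "Павло";

    // UTF-8 -> UTF-16. With length -1 the returned size includes the null terminator.
    int size = MultiByteToWideChar(CP_UTF8, 0, buf, -1, 0, 0);
    std::wstring wstr(size, 0);
    MultiByteToWideChar(CP_UTF8, 0, buf, -1, &wstr[0], size);

    // UTF-16 -> Windows-1251 (Cyrillic).
    int codepage = 1251;
    size = WideCharToMultiByte(codepage, 0, &wstr[0], -1, 0, 0, 0, 0);
    std::string str(size, 0);
    WideCharToMultiByte(codepage, 0, &wstr[0], -1, &str[0], size, 0, 0);

    // Make the console interpret the output bytes as Windows-1251.
    SetConsoleOutputCP(codepage);
    std::cout << str << "\n";
    return 0;
}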
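If you want to confirm which word the bytes really encode without involving the console or a message box, you can decode them by hand. In Windows-1252 the characters Ð, Ÿ, °, ², » and ¾ are the bytes 0xD0, 0x9F, 0xB0, 0xB2, 0xBB and 0xBE, so the garbled text is the byte sequence D0 9F D0 B0 D0 B2 D0 BB D0 BE. Below is a minimal, portable sketch that decodes such bytes as UTF-8 into code points. The helper name decodeUtf8 is made up for illustration (it is not part of the Windows API or of the code above), and it skips some validation a production decoder would need.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper for illustration: decodes UTF-8 bytes into Unicode code points.
// It rejects bad lead bytes, bad continuation bytes and truncated input, but it
// does NOT check for overlong encodings or surrogate code points.
std::vector<char32_t> decodeUtf8(const std::string &bytes)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes that follow
        char32_t cp = 0;
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Calling decodeUtf8("\xD0\x9F\xD0\xB0\xD0\xB2\xD0\xBB\xD0\xBE") yields U+041F, U+0430, U+0432, U+043B, U+043E, which spells "Павло". That matches what MessageBoxW shows and supports the point that the conversion itself works and only the printing goes wrong.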
Upvotes: 2
Reputation: 2937
There are a few options:
- Using the Windows API: convert your UTF-8 to system UTF-16LE using MultiByteToWideChar, and then from UTF-16LE to CP1251 (Cyrillic is 1251, not 1252) using WideCharToMultiByte
- Using the MS MLang API
- Using the GNU iconv library
- Using IBM ICU
If you simply need to output your Unicode text to the console, check this
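For the iconv option, a rough sketch could look like the one below. It assumes a POSIX-style iconv (for example the one in glibc) that knows the encoding names "UTF-8" and "CP1251"; the exact names accepted can differ between implementations, and some platforms need linking with -liconv. The function name utf8ToCp1251 is hypothetical and only used for illustration.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration: converts UTF-8 bytes to CP1251 using iconv.
// The input is taken by value because iconv() wants a non-const char** on glibc.
std::string utf8ToCp1251(std::string utf8)
{
    if (utf8.empty()) return std::string();

    // Note the argument order: target encoding first, source encoding second.
    iconv_t cd = iconv_open("CP1251", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: encoding name not supported");

    // CP1251 is a single-byte encoding, so the output is never longer than the input.
    std::string out(utf8.size(), '\0');
    char *inPtr = &utf8[0];
    std::size_t inLeft = utf8.size();
    char *outPtr = &out[0];
    std::size_t outLeft = out.size();

    std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed: invalid input or unmappable character");

    out.resize(out.size() - outLeft);  // keep only the bytes actually written
    return out;
}
```

With the bytes from the question, utf8ToCp1251("\xD0\x9F\xD0\xB0\xD0\xB2\xD0\xBB\xD0\xBE") should give the five CP1251 bytes CF E0 E2 EB EE. The console still has to be set to the matching code page before those bytes display as Cyrillic.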
Upvotes: 4