PaulD

Reputation: 212

Converting a string of multibyte characters to wide characters gives unexpected results

I'm trying to read a UTF-8 encoded web page using the WinInet library.

Here's some of my code:

HINTERNET hUrl = ::InternetOpenUrl(hInet, wurl.c_str(),NULL,NULL,NULL,NULL);
    CHAR buffer[65536];
    std::wstring full_content;
    std::wstring read_content;
    DWORD number_of_bytes_read=1;

    while(number_of_bytes_read)
    {
        ::InternetReadFile(hUrl, buffer, 65536, &number_of_bytes_read);
    //  ::InternetReadFileExW(hUrl, &buffersw, IRF_SYNC,NULL);
            //((hUrl,buffer,65536,&number_of_bytes_read);
        read_content.resize(number_of_bytes_read);

        ::MultiByteToWideChar(CP_ACP,MB_COMPOSITE,
                     &buffer[0],number_of_bytes_read,
                     &read_content[0],number_of_bytes_read);
        full_content.append(read_content);
        //readed_content.append(buffer,number_of_bytes_read);
    }

English characters display correctly, but Russian characters come out as garbage. What could be causing this?
Thanks in advance.

Upvotes: 1

Views: 649

Answers (3)

Pavel Radzivilovsky

Reputation: 19114

Do not convert at all. Keep it UTF-8 in memory. Convert to UTF-16 only when interacting with Windows API functions.

More info on this approach at http://utf8everywhere.org.
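As a rough sketch of what "keep it UTF-8 in memory" could look like for a chunked download, the loop below appends the raw bytes to a std::string and does no conversion per chunk. This also means a multibyte sequence that happens to be split across two reads is simply reassembled. The names read_all_utf8 and ReadChunk are hypothetical; in the question's code the callback would wrap the ::InternetReadFile call.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical chunk source: writes up to `cap` bytes into `buf` and returns
// how many bytes it wrote; returning 0 means there is no more data.
using ReadChunk = std::function<std::size_t(char* buf, std::size_t cap)>;

// Collects every chunk into one UTF-8 byte string without decoding anything.
std::string read_all_utf8(const ReadChunk& read_chunk) {
    std::string utf8;
    char buffer[4096];
    for (;;) {
        const std::size_t n = read_chunk(buffer, sizeof buffer);
        if (n == 0) {
            break;  // end of data
        }
        // Raw byte append: chunk boundaries do not matter here, even if one
        // falls in the middle of a multibyte character.
        utf8.append(buffer, n);
    }
    return utf8;
}
```

The resulting std::string can then be converted to UTF-16 once, at the point where a Windows API function actually needs it.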

Upvotes: 1

john

Reputation: 8027

Change CP_ACP to CP_UTF8 and MB_COMPOSITE to 0.

From the docs:

For UTF-8 or code page 54936 (GB18030, starting with Windows Vista), dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS.
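A minimal sketch of that change, assuming the whole page has already been collected as one UTF-8 byte string. The helper name utf8_to_wide is hypothetical. It calls MultiByteToWideChar twice, first to get the required length and then to convert, because the number of wide characters is usually smaller than the number of input bytes for non-ASCII text. The non-Windows branch is only a simplified stand-in so the sketch compiles elsewhere; it assumes a 32-bit wchar_t and does not reject every malformed sequence.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: converts a complete UTF-8 byte string to std::wstring.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) {
        return std::wstring();
    }
#ifdef _WIN32
    const int in_len = static_cast<int>(utf8.size());
    // First call with no output buffer: returns the required length in wide chars.
    const int out_len = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                              utf8.data(), in_len, nullptr, 0);
    if (out_len <= 0) {
        throw std::runtime_error("invalid UTF-8 input");
    }
    std::wstring wide(static_cast<std::size_t>(out_len), L'\0');
    // Second call: convert into a buffer of exactly that length.
    ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                          utf8.data(), in_len, &wide[0], out_len);
    return wide;
#else
    // Stand-in decoder for non-Windows builds (assumes 32-bit wchar_t).
    std::wstring wide;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 0;   // total bytes in this sequence
        unsigned long cp = 0;  // code point being assembled
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 input");
        if (i + len > utf8.size()) {
            throw std::runtime_error("truncated UTF-8 input");
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 input");
            }
            cp = (cp << 6) | (c & 0x3F);  // each continuation byte adds 6 bits
        }
        wide.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return wide;
#endif
}
```

Converting the complete string rather than each 64 KB chunk avoids feeding the function half of a character when a read ends in the middle of a multibyte sequence.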

Upvotes: 1

user1773602

Reputation:

Your web page is UTF-8, yet you decode it using the ANSI code page (CP_ACP). Use CP_UTF8 instead.

Upvotes: 3
