Virus721

Reputation: 8315

C++ / wcout / UTF-8

I'm reading a UTF-8 encoded Unicode text file and printing it to the console, but the displayed characters are not the same as in the text editor I used to create the file. Here is my code:

#define UNICODE

#include <windows.h>
#include <iostream>
#include <fstream>
#include <string>

#include "pugixml.hpp"

using std::ifstream;
using std::ios;
using std::string;
using std::wcout;
using std::wstring;

int main( int argc, char * argv[] )
{
    ifstream oFile;

    try
    {
        string sContent;

        oFile.open ( "../config-sample.xml", ios::in );

        if( oFile.is_open() )
        {
            wchar_t wsBuffer[128];

            while( oFile.good() )
            {
                oFile >> sContent;
                // The third argument is a count of wchar_t elements, not a size in bytes.
                mbstowcs( wsBuffer, sContent.c_str(), sizeof( wsBuffer ) / sizeof( wsBuffer[0] ) );
              //wprintf( wsBuffer );// Same result as wcout.
                wcout << wsBuffer;
            }

            Sleep(100000);
        }
        else
        {
            throw L"Failed to open file";
        }
    }
    catch( const wchar_t * pwsMsg )
    {
        ::MessageBox( NULL, pwsMsg, L"Error", MB_OK | MB_TOPMOST | MB_SETFOREGROUND );
    }

    if( oFile.is_open() )
    {
        oFile.close();
    }

    return 0;
}

There must be something I don't get about encoding.

Upvotes: 0

Views: 1240

Answers (4)

yu yang Jian

Reputation: 7165

I find that wifstream works very well; even the Visual Studio debugger shows the UTF-8 text correctly (I'm reading Traditional Chinese text). The code below is from another post:

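#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

// Reads a UTF-8 encoded file and returns its content as a wide string.
std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // std::locale::empty() is a Microsoft extension; the facet decodes UTF-8 on read.
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}

// Usage
int main()
{
    std::wstring wstr2 = readFile("C:\\yourUtf8File.txt");
    std::wcout << wstr2;
    return 0;
}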

Upvotes: 0

Nash Bean

Reputation: 1

I made a C++ char_t container that holds up to six 8-bit char_t values, stored in a std::vector. It can be converted to and from wchar_t, or appended to a std::string.

Check it out here: UTF-8_String structures on GitHub

#include "UTF-8_String.h" //header from github link above

iBS::u8str  raw_v;
iBS::readu8file("TestUTF-8File.txt",raw_v);
std::cout<<raw_v.str()<<std::endl;

Here is the function that converts a wchar_t to a uint32_t in the u8char struct found in the header above.

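    #include <climits>  // MB_LEN_MAX
    #include <cwchar>

    u8char& operator=(wchar_t& wc)
    {
        char temp[MB_LEN_MAX];
        std::mbstate_t state = std::mbstate_t();  // must start in the initial state
        std::size_t ret = std::wcrtomb(temp, wc, &state);
        if (ret == static_cast<std::size_t>(-1))
            ret = 0;  // wc has no representation in the current locale
        ref.resize(ret);
        for (std::size_t i = 0; i < ret; ++i)
            ref[i] = temp[i];
        return *this;
    }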

Upvotes: 0

Mats Petersson

Reputation: 129314

The problem is that mbstowcs doesn't actually use UTF-8. It converts using the multibyte encoding of the current C locale, an older style of "multibyte code pages" that is not compatible with UTF-8 (although it is technically possible [I believe] to define a UTF-8 locale, the Windows C runtime has traditionally not offered one).

If you want to convert UTF-8 to UTF-16, you can use MultiByteToWideChar, with a codepage of CP_UTF8.
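
As a rough sketch of that conversion (utf8_to_wide is a hypothetical helper name, not something from the question), the usual pattern is to call MultiByteToWideChar twice: once to get the required length, then once to fill the buffer. The Windows calls are guarded by _WIN32, so on other platforms this sketch simply returns an empty string.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: converts UTF-8 bytes to a UTF-16 std::wstring on Windows.
// Returns an empty string on failure, and always on non-Windows builds.
std::wstring utf8_to_wide(const std::string& utf8)
{
#ifdef _WIN32
    if (utf8.empty())
        return std::wstring();

    const int srcLen = static_cast<int>(utf8.size());

    // First call: ask how many wchar_t units the result needs.
    const int dstLen = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                             utf8.data(), srcLen, nullptr, 0);
    if (dstLen <= 0)
        return std::wstring();  // invalid UTF-8 or another failure

    // Second call: perform the conversion into the sized buffer.
    std::wstring wide(static_cast<std::size_t>(dstLen), L'\0');
    ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                          utf8.data(), srcLen, &wide[0], dstLen);
    return wide;
#else
    (void)utf8;  // no Windows API available in this build
    return std::wstring();
#endif
}
```

In the question's loop, this would take the place of the mbstowcs call, with the returned wstring passed on to the wide output function.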

Upvotes: 3

zneak

Reputation: 138031

Wide strings don't mean UTF-8. In fact, it's quite the opposite: UTF-8 stands for Unicode Transformation Format (8-bit); it's a way to represent Unicode as a sequence of 8-bit code units, i.e. your normal chars. You should read it into normal strings (not wide strings).
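
A minimal sketch of reading the file into a normal string, assuming you just want the raw UTF-8 bytes untouched (read_bytes is a made-up name for illustration). Opening in binary mode keeps the runtime from translating line endings, and reading through rdbuf() avoids the whitespace splitting that operator>> does.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the file's raw bytes (here, UTF-8) in a std::string.
// Returns an empty string if the file cannot be opened or is empty.
std::string read_bytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy every byte, no decoding and no whitespace skipping
    return buffer.str();
}
```

Note that size() on the result counts bytes, not characters: a non-ASCII character takes two to four bytes in UTF-8.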

Wide strings use wchar_t, which on Windows is 16 bits. The OS uses UTF-16 for its "wide" functions.

On Windows, UTF-8 strings can be converted to UTF-16 using MultiByteToWideChar.
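
To make the difference between the two encodings concrete, here is a portable sketch of what such a conversion does under the hood. utf8_to_utf16 is a hypothetical helper, and it assumes well-formed input: it does not check continuation bytes, overlong forms or encoded surrogates, so treat it as an illustration rather than a replacement for the OS function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into UTF-16 code units.
// Stops at the first invalid lead byte or truncated sequence.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;

        // The lead byte tells how many bytes the sequence has.
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
        else break;                         // not a valid lead byte

        if (i + len > utf8.size()) break;   // truncated sequence

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += len;

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));  // fits in one 16-bit unit
        }
        else
        {
            cp -= 0x10000;                             // needs a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the three UTF-8 bytes E4 B8 AD become the single UTF-16 unit 0x4E2D, while a character outside the Basic Multilingual Plane takes four UTF-8 bytes and two UTF-16 units.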

Upvotes: 3
