Reputation: 8315
I'm reading a UTF-8 encoded Unicode text file and writing it to the console, but the displayed characters are not the same as in the text editor I used to create the file. Here is my code:
#define UNICODE
#include <windows.h>
#include <iostream>
#include <fstream>
#include <string>
#include "pugixml.hpp"
using std::ifstream;
using std::ios;
using std::string;
using std::wstring;
using std::wcout;
int main( int argc, char * argv[] )
{
    ifstream oFile;
    try
    {
        string sContent;
        oFile.open ( "../config-sample.xml", ios::in );
        if( oFile.is_open() )
        {
            wchar_t wsBuffer[128] = { 0 };
            // Test the read itself so a failed extraction is not processed.
            while( oFile >> sContent )
            {
                // The third argument is a count of wide characters, not bytes;
                // one slot is kept free so the buffer stays null-terminated.
                mbstowcs( wsBuffer, sContent.c_str(), sizeof( wsBuffer ) / sizeof( wsBuffer[0] ) - 1 );
                //wprintf( wsBuffer );// Same result as wcout.
                wcout << wsBuffer;
            }
            Sleep(100000);
        }
        else
        {
            throw L"Failed to open file";
        }
    }
    catch( const wchar_t * pwsMsg )
    {
        ::MessageBox( NULL, pwsMsg, L"Error", MB_OK | MB_TOPMOST | MB_SETFOREGROUND );
    }
    if( oFile.is_open() )
    {
        oFile.close();
    }
    return 0;
}
There must be something I don't get about encoding.
Upvotes: 0
Views: 1240
Reputation: 7165
I find that wifstream works very well; even the Visual Studio debugger shows the UTF-8 text correctly (I'm reading Traditional Chinese text). The code is from this post:
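#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>
std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // Decode the file's bytes as UTF-8 while reading.
    // wif.getloc() is used as the base locale because std::locale::empty() is an MSVC-only extension.
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}
// usage
std::wstring wstr2;
wstr2 = readFile("C:\\yourUtf8File.txt");
std::wcout << wstr2;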
Upvotes: 0
Reputation: 1
I made a C++ char_t container that holds up to six 8-bit char_t values, stored in a std::vector. It can be converted to and from wchar_t, or appended to a std::string.
Check it out here: View UTF-8_String structures on Github
#include "UTF-8_String.h" //header from github link above
iBS::u8str raw_v;
iBS::readu8file("TestUTF-8File.txt",raw_v);
std::cout<<raw_v.str()<<std::endl;
Here is the function that converts a wchar_t to a uint32_t in the u8char struct found in the header above.
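#include <climits> // MB_LEN_MAX
#include <cwchar>
// Member of the u8char struct; ref is its byte container.
u8char& operator=(wchar_t& wc)
{
    char temp[MB_LEN_MAX];
    std::mbstate_t state{}; // the conversion state must start zero-initialized
    // wcrtomb encodes with the current C locale, so the result is UTF-8 only under a UTF-8 locale.
    std::size_t ret = std::wcrtomb(temp, wc, &state);
    if (ret == static_cast<std::size_t>(-1))
        ret = 0; // wc cannot be represented in the current locale
    ref.resize(ret);
    for (std::size_t i = 0; i < ret; ++i)
        ref[i] = temp[i];
    return *this;
}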
Upvotes: 0
Reputation: 129314
The problem is that mbstowcs doesn't actually use UTF-8. It converts using the multibyte encoding of the current C locale, an older style of "multibyte codepoints" that is not compatible with UTF-8 (although it is technically possible [I believe] to define a UTF-8 codepage, there is no such thing in Windows).
If you want to convert UTF-8 to UTF-16, you can use MultiByteToWideChar with a codepage of CP_UTF8.
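As a rough illustration of that suggestion, here is a minimal sketch of a wrapper around MultiByteToWideChar. The function name Utf8ToWide and the error handling are my own assumptions, not part of the original code. It uses the usual two-call pattern: the first call asks for the required length, the second performs the conversion. Outside Windows the sketch only compiles; it does not convert anything.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (name chosen for illustration): UTF-8 bytes -> UTF-16 wide string.
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
#ifdef _WIN32
    const int srcLen = static_cast<int>(utf8.size());
    // First call: a null output buffer makes the API return the number
    // of UTF-16 code units needed. MB_ERR_INVALID_CHARS makes it fail
    // on malformed input instead of silently substituting characters.
    const int wideLen = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                              utf8.data(), srcLen, nullptr, 0);
    if (wideLen <= 0)
        throw std::runtime_error("Utf8ToWide: input is not valid UTF-8");
    std::wstring wide(static_cast<std::size_t>(wideLen), L'\0');
    // Second call: perform the actual conversion into the sized buffer.
    const int written = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                              utf8.data(), srcLen, &wide[0], wideLen);
    if (written != wideLen)
        throw std::runtime_error("Utf8ToWide: conversion failed");
    return wide;
#else
    // Assumption: this sketch targets Windows only.
    throw std::runtime_error("Utf8ToWide: this sketch relies on the Windows API");
#endif
}
```

In the question's program, this would replace the mbstowcs call, for example Utf8ToWide(sContent). Whether the console then shows the characters correctly still depends on the console's own settings and font.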
Upvotes: 3
Reputation: 138031
Wide strings don't mean UTF-8. In fact, it's quite the opposite: UTF-8 means Unicode Transformation Format (8 bits); it's a way to represent Unicode over 8-bit characters, so your normal chars. You should read it into normal strings (not wide strings).
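To make "read it into normal strings" concrete, here is a small sketch that loads a whole file into a std::string without interpreting the bytes. The helper name ReadUtf8File is hypothetical, and stripping a leading byte order mark is my own assumption about what is convenient, not something the question requires.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the file's bytes unchanged, so the
// resulting std::string still holds UTF-8.
std::string ReadUtf8File(const std::string& path)
{
    // Binary mode keeps the stream from translating line endings.
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        throw std::runtime_error("failed to open " + path);
    std::ostringstream buffer;
    buffer << in.rdbuf(); // copy the whole file, whitespace included
    std::string text = buffer.str();
    // Assumption: drop a leading UTF-8 byte order mark (EF BB BF) if present.
    if (text.compare(0, 3, "\xEF\xBB\xBF") == 0)
        text.erase(0, 3);
    return text;
}
```

Unlike extracting with operator>> token by token, this keeps the whitespace between words, and the bytes can then be handed to a UTF-8 to UTF-16 conversion in one piece.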
Wide strings use wchar_t, which on Windows is 16 bits. The OS uses UTF-16 for its "wide" functions.
On Windows, UTF-8 strings can be converted to UTF-16 using MultiByteToWideChar.
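To show what such a conversion does under the hood, here is a simplified, portable sketch that decodes UTF-8 by hand into UTF-16 code units. It is not how MultiByteToWideChar is implemented; the name Utf8ToUtf16 is hypothetical, and the validation is deliberately incomplete (overlong encodings and encoded surrogates are not rejected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes and re-encode them as UTF-16.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;       // decoded code point
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= utf8.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F); // each continuation byte adds 6 bits
        }
        i += extra + 1;
        if (cp > 0x10FFFF)
            throw std::runtime_error("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp)); // one UTF-16 unit
        } else {
            cp -= 0x10000; // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The sketch makes the size difference visible: a character can take one to four chars in UTF-8, but one or two 16-bit units in UTF-16, which is why copying UTF-8 bytes into a wide buffer without decoding them gives the wrong characters.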
Upvotes: 3