NoSenseEtAl

Reputation: 29996

How to write and read UTF16 file on Win using C++

There are plenty of questions on SO about this, but most of them do not cover writing a wstring back to a file. For example, I found this for reading:

// open as a byte stream
std::wifstream fin("/testutf16.txt", std::ios::binary);
// apply BOM-sensitive UTF-16 facet
fin.imbue(std::locale(fin.getloc(),
    new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
// read  
std::wstring ws;
for(wchar_t c; fin.get(c); )
{
    std::cout << std::showbase << std::hex << c << '\n';
    ws.push_back(c);
}

I tried similar stuff for writing:

    std::wofstream wofs("/utf16dump.txt", std::ios::binary);
    wofs.imbue(std::locale(wofs.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    wofs << ws;

but it produces garbage (or at least Notepad++ and vim can't interpret it). As mentioned in the title, I'm on Windows, native C++, VS 2010.

Input file:

t€stUTF16✡
test

This is the result:

t€stUTF16✡
test

Converted to hex:

0000000: 7400 ac20 7300 7400 5500 5400 4600 3100  t.. s.t.U.T.F.1.
0000010: 3600 2127 0d00 0a00 7400 6500 7300 7400  6.!'....t.e.s.t.
0000020: 0a                                       
                     ...

vim normal output:

t^@¬ s^@t^@U^@T^@F^@1^@6^@!'^M^@ ^@t^@e^@s^@t^@

EDIT: I ended up using UTF8. Andrei Alexandrescu says it is the best encoding so no big loss. :)

Upvotes: 1

Views: 7023

Answers (3)

Yarkov Anton

Reputation: 679

It is easy if you use the C++11 standard (because it adds a lot of new includes, like "utf8", that solve these problems for good).

But if you want to use multi-platform code with older standards, you can use this method to write with streams:

  1. Read the article about UTF converter for streams
  2. Add stxutif.h to your project from sources above
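  3. Open the file as a plain byte stream (binary mode) and write the UTF-8 BOM at the start of the file, like this:

    std::ofstream fs;
    fs.open(filepath, std::ios::out|std::ios::binary);
    
    // UTF-8 byte order mark
    const unsigned char smarker[3] = { 0xEF, 0xBB, 0xBF };
    
    // write exactly three bytes; operator<< would treat the array as a
    // null-terminated string and run past its end
    fs.write(reinterpret_cast<const char*>(smarker), sizeof(smarker));
    fs.close();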
    
  4. Then open the file as UTF and write your content there:

    std::wofstream fs;
    fs.open(filepath, std::ios::out|std::ios::app);
    
    std::locale utf8_locale(std::locale(), new utf8cvt<false>);
    fs.imbue(utf8_locale); 
    
    // fs << ...;  (write any wide-character content you want here)
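
For comparison, here is a minimal sketch of the C++11 route mentioned at the top of this answer, using only the standard library instead of stxutif.h. It assumes a toolchain that still ships `<codecvt>` (deprecated since C++17); the helper name `write_utf8_bom` is hypothetical and only there for illustration.

```cpp
#include <codecvt>   // std::codecvt_utf8; deprecated since C++17
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper (not part of stxutif.h): writes `text` to `path`
// as UTF-8 preceded by the EF BB BF byte order mark.
// Returns false if the file could not be opened or written.
bool write_utf8_bom(const char* path, const std::wstring& text)
{
    std::wofstream out;
    // generate_header makes the facet emit the BOM itself, so the file
    // does not have to be opened twice.
    out.imbue(std::locale(out.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10ffff, std::generate_header>));
    out.open(path, std::ios::out | std::ios::binary);
    if (!out)
        return false;
    out << text;
    out.flush();
    return out.good();
}
```

Calling `write_utf8_bom("out.txt", L"t\u20ACst")` should then produce the BOM followed by the UTF-8 encoding of the text.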
    

Upvotes: 2

Ben Voigt

Reputation: 283614

For output, you want to use generate_header instead of consume_header.

See http://en.cppreference.com/w/cpp/locale/codecvt_mode
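
A minimal sketch of what that change could look like, assuming a toolchain that still ships `<codecvt>` (deprecated since C++17). The helper name `write_utf16le` and the `std::little_endian` flag are my own illustrative choices, not taken from the question; without `little_endian` the facet writes big-endian.

```cpp
#include <codecvt>   // std::codecvt_utf16; deprecated since C++17
#include <fstream>
#include <locale>
#include <string>

// UTF-16 facet that emits a BOM and uses little-endian byte order.
// The two mode flags are OR-ed and cast back to std::codecvt_mode.
typedef std::codecvt_utf16<
    wchar_t, 0x10ffff,
    static_cast<std::codecvt_mode>(std::generate_header | std::little_endian)>
    Utf16LeWithBom;

// Hypothetical helper (not from the question): writes `text` to `path`
// as UTF-16LE preceded by a BOM. Returns false if the stream failed.
bool write_utf16le(const char* path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening so the facet is in place for the first write.
    out.imbue(std::locale(out.getloc(), new Utf16LeWithBom));
    // Binary mode keeps the runtime from translating newline bytes.
    out.open(path, std::ios::out | std::ios::binary);
    if (!out)
        return false;
    out << text;
    out.flush();
    return out.good();
}
```

With this, `write_utf16le("/utf16dump.txt", ws)` would replace the three lines from the question.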

Upvotes: 1

Ben Voigt

Reputation: 283614

Your similar code -- isn't. You removed the std::ios::binary style, despite the fact that the documentation says

The byte stream should be written to a binary file; it can be corrupted if written to a text file.

NL->CRLF conversion in text mode isn't going to do pretty things to UTF-16 files, since it will insert one byte 0x0D instead of two bytes 0x00 0x0D.
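
To make the corruption concrete, here is a small simulation that runs on any platform. It does not touch a real file: `text_mode_translate` is a hypothetical stand-in for what a Windows text-mode stream does to outgoing bytes, and the sample bytes assume little-endian UTF-16.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for text-mode output: every 0x0A byte becomes 0x0D 0x0A,
// with no knowledge that the bytes are really 16-bit code units.
std::string text_mode_translate(const std::string& bytes)
{
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\n')
            out.push_back('\r');
        out.push_back(bytes[i]);
    }
    return out;
}

// Reassembles little-endian byte pairs into 16-bit code units.
// A trailing unpaired byte is ignored.
std::vector<unsigned> decode_utf16le_units(const std::string& bytes)
{
    std::vector<unsigned> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned lo = static_cast<unsigned char>(bytes[i]);
        unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(lo | (hi << 8));
    }
    return units;
}
```

Feeding it the six bytes of "a\nb" in UTF-16LE (61 00 0A 00 62 00) gives seven bytes (61 00 0D 0A 00 62 00). Everything after the inserted 0x0D is shifted by one byte, so the decoder sees 0x0061, 0x0A0D, 0x6200 instead of 0x0061, 0x000A, 0x0062.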

Upvotes: 3
