MrM21632

Reputation: 27

Issues with Wide Characters in C++

I have a program that is meant to read in a text file of words (each on a separate line), and then print out a random word from that file. It also gives you the ability to select a non-English language (e.g., Greek or Russian). Because of the latter condition, I use std::wstring to capture the text. Here is the code:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/random_device.hpp>
#include <boost/random/uniform_int_distribution.hpp>


int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cout << "Usage: word [lang]" << std::endl;
        std::cout << "\tlang: Choose from de,en,es,fr,gr,it,la,ru" << std::endl;
        return EXIT_FAILURE;
    }

    std::string file = std::string("C:\\util_bin\\data\\words_") + argv[1] + ".txt";
    std::wfstream fin(file, std::wifstream::in);

    std::vector<std::wstring> data;
    std::wstring line;
    while (std::getline(fin, line))
        data.push_back(line);
    int size = data.size();

    boost::random::random_device rd;
    boost::random::mt19937 mt(rd());
    boost::random::uniform_int_distribution<int> dist(0, size - 1);

    std::wcout << data[dist(mt)] << std::endl;
}

This code compiles just fine; however, when I run it with Russian (for instance), I just get garbage text:

C:\util_bin>word ru
������������

C:\util_bin>

I'm not all that familiar with the ins and outs of wide characters in C++, so I can't really discern what's going wrong. Anyone have any ideas?

Upvotes: 0

Views: 542

Answers (2)

Spencer

Reputation: 2214

I'm going to guess you're using Visual Studio. This is a quirk of the implementation of std::basic_filebuf in Windows. From the relevant MSDN page:

Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer. To store Unicode strings in the buffer, create a new buffer of type wchar_t and set it using the basic_streambuf::pubsetbuf() method.

As it was explained to me, the filebuf is implemented with a FILE*; there is an internal flag that performs the ANSI conversion whether you want it or not, and you can't clear the flag except by allocating and setting your own buffer (via pubsetbuf). Putting a codecvt in your locale won't do it. The pubsetbuf call has to happen right after a successful file open. Really, infuriatingly intrusive. I wound up having to write a wrapper class (which wasn't so bad, as it gave you the ability to store the file name before opening).

You can also open the file with std::ios::binary. Some people recommend that you always do that. But opening the file that way probably means doing your own code conversions before inserting into a stream or extracting from it.
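
To give an idea of what "doing your own code conversions" could look like, here is a minimal sketch. It assumes the word files are UTF-8 (the question doesn't say), and it swaps the wide stream for a plain narrow std::ifstream opened in binary mode. The names decode_utf8 and read_utf8_lines are made up for illustration; malformed input is replaced with U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Invalid lead bytes, bad continuation bytes, and truncated sequences
// are replaced with U+FFFD. Overlong forms are not detected.
std::u32string decode_utf8(const std::string& bytes) {
    const char32_t kReplacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(kReplacement); ++i; continue; }

        // Sequence runs past the end of the input: emit one replacement and stop.
        if (extra > bytes.size() - i - 1) { out.push_back(kReplacement); break; }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(kReplacement); ++i; }
    }
    return out;
}

// Hypothetical helper: read a file as raw bytes, one decoded line per entry.
std::vector<std::u32string> read_utf8_lines(const std::string& path) {
    std::ifstream fin(path, std::ios::binary);  // no newline or charset translation
    std::vector<std::u32string> lines;
    std::string raw;
    while (std::getline(fin, raw)) {
        // Binary mode keeps the '\r' of Windows line endings, so drop it here.
        if (!raw.empty() && raw.back() == '\r') raw.pop_back();
        lines.push_back(decode_utf8(raw));
    }
    return lines;
}
```

Calling read_utf8_lines on one of the word files would give a vector you can index the same way as data in the question. Getting the result onto the console is a separate conversion step that this sketch leaves out.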

Upvotes: 2

Vada Poché

Reputation: 780

After you instantiate your wfstream object, call imbue on it like this:

fin.imbue( std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>) );
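
For context, a slightly fuller sketch of the same idea, assuming the file is UTF-8. Two things differ from the one-liner and are my assumptions rather than part of the answer: std::locale::empty() is, as far as I know, a Visual C++ extension, so the sketch builds the new locale from fin.getloc() instead, and the helper name read_wide_lines is made up for illustration. Note that std::codecvt_utf8 lives in <codecvt>, which has been deprecated since C++17.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: read a UTF-8 text file into wide strings, one per line.
std::vector<std::wstring> read_wide_lines(const std::string& path) {
    std::wifstream fin(path);
    // Imbue right after opening, before anything is read. Only the codecvt
    // facet of the stream's current locale is replaced. The locale takes
    // ownership of the facet, so there is no matching delete.
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8<wchar_t>));

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(fin, line)) {
        // Drop a trailing '\r' in case the file has Windows line endings
        // that were not translated on the way in.
        if (!line.empty() && line.back() == L'\r') line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

This only covers reading the file. Whether std::wcout then shows the text correctly depends on how the console is set up.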

Upvotes: 0
