Reputation: 27
I have a program that is meant to read in a text file of words (each on a separate line), and then print out a random word from that file. It also gives you the ability to select a non-English language (e.g., Greek or Russian). Because of the latter condition, I use std::wstring to capture the text. Here is the code:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/random_device.hpp>
#include <boost/random/uniform_int_distribution.hpp>

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cout << "Usage: word [lang]" << std::endl;
        std::cout << "\tlang: Choose from de,en,es,fr,gr,it,la,ru" << std::endl;
        return EXIT_FAILURE;
    }
    std::string file = static_cast<std::string>("C:\\util_bin\\data\\words_") + static_cast<std::string>(argv[1]) + static_cast<std::string>(".txt");
    std::wfstream fin(file, std::wifstream::in);
    std::vector<std::wstring> data;
    std::wstring line;
    while (std::getline(fin, line))
        data.push_back(line);
    int size = data.size();
    boost::random::random_device rd;
    boost::random::mt19937 mt(rd());
    boost::random::uniform_int_distribution<int> dist(0, size - 1);
    std::wcout << data[dist(mt)] << std::endl;
}
This code compiles just fine; however, when I run it with Russian (for instance), I just get garbage text:
C:\util_bin>word ru
������������
C:\util_bin>
I'm not all that familiar with the ins and outs of wide characters in C++, so I can't really discern what's going wrong. Anyone have any ideas?
Upvotes: 0
Views: 542
Reputation: 2214
I'm going to guess you're using Visual Studio. This is a quirk of the implementation of std::basic_filebuf on Windows. From the relevant MSDN page:
Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer. To store Unicode strings in the buffer, create a new buffer of type wchar_t and set it using the basic_streambuf::pubsetbuf() method.
As it was explained to me, the filebuf is implemented with a FILE*; there is an internal flag that performs the ANSI conversion whether you want it or not, and you can't clear the flag except by allocating and setting your own buffer (via pubsetbuf). Putting a codecvt in your locale won't do it. It has to happen right after a successful file open. Really, infuriatingly intrusive. I wound up having to write a wrapper class (which wasn't so bad, as it gave you the ability to store the file name before opening).
You can also open the file in binary mode (std::ios::binary). Some people recommend that you always do that. But opening the file that way probably means doing your own code conversions before inserting into a stream or extracting from it.
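To give an idea of what "doing your own code conversions" could look like, here is a minimal sketch that reads the file as raw bytes in binary mode and decodes them by hand. It assumes the word list is stored as UTF-8 (the question doesn't say which encoding is used). The names read_all_bytes and decode_utf8 are made up for illustration, and the decoder is deliberately simple: it rejects truncated sequences and bad continuation bytes, but does not check for overlong encodings or surrogates. It produces std::u32string rather than std::wstring, since wchar_t is only 16 bits wide on Windows.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Read a whole file as raw bytes; binary mode means no newline or
// character-set translation is applied by the stream.
// (Hypothetical helper, not part of the original program.)
std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Decode a UTF-8 byte string into UTF-32 code points.
// Assumption: the input is meant to be UTF-8. Throws on truncated or
// malformed sequences; overlong forms and surrogates are NOT rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the low 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Something like decode_utf8(read_all_bytes(file)) would then yield the code points, which you could split on U'\n' to recover the individual words.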
Upvotes: 2
Reputation: 780
After you instantiate your wfstream object, call imbue on it like this (std::codecvt_utf8 is declared in <codecvt>):
fin.imbue( std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>) );
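For context, here is a rough sketch of how that call might fit into the reading loop from the question. It assumes the word files are UTF-8 encoded, which the question doesn't confirm. read_utf8_lines is a hypothetical helper name. The sketch uses the standard std::locale() constructor in place of std::locale::empty(), and it imbues before opening the file, which is just a conservative choice on my part. Note that std::codecvt_utf8 is deprecated as of C++17, so newer compilers may warn about it.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Read every line of a text file into wide strings, decoding the bytes
// as UTF-8. (Hypothetical helper; assumes the file really is UTF-8.)
std::vector<std::wstring> read_utf8_lines(const std::string& path) {
    std::wifstream fin;
    // The locale takes ownership of the facet, so the raw `new` is not a leak.
    fin.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    fin.open(path);

    std::vector<std::wstring> lines;
    std::wstring line;
    // If the open failed, the loop body never runs and the result is empty.
    while (std::getline(fin, line))
        lines.push_back(line);
    return lines;
}
```

In the original program, data could then be filled with read_utf8_lines(file) instead of the hand-written loop; it would still be worth checking that the result is non-empty before picking a random index.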
Upvotes: 0