Reputation: 99
I am able to successfully read in UTF8 character text files by redirecting input and output on the terminal and then using wcin and wcout
_setmode(_fileno(stdout), _O_U8TEXT);
_setmode(_fileno(stdin), _O_U8TEXT);
Now I'd like to be able to read in UTF8 text using filestreams, but I don't know how to set the mode of the filestreams so that it could read in these characters like I did with stdin and stdout. I've tried using wifstreams/wofstreams and those still read and write garbage, by themselves.
Upvotes: 1
Views: 590
Reputation: 20396
C++'s <iostreams>
library doesn't have built-in support for conversions from one text encoding to another. If you need your input text converted from utf-8 into another format (say, for example, the underlying codepoints of the encoding), you'll need to write that conversion manually.
std::string data;
std::ifstream in("utf8.txt");
in.seekg(0, std::ios::end);
auto size = in.tellg();
in.seekg(0, std::ios::beg);
data.resize(size);
in.read(data.data(), size);
//data now contains the entire contents of the file
uint32_t partial_codepoint = 0;
unsigned num_of_bytes = 0;
std::vector<uint32_t> codepoints;
for(char c : data) {
uint8_t byte = uint8_t(c);
if(byte < 128) {
//Character is just a basic ascii character, so we'll just set that as the codepoint value
codepoints.push_back(byte);
if(num_of_bytes > 0) {
//Data was malformed: error handling?
//Codepoint abruptly ended
}
} else {
//Character is part of multi-byte encoding
if(partial_codepoint) {
//We've already begun storing the codepoint
if((byte >> 6) != 0b10) {
//Data was malformed: error handling?
//Codepoint abruptly ended
}
partial_codepoint = (partial_codepoint << 6) | (0b0011'1111 & byte);
num_of_bytes--;
if(num_of_bytes == 0) {
codepoints.emplace_back(partial_codepoint);
partial_codepoint = 0;
}
} else {
//Beginning of new codepoint
if((byte >> 6) == 0b10) {
//Data was malformed: error handling?
//Codepoint did not have proper beginning
}
while(byte & 0b1000'0000) {
num_of_bytes++;
byte = byte << 1;
}
partial_codepoint = byte >> num_of_bytes;
}
}
}
This code will reliably convert from [correctly-encoded] utf-8 to utf-32, which is usually the easiest form to convert directly into glyphs + characters—though remember that codepoints are not characters.
To keep things consistent in your code, my recommendation is that utf-8 encoded text be stored in your program using std::string
, and utf-32 encoded text be stored as std::vector<uint32_t>
.
Upvotes: 2