Joshua Segal

Reputation: 99

How to make a filestream read in UTF-8 C++

I can successfully read UTF-8 text files by redirecting input and output on the terminal and then using wcin and wcout, after setting these modes:

_setmode(_fileno(stdout), _O_U8TEXT);
_setmode(_fileno(stdin), _O_U8TEXT);

Now I'd like to read UTF-8 text using file streams, but I don't know how to set the mode of a file stream so that it reads these characters the way stdin and stdout do. I've tried wifstream/wofstream on their own, and they still read and write garbage.

Upvotes: 1

Views: 590

Answers (1)

Xirema

Reputation: 20396

The C++ iostream library doesn't offer a simple built-in switch for converting from one text encoding to another (the standard codecvt facets for UTF-8 were deprecated in C++17). If you need your input text converted from UTF-8 into another format (say, for example, the underlying codepoints of the encoding), the straightforward route is to write that conversion manually.

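#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

std::string data;
//Open in binary mode so no newline translation occurs and tellg() reports the true byte count
std::ifstream in("utf8.txt", std::ios::binary);
//Error handling: check that the file opened (if(!in) ...) before relying on tellg()
in.seekg(0, std::ios::end);
auto size = in.tellg();
in.seekg(0, std::ios::beg);
data.resize(static_cast<std::size_t>(size));
in.read(data.data(), size); //Non-const data() requires C++17; use &data[0] before that
//data now contains the entire contents of the file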

uint32_t partial_codepoint = 0;
unsigned num_of_bytes = 0; //Continuation bytes still expected for the current codepoint
std::vector<uint32_t> codepoints;
for(char c : data) {
    uint8_t byte = uint8_t(c);
    if(byte < 128) {
        //Character is just a basic ascii character, so we'll just set that as the codepoint value
        if(num_of_bytes > 0) {
            //Data was malformed: error handling?
            //Codepoint abruptly ended, so discard the partial state
            num_of_bytes = 0;
            partial_codepoint = 0;
        }
        codepoints.push_back(byte);
    } else if(num_of_bytes > 0) {
        //We've already begun storing the codepoint, so this should be a continuation byte
        //(Test num_of_bytes, not partial_codepoint: a lead byte such as 0xE0 or 0xF0 can carry all-zero payload bits)
        if((byte >> 6) != 0b10) {
            //Data was malformed: error handling?
            //Codepoint abruptly ended
        }
        partial_codepoint = (partial_codepoint << 6) | (0b0011'1111 & byte);
        num_of_bytes--;
        if(num_of_bytes == 0) {
            codepoints.emplace_back(partial_codepoint);
            partial_codepoint = 0;
        }
    } else {
        //Beginning of new codepoint
        if((byte >> 6) == 0b10) {
            //Data was malformed: error handling?
            //Codepoint did not have proper beginning
        }
        //Count the leading 1 bits: this is the total length of the sequence, lead byte included
        unsigned leading_ones = 0;
        while(byte & 0b1000'0000) {
            leading_ones++;
            byte = byte << 1;
        }
        partial_codepoint = byte >> leading_ones;
        //The lead byte has been consumed, so one fewer byte remains
        num_of_bytes = leading_ones - 1;
    }
}

For correctly encoded input, this code converts UTF-8 to UTF-32, which is usually the easiest form to map directly to glyphs and characters. Remember, though, that codepoints are not characters.
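
As a rough illustration of how this kind of decoding could be packaged for reuse, here is a minimal sketch. The name decode_utf8 is hypothetical (it is not a standard library function). The sketch assumes you want an exception on structurally malformed input, and it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into code points.
// Throws std::runtime_error on structurally malformed input.
// Assumption: overlong forms and surrogate values are NOT rejected here.
std::vector<std::uint32_t> decode_utf8(const std::string& data) {
    std::vector<std::uint32_t> codepoints;
    std::uint32_t partial = 0;
    unsigned remaining = 0; // continuation bytes still expected
    for (char c : data) {
        const std::uint8_t byte = static_cast<std::uint8_t>(c);
        if (remaining > 0) {
            // Inside a multi-byte sequence: must be 10xxxxxx
            if ((byte >> 6) != 0b10) {
                throw std::runtime_error("truncated UTF-8 sequence");
            }
            partial = (partial << 6) | (byte & 0x3Fu);
            if (--remaining == 0) {
                codepoints.push_back(partial);
            }
        } else if (byte < 0x80) {
            codepoints.push_back(byte);   // 0xxxxxxx: ASCII
        } else if ((byte >> 5) == 0b110) {
            partial = byte & 0x1Fu;       // 110xxxxx: 2-byte sequence
            remaining = 1;
        } else if ((byte >> 4) == 0b1110) {
            partial = byte & 0x0Fu;       // 1110xxxx: 3-byte sequence
            remaining = 2;
        } else if ((byte >> 3) == 0b11110) {
            partial = byte & 0x07u;       // 11110xxx: 4-byte sequence
            remaining = 3;
        } else {
            throw std::runtime_error("invalid UTF-8 lead byte");
        }
    }
    if (remaining > 0) {
        throw std::runtime_error("input ended mid-sequence");
    }
    return codepoints;
}
```

Calling decode_utf8("e\xCC\x81") (the letter e followed by a combining acute accent) returns two codepoints, 0x65 and 0x301, even though the text renders as a single character. That is the sense in which codepoints are not characters.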

To keep things consistent in your code, I recommend storing UTF-8 encoded text in std::string and UTF-32 encoded text in std::vector<uint32_t>.
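
If you follow that convention and later need to write text back out, the opposite conversion is needed. Below is a minimal sketch of one way it might look. The name encode_utf8 is hypothetical, and throwing on surrogates or out-of-range values is an assumption of this sketch rather than a requirement.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: encodes UTF-32 code points as UTF-8 bytes.
// Assumption: invalid code points cause an exception rather than a
// replacement character.
std::string encode_utf8(const std::vector<std::uint32_t>& codepoints) {
    std::string out;
    for (std::uint32_t cp : codepoints) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                throw std::invalid_argument("surrogate code point");
            }
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::invalid_argument("code point beyond U+10FFFF");
        }
    }
    return out;
}
```

The resulting std::string can be written with a std::ofstream opened in std::ios::binary mode, so that the bytes reach the file unchanged.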

Upvotes: 2
