Aamir

Reputation: 15576

Reading a UTF-8 Unicode file through non-unicode code

I have to read a UTF-8 encoded Unicode text file and write its data to another text file. Each line of the file contains tab-separated data.

My reading code is C++ code without Unicode support. It reads the file line by line into a string/char* and writes that string as-is to the destination file. I can't change the code, so code-change suggestions are not welcome.

What I want to know is: while reading line by line, can I encounter a null terminator ('\0') within a line, given that the file is Unicode and one character can span multiple bytes?

My thinking is that a null terminator could quite possibly appear within a line. Your thoughts?

Upvotes: 5

Views: 1086

Answers (2)

CsTamas

Reputation: 4153

UTF-8 uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters. The upper bits of each byte are reserved as control bits. For code points using more than 1 byte, the high bit is set in every byte of the sequence, so none of those bytes can be 0.

Thus there will be no 0 byte in your UTF-8 file, unless the text itself contains the NUL character (U+0000).
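To make the bit layout concrete, here is a minimal sketch of a UTF-8 encoder. The helper names encode_utf8 and contains_zero_byte are made up for this illustration and are not part of the asker's code or of any library. It only demonstrates that a zero byte comes out for U+0000 and for nothing else.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}

// Hypothetical helper: true if any byte of s is zero.
bool contains_zero_byte(const std::string& s) {
    return s.find('\0') != std::string::npos;
}
```

For example, U+0100 would contain a zero byte in UTF-16 (01 00), but encode_utf8(0x100) yields the two bytes C4 80, so contains_zero_byte returns false for it. Only encode_utf8(0) produces a zero byte.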

See the Wikipedia article on UTF-8 for details.

Upvotes: 13

Maurice Perry

Reputation: 32831

Very unlikely: all the bytes in a UTF-8 multi-byte sequence have the high bit set to 1.
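As a rough illustration of the high-bit rule, the sketch below classifies a single byte by its leading bits. The enum Utf8ByteKind and the function classify_utf8_byte are hypothetical names introduced here. The point is that a byte of 0x00 can only ever be classified as a single-byte (ASCII) character, never as part of a multi-byte sequence.

```cpp
// Hypothetical classification of one byte in a UTF-8 stream.
enum class Utf8ByteKind { Ascii, Continuation, Lead, Invalid };

Utf8ByteKind classify_utf8_byte(unsigned char b) {
    if (b < 0x80) return Utf8ByteKind::Ascii;         // 0xxxxxxx: one-byte character
    if (b < 0xC0) return Utf8ByteKind::Continuation;  // 10xxxxxx: trailing byte
    if (b == 0xC0 || b == 0xC1 || b >= 0xF5) {
        return Utf8ByteKind::Invalid;                 // never valid in well-formed UTF-8
    }
    return Utf8ByteKind::Lead;                        // 110xxxxx, 1110xxxx, 11110xxx
}
```

Every Lead and Continuation byte is at least 0x80, so code that stops at '\0' will not be cut short in the middle of a multi-byte character.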

Upvotes: 1
