Noel Widmer

Reputation: 4572

C++: Correctly read files whose Unicode characters might be larger than a byte

I've spent many hours now reading about Unicode, its encodings, and many related topics.
The reason behind my research is that I am trying to read the contents of a file and parse it character by character.

Correct me if I am wrong please:

I have a C# background, where we use C#'s char (16-bit) for strings.
The values of these chars map directly to Unicode values.
A char whose value is 5 is equal to the Unicode character located at U+0005.

What I don't understand is how to read a file in C++ that contains characters whose values might be larger than a byte. I don't feel comfortable using getc() when I can only read characters whose values are limited to a byte.

I might be missing an important point on how to correctly read files with C++.
Any insights are very much appreciated.

I am running Windows 10 x64 and using VC++.
But I'd prefer to keep this question platform-independent if that is possible.


EDIT

I'd like to emphasize a Stack Overflow post linked in the comments by Klitos Kyriacou:
How well is Unicode supported in C++11?

It's a quick dive into how poorly Unicode is supported in C++.
For more details you should read/watch the resources provided in the accepted answer.

Upvotes: 2

Views: 2156

Answers (3)

zett42

Reputation: 27766

The equivalent of a 16-bit "character" that is compatible with the Windows API would be wchar_t. Be aware, though, that wchar_t might be 32-bit on some platforms, so use char16_t if you want to store a UTF-16-encoded string in a platform-independent way.

If you use char16_t on the Windows platform, you have to do some casts, though, when passing strings to the OS API; see the sketch below.
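For instance, a minimal sketch of the kind of cast meant here; on Windows both wchar_t and char16_t are 16 bits wide, so a reinterpret_cast is the usual (if inelegant) bridge:

#include <windows.h>
#include <string>

int main()
{
    std::u16string str = u"Hello from char16_t";

    // wchar_t and char16_t have the same size and representation on Windows,
    // so casting the pointer works in practice, if not prettily.
    ::MessageBoxW( nullptr,
                   reinterpret_cast<const wchar_t*>( str.c_str() ),
                   L"test", 0 );

    return 0;
}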

The equivalent string types are:

  • std::wstring (wchar_t)
  • std::u16string (char16_t)

File stream types:

  • std::wifstream (a typedef for std::basic_ifstream<wchar_t>)
  • std::basic_ifstream<char16_t>
  • std::wofstream (a typedef for std::basic_ofstream<wchar_t>)
  • std::basic_ofstream<char16_t>

Example of reading a UTF-8-encoded file into a UTF-16 string:

#include <windows.h>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

int main()
{   
    std::wifstream file( L"test_utf8.txt" );

    // Apply a locale to read UTF-8 file, skip the BOM if present and convert to UTF-16.
    file.imbue( std::locale( file.getloc(),
        new std::codecvt_utf8_utf16<wchar_t, 0x10ffff, std::consume_header> ) );

    std::wstring str;
    std::getline( file, str );

    ::MessageBox( 0, str.data(), L"test", 0 );

    return 0;
}

How to read a UTF-16-encoded file into a 16-bit std::wstring or std::u16string?

Apparently this isn't so easy. There is std::codecvt_utf16, but when used with a 16-bit wchar_t character type it produces UCS-2, which is only a subset of UTF-16, so surrogate pairs won't be read correctly. See the cppreference example.

I don't know how the C++ ISO committee came to this decision, because it's completely useless in practice. At least they should have provided a flag so we could choose whether we want to restrict ourselves to UCS-2 or read the full UTF-16 range.

Maybe there is a cleaner solution, but right now I'm not aware of it. The best fallback I can offer is sketched below.
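The idea is to bypass codecvt entirely: open the file in binary mode and pair the bytes up into char16_t units by hand. A minimal sketch, assuming a little-endian (UTF-16LE) file; the helper function is my own. Surrogate pairs survive because the units are never interpreted, only assembled:

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

std::u16string readUtf16Le( const char* path )
{
    std::ifstream file( path, std::ios::binary );
    std::string bytes( ( std::istreambuf_iterator<char>( file ) ),
                       std::istreambuf_iterator<char>() );

    std::size_t i = 0;

    // Skip a UTF-16LE BOM (FF FE) if present.
    if( bytes.size() >= 2 &&
        static_cast<unsigned char>( bytes[0] ) == 0xFF &&
        static_cast<unsigned char>( bytes[1] ) == 0xFE )
        i = 2;

    // Assemble each pair of bytes into one 16-bit code unit (low byte first).
    std::u16string result;
    for( ; i + 1 < bytes.size(); i += 2 )
    {
        const unsigned char lo = bytes[i];
        const unsigned char hi = bytes[i + 1];
        result.push_back( static_cast<char16_t>( lo | ( hi << 8 ) ) );
    }
    return result;
}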

Upvotes: 4

Malcolm McLean

Reputation: 6404

The situation is that C's getc() was written in the 1970s. To all intents and purposes, it means "read an octet", not "read a character". Virtually all binary data is built on octets.

Unicode allows characters beyond the range an octet can represent. So, naively, the Unicode people proposed a standard for 16-bit characters. Microsoft then incorporated the proposal early on and added wide characters (wchar_t and so on) to Windows. One problem was that 16 bits are not enough to represent every glyph in every human language of any status; another was the endianness of the binary files. So the Unicode people had to add a 32-bit Unicode standard, and then they had to incorporate a little endianness and format tag (the byte order mark) at the start of Unicode files. Finally, the 16-bit Unicode glyphs didn't quite match Microsoft's wchar_t glyphs.

So the result was a mess. It is quite difficult to read and display 16- or 32-bit Unicode files with complete accuracy and portability. Also, a great many programs were still using 8-bit ASCII.

Fortunately, UTF-8 was invented. UTF-8 is backwards compatible with 7-bit ASCII. If the top bit is set, then the glyph is encoded by more than one byte, and there's a scheme that tells you how many (sketched below). The NUL byte never appears except as an end-of-string indicator. So most programs will process UTF-8 correctly, unless they try to split strings or otherwise try to treat them as English.
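To make that scheme concrete, here is a small sketch (the function name is my own) that classifies a lead byte:

#include <cstddef>

// Number of bytes a UTF-8 sequence occupies, judged from its first byte
// alone. 0 means the byte is a continuation byte and cannot start a sequence.
std::size_t utf8SequenceLength( unsigned char lead )
{
    if( lead < 0x80 )             return 1;  // 0xxxxxxx: plain 7-bit ASCII
    if( ( lead & 0xE0 ) == 0xC0 ) return 2;  // 110xxxxx + one trailing byte
    if( ( lead & 0xF0 ) == 0xE0 ) return 3;  // 1110xxxx + two trailing bytes
    if( ( lead & 0xF8 ) == 0xF0 ) return 4;  // 11110xxx + three trailing bytes
    return 0;                                // 10xxxxxx: continuation byte
}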

UTF-8 has the penalty that random access to chars isn't possible, because of the variable length rule. But that's a minor disadvantage. Generally UTF-8 is the way to go for saving Unicode text and passing it about in programs, and you should only break it out into Unicode code points when you actually need the glyphs, e.g. for display purposes.

Upvotes: 4

Trevor Hickey

Reputation: 37834

I'd recommend watching Unicode in C++ by James McNellis.
That will help explain what facilities C++ has and does not have when dealing with Unicode.
You will see that C++ lacks good support for easily working with UTF-8.

Since it sounds like you want to iterate over each glyph (not just code points),
I'd recommend using a third-party library to handle the intricacies.
utfcpp has worked well for me; see the sketch below.
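For example, a minimal sketch of the code-point iteration utfcpp gives you (utf8::next decodes one code point and advances the iterator; the sample string is arbitrary):

#include <cstdint>
#include <iostream>
#include <string>
#include "utf8.h"  // utfcpp, header-only

int main()
{
    // "café" plus an emoji, as UTF-8 bytes
    // (u8 literals are plain char arrays up to C++17).
    std::string text = u8"caf\u00E9 \U0001F600";

    auto it = text.begin();
    while( it != text.end() )
    {
        // Decode one code point and advance past its byte sequence.
        const std::uint32_t cp = utf8::next( it, text.end() );
        std::cout << "U+" << std::hex << std::uppercase << cp << '\n';
    }
    return 0;
}

Note that this iterates code points; grapheme clusters (what a user perceives as one glyph) need additional machinery beyond this, but for most parsing tasks code points are the right unit.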

Upvotes: 0
