Reputation: 33813
I intend to create a library that deals with strings, and the first thing that came to my mind is supporting all languages, among them Asian languages like Chinese and Japanese, and right-to-left languages like Arabic, Persian, and so on.
So, I want to know whether "UTF-8", which is represented by the data types char* and std::string, is enough to support all languages for reading and writing, or whether I should use "UTF-16", which is represented by the data types wchar_t* and std::wstring?
In short, which data type should I use for this task: one of these two, or something else?
Upvotes: 3
Views: 1557
Reputation: 299345
There are a few confusions in your question, so I'll start with the answer you're probably looking for, and move out from there:
You should encode in UTF-8 unless you have a very good reason not to encode in UTF-8. There are several good reasons, but none of them have to do with what languages are supported.
UTF-8 and UTF-16 are just different ways to encode Unicode. You can also encode Unicode in UTF-32. You can even encode Unicode in GB18030, or one of several other encodings. As long as the encoding can handle all Unicode code points, then it will cover the same number of languages, glyphs, scripts, characters, etc. (Nailing down precisely what is meant by a Unicode code point is itself a subtle topic that I don't want to get into here, but for these purposes, let's think of it as a "character.")
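To make that concrete, here is a minimal sketch in standard C++ (no external libraries) showing the same code point, U+4E2D, stored in three encodings; only the number of code units changes, not what can be represented:

```cpp
#include <iostream>
#include <string>

int main() {
    // The same code point, U+4E2D ("中"), in three Unicode encodings.
    std::string    utf8  = "\xE4\xB8\xAD"; // spelled as its three UTF-8 bytes
    std::u16string utf16 = u"\u4E2D";      // one UTF-16 code unit
    std::u32string utf32 = U"\u4E2D";      // one UTF-32 code unit

    std::cout << "UTF-8 bytes:       " << utf8.size()  << '\n'; // 3
    std::cout << "UTF-16 code units: " << utf16.size() << '\n'; // 1
    std::cout << "UTF-32 code units: " << utf32.size() << '\n'; // 1
}
```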
You should generally use UTF-8 because it's extremely efficient if you're working in Latin-based scripts, and it's the most commonly supported encoding in that ecosystem. That said, for some problems, UTF-16 or UTF-32 can be more efficient. But without a specific reason, you should use UTF-8.
The data types char* and std::string do not represent UTF-8. They represent a sequence of char. That's all they represent. That sequence of char can be interpreted in many ways. It is fairly common to interpret it as UTF-8, but I wouldn't even say that's the most common interpretation (many systems treat it as extended ASCII, which is why non-English text often gets garbled as it moves between systems).
If you want to work in UTF-8, you often have to do more than use std::string. You need a UTF-8 handling library, most commonly std::locale for simple usage or ICU for more complex problems. UTF-8 characters can be between 1 and 4 char long, so you have to be very thoughtful when applying character processing. The most common mistake is forgetting that UTF-8 does not support random access. You can't just jump to the 32nd letter in a string; you have to process it from the start to find all the character breaks. If you start processing a UTF-8 string at a random point, you may jump into the middle of a character.
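As a rough sketch of why that is (assuming the input is already valid UTF-8): code point boundaries can only be found by inspecting the bytes, since every continuation byte has the form 10xxxxxx:

```cpp
#include <cstddef>
#include <string>

// Sketch only: counts code points in a string assumed to hold valid UTF-8.
// Every byte of the form 10xxxxxx is a continuation byte, so code points
// are counted by skipping those. A real library (std::locale facets, ICU)
// handles validation, grapheme clusters, and much more.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) // not a continuation byte => a code point starts here
            ++count;
    }
    return count;
}
```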
Through combining characters, UTF-8 encodings can become (in many systems) arbitrarily long. The visually single "character" 👩‍👩‍👧‍👦 is encoded as a sequence of 25 char values in UTF-8. (Of course it's encoded as 11 wchar_t values in UTF-16. No Unicode encoding saves you from having to think about combining characters.)
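A sketch of those counts, spelling the emoji as escape sequences (the narrow literal assumes your compiler encodes ordinary string literals as UTF-8, which is the GCC/Clang default and requires /utf-8 on MSVC):

```cpp
#include <iostream>
#include <string>

int main() {
    // The family emoji is the ZWJ sequence WOMAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY.
    // Note: the narrow literal assumes the compiler's ordinary literal encoding
    // is UTF-8 (GCC/Clang default; /utf-8 on MSVC).
    std::string    u8str  =  "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466";
    std::u16string u16str = u"\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466";

    std::cout << u8str.size()  << '\n'; // 25 bytes in UTF-8
    std::cout << u16str.size() << '\n'; // 11 code units in UTF-16
}
```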
On the other hand, UTF-8 is so powerful because you can often ignore it for certain problems. The character A encodes in UTF-8 exactly as it does in ASCII (65), and UTF-8 promises that there will be no bytes in a sequence that are 65 and aren't A. So searching for specific ASCII sequences requires no special processing (the way it does in UTF-16).
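A small sketch of that property: a plain byte-wise search for ASCII text works directly on UTF-8 data, because the bytes of any multi-byte sequence are all outside the ASCII range:

```cpp
#include <iostream>
#include <string>

int main() {
    // "€" spelled out as its three UTF-8 bytes, so no assumption is made
    // about the compiler's literal encoding.
    std::string text = "price: 42" "\xE2\x82\xAC";

    // Plain byte-wise search: safe, because 0xE2 0x82 0xAC are all >= 0x80
    // and can never be mistaken for the ASCII bytes '4' or '2'.
    std::cout << text.find("42") << '\n'; // 7
}
```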
As NathanOliver points out, using any Unicode encoding will only support the languages, glyphs, scripts, characters, etc. that Unicode supports. As a practical matter, that is the vast majority of the commonly used languages in the world. It is not every language (and it has failings in how it handles some languages that it does support), but it's by far the most comprehensive system we have today.
Upvotes: 3
Reputation: 180650
No, UTF-8 is not enough to support all languages (yet). From the Unicode list of As Yet Unsupported Scripts, for example,
- Loma
- Naxi Dongba (Moso)
are currently not supported.
Upvotes: 0