Reputation: 635
I find it a bit difficult to fully grasp the use of u8 strings. I know they are UTF-8-encoded strings, but the results of my tests seem to point in another direction. I'm using gcc 7.5 on Linux. This is my test code:
#include <stdio.h>
#include <string.h>

int main()
{
    char a[] = u8"gå";
    int l = strlen(a);
    for (int i = 0; i < l; i++)
        printf("%c - %d - %zu\n", a[i], (unsigned char)a[i], sizeof(a[i]));
    printf("%d: %s\n", l, a);
    return 0;
}
After running it, I get this:
g - 103 - 1
� - 195 - 1
� - 165 - 1
3: gå
Which makes sense: it's using 2 bytes to encode the å and 1 byte to encode the g, 3 bytes in total.
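Those two byte values match the UTF-8 encoding of U+00E5 (å), which is 0xC3 0xA5, i.e. 195 165. As a quick cross-check (a minimal sketch of my own, not part of the original test), this prints the same bytes from an explicit escape sequence:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "\xc3\xa5" is the UTF-8 byte sequence for U+00E5 (å). */
    const char aring[] = "\xc3\xa5";
    for (size_t i = 0; i < strlen(aring); i++)
        printf("byte %zu: %u (0x%02x)\n",
               i, (unsigned char)aring[i], (unsigned char)aring[i]);
    return 0;
}

This should print 195 (0xc3) and 165 (0xa5), the same values as in the output above.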
Then I remove the u8 prefix, and I get the same result. I might think that gcc is actually using UTF-8 to encode strings by default, which the standard permits. So far, so good.
But now I try something else: I restore the u8 prefix and change the encoding of the source file to ISO-8859. And I get this:
g - 103 - 1
� - 229 - 1
2: g�
Not only has the encoding changed (it shouldn't have, as it's a u8 string), but the string also prints incorrectly: 229 is 0xE5, the ISO-8859-1 code for å, so the byte from the file seems to be copied through unchanged. If I remove the prefix again, I get this last result once more.
It's acting as if the u8 prefix is ignored and the encoding is decided by the source file's text encoding.
So my 2 questions here are:

1. Is gcc really using UTF-8 by default to encode my strings, or is the encoding simply taken from the source file?
2. Is the u8 prefix doing anything?

Upvotes: 2
Views: 693
Reputation: 26066
u8 only ensures that the string in your binary is UTF-8 encoded, regardless of the execution character set. It is a no-op if you target UTF-8.

Problems arise when the source character set you told the compiler to use does not match the actual encoding of the file. If those do match, and the string was properly re-encoded when saving the file, and you use u8, then in both cases you should not see any difference in the output. If you do not use u8, then the result depends on the execution character set.
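For completeness, here is a minimal sketch of how you could check this with GCC. The -finput-charset and -fexec-charset options are documented GCC flags; the file name and exact invocations below are only illustrative:

/*
 * charset_check.c -- print the bytes that actually end up in the binary.
 *
 * Example invocations (adjust to match how the file is really saved):
 *   gcc charset_check.c                              # source assumed to be UTF-8
 *   gcc -finput-charset=ISO-8859-1 charset_check.c   # source saved as Latin-1
 *   gcc -fexec-charset=ISO-8859-1 charset_check.c    # non-UTF-8 execution charset
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char plain[] = "gå";    /* encoded in the execution character set */
    char forced[] = u8"gå"; /* always UTF-8, provided the compiler knows the source encoding */

    printf("plain:");
    for (size_t i = 0; i < strlen(plain); i++)
        printf(" %u", (unsigned char)plain[i]);
    printf("\nu8   :");
    for (size_t i = 0; i < strlen(forced); i++)
        printf(" %u", (unsigned char)forced[i]);
    printf("\n");
    return 0;
}

With the right -finput-charset for the way the file is saved, the u8 line should always print 103 195 165, while the plain line follows whatever -fexec-charset specifies.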
Upvotes: 1