Reputation: 325
I have a C program that looks like this:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main(void){
setlocale(LC_ALL, "en_US.utf8");
printf("%ls",(const wchar_t*)L"\u20AC\n");
}
The compiler's assembly output is this:
.file "ok.c"
.text
.section .rodata
.LC0:
.string "en_US.utf8"
.align 4
.LC1:
.string "\254 "
.string ""
.string "\n"
.string ""
.string ""
.string ""
.string ""
.string ""
.string ""
.LC2:
.string "%ls"
.text
.globl main
.type main, @function
The UTF-8 octal code for my input, the € (EUR symbol), is '\342\202\254'. Why does only '\254' show up, and why is everything else whitespace or empty strings (apart from the newline)? Without the L prefix nothing is printed either, and the asm output is something like `.string "\342\202\254"`.
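For reference, here is a minimal check of what the narrow literal contains (just a sketch, assuming GCC with its default UTF-8 execution character set):

#include <stdio.h>
int main(void){
    /* In a UTF-8 execution charset "\u20AC" compiles to the three bytes
       0xE2 0x82 0xAC, i.e. octal \342\202\254, followed by '\n'. */
    const char *s = "\u20AC\n";
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        printf("\\%03o", *p);
    printf("\n");
}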
Upvotes: 0
Views: 1133
Reputation: 58762
L""
and wchar_t
is not utf8 in your environment, it looks like utf32. So due to endianness I expect your 4 byte wchar_t
values to be:
0xAC, 0x20, 0x00, 0x00 ; this is your \u20AC
0x0A, 0x00, 0x00, 0x00 ; this is the \n
0x00, 0x00, 0x00, 0x00 ; this is the end of string
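A quick way to confirm that layout is to dump the raw bytes of the wide literal. This is only a sketch, assuming a typical glibc/x86-64 setup where wchar_t is 4 bytes and little-endian:

#include <stdio.h>
#include <wchar.h>
int main(void){
    const wchar_t *w = L"\u20AC\n";
    /* Dump every byte of the array, including the terminating wide NUL. */
    const unsigned char *p = (const unsigned char *)w;
    size_t n = (wcslen(w) + 1) * sizeof(wchar_t); /* 3 * 4 = 12 bytes here */
    for (size_t i = 0; i < n; i++)
        printf("%02X ", p[i]);
    printf("\n"); /* expected: AC 20 00 00 0A 00 00 00 00 00 00 00 */
}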
The compiler used the fact that 0x20 is a space in ASCII and that .string automatically appends a terminating zero byte, so:
.string "\254 " ; 0xAC, 0x20, 0x00
.string "" ; 0x00, so now you have your \u20AC
.string "\n" ; 0x0A, 0x00
.string "" ; 0x00
.string "" ; 0x00, so now you have the \n
.string "" ; 0x00
.string "" ; 0x00
.string "" ; 0x00
.string "" ; 0x00, so now you have the terminating zero
Upvotes: 1