Pol

Reputation: 325

UTF8 encoding of multi-byte characters in C/Assembly

I have a C program that looks like this:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main(void){
    setlocale(LC_ALL,"en_US.utf8");
    printf("%ls",(const wchar_t*)L"\u20AC\n");
}

The assembly generated by the compiler is this:

    .file   "ok.c"
    .text
    .section    .rodata
.LC0:
    .string "en_US.utf8"
    .align 4
.LC1:
    .string "\254 "
    .string ""
    .string "\n"
    .string ""
    .string ""
    .string ""
    .string ""
    .string ""
    .string ""
.LC2:
    .string "%ls"
    .text
    .globl  main
    .type   main, @function

The UTF-8 encoding of my input, the € (euro sign), is '\342\202\254' in octal. Why does only '\254' show up, and why is the rest whitespace (apart from the newline)? Without the L prefix nothing is printed either, and the asm output is something like `.string "\342\202\254"`.

Upvotes: 0

Views: 1133

Answers (1)

Jester

Reputation: 58762

L"" literals and wchar_t are not UTF-8 in your environment; they look like UTF-32. Because the byte order is little-endian, I expect your 4-byte wchar_t values to be:

0xAC, 0x20, 0x00, 0x00  ; this is your \u20AC
0x0A, 0x00, 0x00, 0x00  ; this is the \n
0x00, 0x00, 0x00, 0x00  ; this is the end of string

The compiler takes advantage of the fact that 0x20 is a space in ASCII and that .string appends a zero byte automatically, so:

.string "\254 "  ; 0xAC, 0x20, 0x00
.string ""       ; 0x00, so now you have your \u20AC
.string "\n"     ; 0x0A, 0x00
.string ""       ; 0x00
.string ""       ; 0x00, so now you have the \n
.string ""       ; 0x00
.string ""       ; 0x00
.string ""       ; 0x00
.string ""       ; 0x00, so now you have the terminating zero

Upvotes: 1
