In a Linux program I am developing in C with ncurses, I need to read stdin as UTF-8. However, whenever I do:
wint_t unicode_char=0;
get_wch(&unicode_char);
I get the wide character in UTF-16 encoding (I can see it when I dump the variable with gdb). I do not want to convert it from UTF-16 to UTF-8. I want to force the input to be UTF-8 all the time, no matter which Linux distribution runs my program or which language the user has configured. How is this done? Is it possible?
EDIT: Here is the example source and proof that internally get_wch uses UTF-16 (which is the same as UTF-32) and not UTF-8, even though I configured a UTF-8 input source with setlocale().
[niko@dev1 ncurses]$ gcc -g -o getch -std=c99 $(ncursesw5-config --cflags --libs) getch.c
[niko@dev1 ncurses]$ cat getch.c
#define _GNU_SOURCE
#include <locale.h>
#include <ncursesw/ncurses.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int ct;
wint_t unichar;

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, ""); /* make sure UTF8 */
    initscr();
    raw();
    keypad(stdscr, TRUE);
    ct = get_wch(&unichar); /* read character */
    mvprintw(24, 0, "Key pressed is = %4x ", unichar);
    refresh();
    getch();
    endwin();
    return 0;
}
Testing code with GDB:
🔎
Breakpoint 1, main (argc=1, argv=0x7fffffffded8) at getch.c:18
18 mvprintw(24, 0, "Key pressed is = %4x ", unichar);
Missing separate debuginfos, use: dnf debuginfo-install ncurses-libs-5.9-21.20150214.fc23.x86_64
(gdb) print unichar
$1 = 128270
(gdb) print/x ((unsigned short*) (&unichar))[0]
$2 = 0xf50e
(gdb) print/x ((unsigned short*) (&unichar))[1]
$3 = 0x1
(gdb) print/x ((unsigned char*) (&unichar))[0]
$4 = 0xe
(gdb) print/x ((unsigned char*) (&unichar))[1]
$5 = 0xf5
(gdb) print/x ((unsigned char*) (&unichar))[2]
$6 = 0x1
(gdb) print/x ((unsigned char*) (&unichar))[3]
$7 = 0x0
(gdb)
The input character is 🔎, and its UTF-8 encoding should be 'f09f948e', as stated here: http://www.fileformat.info/info/unicode/char/1f50e/index.htm
How do I get UTF-8 directly from get_wch()? Or is there another function I should use?
P.S. If you test the source code, link against '-lncursesw', not '-lncurses', or compile with the same command I used above.
Reputation: 54465
Short: you don't get UTF-8 from get_wch. That returns a wint_t (and a status code).
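Since get_wch hands back a single wide-character value rather than bytes, producing UTF-8 from it is a separate encoding step. The following is a minimal sketch of that step, assuming the wint_t value is a Unicode code point (true for glibc, where the gdb dump in the question shows 128270 = 0x1F50E, but not guaranteed by the C standard). The name codepoint_to_utf8 is hypothetical and not part of ncurses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t codepoint_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Passing the value from the question, 0x1F50E, yields the four bytes f0 9f 94 8e. In a program that has already called setlocale() with a UTF-8 locale, the standard wcrtomb() function should perform the same conversion, with the result depending on the active locale.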
Long: you would get UTF-8 from ncurses getch because it converts to/from wchar_t internally:
- getch only returns bytes (possibly combined with video attributes).
- ncurses stores wchar_t values in the cells of each window structure.
- addch and friends attempt to collect bytes for multibyte encodings (it's not specific to UTF-8, but not much used aside from this).
For what it's worth, dialog reads UTF-8 using getch. See inputstr.c for how it works in practice.
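Collecting the bytes that getch returns into one UTF-8 character could look roughly like the sketch below. This is not the code from dialog's inputstr.c. All names here (next_byte_fn, byte_feed, feed_next, read_utf8_char) are hypothetical, and an in-memory feed stands in for the terminal so that the sketch compiles without curses. It checks only the lead byte and the continuation-byte pattern, not overlong or out-of-range sequences.

```c
#include <stddef.h>

/* Hypothetical callback type: returns the next input value, or a
 * negative number when nothing is available (curses ERR is -1). */
typedef int (*next_byte_fn)(void *ctx);

/* Stand-in input source for demonstration: replays bytes from memory. */
struct byte_feed {
    const unsigned char *data;
    size_t pos;
    size_t len;
};

int feed_next(void *ctx)
{
    struct byte_feed *f = ctx;
    return f->pos < f->len ? f->data[f->pos++] : -1;
}

/* Length of the UTF-8 sequence announced by a lead byte, 0 if the byte
 * cannot start a sequence (for example a stray continuation byte). */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Read one UTF-8 encoded character into out (NUL-terminated).
 * Returns the byte count, or 0 if the input was not a plain byte
 * (function key, no input) or the sequence was malformed. */
size_t read_utf8_char(next_byte_fn next, void *ctx, char out[5])
{
    int c = next(ctx);
    if (c < 0 || c > 0xFF)
        return 0;                         /* not a data byte */

    size_t len = utf8_seq_len((unsigned char)c);
    if (len == 0)
        return 0;

    out[0] = (char)c;
    for (size_t i = 1; i < len; i++) {
        c = next(ctx);
        if (c < 0 || c > 0xFF || (c & 0xC0) != 0x80)
            return 0;                     /* expected a 10xxxxxx continuation byte */
        out[i] = (char)c;
    }
    out[len] = '\0';
    return len;
}
```

In a curses program the callback would be a thin wrapper that returns getch(). With keypad() enabled, function keys arrive as values above 0xFF, which is why the sketch treats anything outside 0 to 0xFF as "not a byte" instead of storing it.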
X/Open curses as such does not do this (for the rare individual actually using Unix curses with UTF-8, there's no specified way).