Nulik

Reputation: 7350

Reading ncurses stdin in UTF-8

In my Linux program, which I am developing in C with ncurses, I need to read stdin in UTF-8 encoding. However, whenever I do:

wint_t unicode_char=0;
get_wch(&unicode_char);

I get the wide character in UTF-16 encoding (I can see this when I dump the variable with gdb). I do not want to convert it from UTF-16 to UTF-8; I want to force the input to be UTF-8 all the time, no matter which Linux distribution runs my program or which language the user has configured. How is this done? Is it possible?

EDIT: Here is the example source, and proof that get_wch internally uses UTF-16 (which is the same as UTF-32) and not UTF-8, even though I configured a UTF-8 input source with setlocale().

[niko@dev1 ncurses]$ gcc -g -o getch -std=c99 $(ncursesw5-config --cflags --libs) getch.c 
[niko@dev1 ncurses]$ cat getch.c 
#define _GNU_SOURCE
#include <locale.h>
#include <ncursesw/ncurses.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int ct;
wint_t unichar;

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, ""); /* make sure UTF8 */
    initscr();
    raw();
    keypad(stdscr, TRUE);
    ct = get_wch(&unichar); /* read character */
    mvprintw(24, 0, "Key pressed is = %4x ", unichar);

    refresh();
    getch();
    endwin();
    return 0;
}

Testing code with GDB:

🔎
Breakpoint 1, main (argc=1, argv=0x7fffffffded8) at getch.c:18
18      mvprintw(24, 0, "Key pressed is = %4x ", unichar);
Missing separate debuginfos, use: dnf debuginfo-install ncurses-libs-5.9-21.20150214.fc23.x86_64
(gdb) print unichar
$1 = 128270
(gdb) print/x ((unsigned short*) (&unichar))[0]
$2 = 0xf50e
(gdb) print/x ((unsigned short*) (&unichar))[1]
$3 = 0x1
(gdb) print/x ((unsigned char*) (&unichar))[0]
$4 = 0xe
(gdb) print/x ((unsigned char*) (&unichar))[1]
$5 = 0xf5
(gdb) print/x ((unsigned char*) (&unichar))[2]
$6 = 0x1
(gdb) print/x ((unsigned char*) (&unichar))[3]
$7 = 0x0
(gdb) 

The input character is 🔎, and its UTF-8 encoding should be 'f09f948e', as stated here: http://www.fileformat.info/info/unicode/char/1f50e/index.htm

How do I get UTF-8 directly from get_wch()? Or is there maybe another function?

P.S. If you test the source code, link against '-lncursesw', not '-lncurses', or compile with the same command I used above.

Upvotes: 4

Views: 2575

Answers (1)

Thomas Dickey

Reputation: 54465

Short: you don't get UTF-8 from get_wch. That returns a wint_t (and a status code).
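
If the UTF-8 bytes are wanted anyway, the wint_t can be encoded by hand. The following is only a minimal sketch, assuming that the wint_t holds the Unicode code point (as it does with glibc on Linux, which matches the 0x1f50e visible in the gdb dump above). codepoint_to_utf8 is a hypothetical helper written for illustration; it is not an ncurses function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ncurses): encode one Unicode code point
 * as UTF-8 into out[], which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate or lies
 * beyond U+10FFFF. */
size_t codepoint_to_utf8(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx + 2 continuation */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx + 3 continuation */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* outside the Unicode range */
}
```

With the value from the question, codepoint_to_utf8(0x1F50E, buf) fills buf with f0 9f 94 8e. Before converting, check whether get_wch returned KEY_CODE_YES, because in that case the wint_t is a function-key code and not a character.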

Long: you would get UTF-8 from ncurses' getch, because ncurses converts to and from wchar_t internally:

  • Your program would have to read the encoded character one byte at a time, because getch only returns bytes (possibly combined with video attributes).
  • ncurses stores wchar_t values in the cells of each window structure.
  • addch and friends attempt to collect bytes for multibyte encodings (this is not specific to UTF-8, but it is not much used for anything else).
  • The attempt fails if you move the cursor in the middle of a string.
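
The byte-at-a-time reading described in the first point could be sketched roughly as follows. This is an illustration only, not how ncurses or dialog implement it: struct utf8_acc and utf8_acc_feed are hypothetical names, the length is derived from the lead byte alone, and overlong or otherwise ill-formed sequences are not rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator (not an ncurses type); start with {0}. */
struct utf8_acc {
    unsigned char bytes[4]; /* bytes of the current/last sequence */
    size_t have;            /* bytes collected so far */
    size_t need;            /* total length implied by the lead byte */
};

/* Feed one value as returned by getch().
 * Returns the sequence length (1..4) once acc->bytes holds a complete
 * sequence, 0 if more bytes are needed, and -1 if ch cannot continue
 * the sequence (this includes ERR and KEY_* codes, which fall outside
 * 0..255). On -1 the accumulator is reset. */
int utf8_acc_feed(struct utf8_acc *acc, int ch)
{
    assert(acc != NULL);

    if (ch < 0 || ch > 0xFF) {             /* not a plain byte */
        acc->have = acc->need = 0;
        return -1;
    }
    if (acc->have == 0) {                  /* expecting a lead byte */
        if (ch < 0x80)                acc->need = 1;
        else if ((ch & 0xE0) == 0xC0) acc->need = 2;
        else if ((ch & 0xF0) == 0xE0) acc->need = 3;
        else if ((ch & 0xF8) == 0xF0) acc->need = 4;
        else { acc->need = 0; return -1; } /* stray continuation or invalid */
    } else if ((ch & 0xC0) != 0x80) {      /* expecting 10xxxxxx */
        acc->have = acc->need = 0;
        return -1;
    }

    acc->bytes[acc->have++] = (unsigned char)ch;
    if (acc->have < acc->need)
        return 0;                          /* sequence still incomplete */

    int len = (int)acc->have;
    acc->have = 0;                         /* ready for the next character */
    return len;
}
```

In a curses program one would call utf8_acc_feed(&acc, getch()) in a loop until it returns a non-zero value; a positive result means acc.bytes holds that many UTF-8 bytes.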

For what it's worth, dialog reads UTF-8 using getch. See its inputstr.c for how this works in practice.

X/Open curses as such does not do this (for the rare individual actually using Unix curses with UTF-8, there's no specified way).

Upvotes: 2
