Reputation: 15338

Perl + Curses: Expecting a UTF-8 encoded multibyte character from getchar(), but not getting any

I am trying out Bryan Henderson's Perl interface to the ncurses library: Curses

For a simple exercise, I try to obtain single characters typed on-screen. This is directly based off the NCURSES Programming HOWTO, with adaptations.

When I call the Perl library's getchar(), I expect to receive a character, possibly multibyte (It's a bit more complicated as explained in this part of the library manpage because one has to handle the special cases of function keys and no input, but that's just the usual curlicues).

It's the subroutine read1ch() in the code below.

This works well for ASCII characters, but doesn't work for characters above 0x7F. For example, when hitting è (Unicode 0x00E8, UTF-8: 0xC3, 0xA8), I actually obtain code 0xE8 instead of something UTF-8 encoded. Printing it out to the terminal for which LANG=en_GB.UTF-8 is not working and anyway I was expecting 0xC3A8.

What do I need to change to make it work, i.e. get the è either as a proper character or a Perl string?

The C code snipped for getchar() is here btw. Maybe it just didn't get compiled with C_GET_WCH set? How to find out?

Addenda

Addendum 1

Tried with setting the binmode using

binmode STDERR, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

which should fix any encoding issues because the terminal expects and sends UTF-8, but that didn't help.

Also tried setting the stream encoding with use open (not quite sure about the difference between this and the approach above), but that didn't help either

use open qw(:std :encoding(UTF-8));

Addendum 2

The manpage for the Perl Curses shim says:

If wget_wch() is not available (i.e. The Curses library does not understand wide characters), this calls wgetch() [get a 1-byte char from a curses window], but returns the values described above nonetheless. This can be a problem because with a multibyte character encoding like UTF-8, you will receive two one-character strings for a two-byte-character (e.g. "Ã" and "¤" for "ä").

This may be the case here, but wget_wch() does exist on this system.

Addendum 3

Tried to see what the C code does and added an fprintf directly into the multibyte handling code of curses/Curses-1.36/CursesFunWide.c, recompiled, didn't manage to override the system Curses.so with my own via LD_LIBRARY_PATH (why not? why is everything only working half of the time?), so replaced the system library directly in place (take THAT!).

#ifdef C_GET_WCH
    wint_t wch;
    int ret = wget_wch(win, &wch);
    if (ret == OK) {
        ST(0) = sv_newmortal();
        fprintf(stderr,"Obtained win_t 0x%04lx\n", wch);
        c_wchar2sv(ST(0), wch);
        XSRETURN(1);
    } else if (ret == KEY_CODE_YES) {
        XST_mUNDEF(0);
        ST(1) = sv_newmortal();
        sv_setiv(ST(1), (IV)wch);
        XSRETURN(2);
    } else {
        XSRETURN_UNDEF;
    }
#else

That's just a fat NOPE, when pressing ü one sees:

Obtained win_t 0x00fc

So the correct code is run, but the data is ISO-8859-1, not UTF-8. So it's wget_wch which behaves badly. So it's a curses config problem. Huh.

Addendum 4

It struck me that maybe ncurses was assuming default locale, i.e. C. To make it ncurses work with wide characters, one has to "initialize the locale", which probably means moving state from "unset" (and thus making ncurses fall back to C) to "set to what the system indicates" (which should be what is in the LANG environment variable). The man page for ncurses says:

The library uses the locale which the calling program has initialized. That is normally done with setlocale:

setlocale(LC_ALL, "");

If the locale is not initialized, the library assumes that characters are printable as in ISO-8859-1, to work with certain legacy programs. You should initialize the locale and not rely on specific details of the library when the locale has not been setup.

This didn't work either, but I feel that the solution is down that road.

Addendum 5

The win_t (apparently the same as wchar_t) conversion code from CursesWide.c, converts the wint_t (here seen as wchar_t) received from wget_wch() into a Perl string. SV is the "scalar value" type.

Here with two fprintf inserted to see what is going on:

static void
c_wchar2sv(SV *    const sv,
           wchar_t const wc) {
/*----------------------------------------------------------------------------
  Set SV to a one-character (not -byte!) Perl string holding a given wide
  character
-----------------------------------------------------------------------------*/
    if (wc <= 0xff) {
        char s[] = { wc, 0 };
        fprintf(stderr,"Not UTF-8 string: %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF);
        sv_setpv(sv, s);
        SvPOK_on(sv);
        SvUTF8_off(sv);
    } else {
        char s[UTF8_MAXBYTES + 1] = { 0 };
        char *s_end = (char *)UVCHR_TO_UTF8((U8 *)s, wc);
        *s_end = 0;
        fprintf(stderr,"UTF-8 string: %02x %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF, ((int)s[2])&0xFF);
        sv_setpv(sv, s);
        SvPOK_on(sv);
        SvUTF8_on(sv);
    }
}

Test code using perl-Curses

Tried with perl-Curses-1.36-9.fc30.x86_64
Tried with perl-Curses-1.36-11.fc31.x86_64

If you try it, hit BACKSPACE to get out of the loop, because CTRL-C is no longer interpreted.

A lot of code below, but the critical area is marked with ----- Testing:

#!/usr/bin/perl

# pmap -p PID
# shows the per process using 
# /usr/lib64/libncursesw.so.6.1
# /usr/lib64/perl5/vendor_perl/auto/Curses/Curses.so

# Trying https://metacpan.org/release/Curses

use warnings;
use strict;
use utf8;          # Meaning "This lexical scope (i.e. file) contains utf8"

use Curses;        # On Fedora: dnf install perl-Curses

# This didn't fix it 
# https://perldoc.perl.org/open.html

use open qw(:std :encoding(UTF-8));

# https://perldoc.perl.org/perllocale.html#The-setlocale-function

use POSIX ();
my $loc = POSIX::setlocale(&POSIX::LC_ALL, "");

# ---
# Surrounds the actual program
# ---

sub setup() {
   initscr();
   raw();
   keypad(1);
   noecho();
}

sub teardown {
   endwin();
}

# ---
# Mainly for prettyprinting
# ---

my $special_keys = setup_special_keys();

# ---
# Error printing
# ---

sub mt {
   return sprintf("%i: ",time());
}

sub ae {
   my ($x,$fname) = @_;
   if ($x == ERR) { 
      printw mt();
      printw "Got error code from '$fname': $x\n"
   }
}

# ---
# Where the action is
# ---

sub announce {
   my $res = printw "Type any character to see it in bold! (or backspace to exit)\n";
   ae($res, "printw");
   return { refresh => 1 }
}

sub read1ch {
   # Read a next character, waiting until it is there.
   # Use the wide-character aware functions unless you want to deal with
   # collating individual bytes yourself!
   # Readings:
   # https://metacpan.org/pod/Curses#Wide-Character-Aware-Functions
   # https://perldoc.perl.org/perlunicode.html#Unicode-Character-Properties
   # https://www.ahinea.com/en/tech/perl-unicode-struggle.html
   # https://hexdump.wordpress.com/2009/06/19/character-encoding-issues-part-ii-perl/
   my ($ch, $key) = getchar();
   if (defined $key) {
      # it's a function key
      printw "Function key pressed: $key"; 
      printw " with known alias '" . $$special_keys{$key} . "'" if (exists $$special_keys{$key});
      printw "\n";
      # done if backspace was hit
      return { done => ($key == KEY_BACKSPACE()) }
   }
   elsif (defined $ch) {
      # "$ch" should be a String of 1 character

      # ----- Testing

      printw "Locale: $loc\n";
      printw "Multibyte output test: öüäéèà периоду\n";
      printw sprintf("Received string '%s' of length %i with ordinal 0x%x\n", $ch, length($ch), ord($ch));

      {
         # https://perldoc.perl.org/bytes.html
         use bytes;
         printw sprintf("... length is %i\n"     , length($ch));
         printw sprintf("... contents are %vd\n" , $ch);
      }

      # ----- Testing

      return { ch => $ch }
   }
   else {
      # it's an error
      printw "getchar() failed\n";
      return {}
   }
}

sub feedback {
   my ($ch) = @_;
   printw "The pressed key is: ";
   attron(A_BOLD);
   printw("%s\n","$ch"); # do not print $txt directly to make sure escape sequences are not interpreted!
   attroff(A_BOLD);
   return { refresh => 1 }  # should refresh
}

sub do_curses_run {

   setup;

   my $done = 0;
   while (!$done) {
      my $bubl;
      $bubl = announce(); 
      refresh() if $$bubl{refresh};
      $bubl = read1ch();
      $done = $$bubl{done};
      if (defined $$bubl{ch}) {
         $bubl = feedback($$bubl{ch}); 
         refresh() if $$bubl{refresh};
      }
   }

   teardown;
}

# ---
# main
# ---

do_curses_run();


sub setup_special_keys {
   # the key codes on the left must be called once to resolve to a numeric constant!
   my $res = {
      KEY_BREAK()       => "Break key",
      KEY_DOWN()        => "Arrow down",
      KEY_UP()          => "Arrow up",
      KEY_LEFT()        => "Arrow left",
      KEY_RIGHT()       => "Arrow right",
      KEY_HOME()        => "Home key",
      KEY_BACKSPACE()   => "Backspace",
      KEY_DL()          => "Delete line",
      KEY_IL()          => "Insert line",
      KEY_DC()          => "Delete character",
      KEY_IC()          => "Insert char or enter insert mode",
      KEY_EIC()         => "Exit insert char mode",
      KEY_CLEAR()       => "Clear screen",
      KEY_EOS()         => "Clear to end of screen",
      KEY_EOL()         => "Clear to end of line",
      KEY_SF()          => "Scroll 1 line forward",
      KEY_SR()          => "Scroll 1 line backward (reverse)",
      KEY_NPAGE()       => "Next page",
      KEY_PPAGE()       => "Previous page",
      KEY_STAB()        => "Set tab",
      KEY_CTAB()        => "Clear tab",
      KEY_CATAB()       => "Clear all tabs",
      KEY_ENTER()       => "Enter or send",
      KEY_SRESET()      => "Soft (partial) reset",
      KEY_RESET()       => "Reset or hard reset",
      KEY_PRINT()       => "Print or copy",
      KEY_LL()          => "Home down or bottom (lower left)",
      KEY_A1()          => "Upper left of keypad",
      KEY_A3()          => "Upper right of keypad",
      KEY_B2()          => "Center of keypad",
      KEY_C1()          => "Lower left of keypad",
      KEY_C3 ()         => "Lower right of keypad",
      KEY_BTAB()        => "Back tab key",
      KEY_BEG()         => "Beg(inning) key",
      KEY_CANCEL()      => "Cancel key",
      KEY_CLOSE()       => "Close key",
      KEY_COMMAND()     => "Cmd (command) key",
      KEY_COPY()        => "Copy key",
      KEY_CREATE()      => "Create key",
      KEY_END()         => "End key",
      KEY_EXIT()        => "Exit key",
      KEY_FIND()        => "Find key",
      KEY_HELP()        => "Help key",
      KEY_MARK()        => "Mark key",
      KEY_MESSAGE()     => "Message key",
      KEY_MOUSE()       => "Mouse event read",
      KEY_MOVE()        => "Move key",
      KEY_NEXT()        => "Next object key",
      KEY_OPEN()        => "Open key",
      KEY_OPTIONS()     => "Options key",
      KEY_PREVIOUS()    => "Previous object key",
      KEY_REDO()        => "Redo key",
      KEY_REFERENCE()   => "Ref(erence) key",
      KEY_REFRESH()     => "Refresh key",
      KEY_REPLACE()     => "Replace key",
      KEY_RESIZE()      => "Screen resized",
      KEY_RESTART()     => "Restart key",
      KEY_RESUME()      => "Resume key",
      KEY_SAVE()        => "Save key",
      KEY_SBEG()        => "Shifted beginning key",
      KEY_SCANCEL()     => "Shifted cancel key",
      KEY_SCOMMAND()    => "Shifted command key",
      KEY_SCOPY()       => "Shifted copy key",
      KEY_SCREATE()     => "Shifted create key",
      KEY_SDC()         => "Shifted delete char key",
      KEY_SDL()         => "Shifted delete line key",
      KEY_SELECT()      => "Select key",
      KEY_SEND()        => "Shifted end key",
      KEY_SEOL()        => "Shifted clear line key",
      KEY_SEXIT()       => "Shifted exit key",
      KEY_SFIND()       => "Shifted find key",
      KEY_SHELP()       => "Shifted help key",
      KEY_SHOME()       => "Shifted home key",
      KEY_SIC()         => "Shifted input key",
      KEY_SLEFT()       => "Shifted left arrow key",
      KEY_SMESSAGE()    => "Shifted message key",
      KEY_SMOVE()       => "Shifted move key",
      KEY_SNEXT()       => "Shifted next key",
      KEY_SOPTIONS()    => "Shifted options key",
      KEY_SPREVIOUS()   => "Shifted prev key",
      KEY_SPRINT()      => "Shifted print key",
      KEY_SREDO()       => "Shifted redo key",
      KEY_SREPLACE()    => "Shifted replace key",
      KEY_SRIGHT()      => "Shifted right arrow",
      KEY_SRSUME()      => "Shifted resume key",
      KEY_SSAVE()       => "Shifted save key",
      KEY_SSUSPEND()    => "Shifted suspend key",
      KEY_SUNDO()       => "Shifted undo key",
      KEY_SUSPEND()     => "Suspend key",
      KEY_UNDO()        => "Undo key"
   };

   for (my $f = 1; $f <= 64; $f++) {
      $$res{KEY_F($f)} = "KEY_F($f)"   
   }

   return $res

}

Upvotes: 3

Answers (3)

ikegami

Reputation: 386461

[ This answer assumes libncursesw is available and being used. Trying to output "wide characters" without wide character support makes no sense :) ]

Short Answer

getchar works fine. It returns a string of Unicode Code Points (aka decoded text), which is ideal.

printw is broken, but it can be made to accept a string of Unicode Code Points (aka decoded text) by adding the following to the program:

{
   # Add wide character support to printw.
   # This only modifies the current package (main),
   # so it won't affect any code by ours.
   no warnings qw( redefine );
   sub printw { addstring(sprintf shift, @_) }
}

Is There a Problem with getchar?

So you believe there's a problem with getchar. Let's try to confirm that by checking what getchar returns. We'll do that by adding the following:

printw("String received from getchar: %vX\n", $ch);

(%vX will print the value each character of the string in hex, joined by periods.)

When pressing e (U+0065), a 7-bit char, one sees:
```
String received from getchar: 65
```
When pressing é (U+00E9), an 8-bit char, one sees:
```
String received from getchar: E9
```
When pressing ē (U+0113), a 9-bit char, one sees:
```
String received from getchar: 113
```

In all three cases, we get a string that's exactly one character long, and that character consists of the Unicode Code Point of the input.^[1] This is exactly what we want. Applying and removing character encodings should be done at the periphery so that the main logic of the program doesn't have to worry about encodings, and this is being done.

Conclusion: There is no problem with getchar.

Is There a Problem with printw?

So the problem must be with the output. To confirm this, I added the following to your program:

sub _d { utf8::downgrade( my $s = shift ); $s }
sub _u { utf8::upgrade(   my $s = shift ); $s }

for (
   [ "7-bit, UTF8=0" => _d(chr(0x65)) ],   # Expect e
   [ "7-bit, UTF8=1" => _u(chr(0x65)) ],   # Expect e
   [ "8-bit, UTF8=0" => _d(chr(0xE9)) ],   # Expect é
   [ "8-bit, UTF8=1" => _u(chr(0xE9)) ],   # Expect é
   [ "9-bit, UTF8=1" => chr(0x113)    ],   # Expect ē
) {
   my ($name, $chr) = @$_;
   printw("%s: %s\n", $name, $chr);
}

Output:

7-bit, UTF8=0: e
7-bit, UTF8=1: e
8-bit, UTF8=0:
8-bit, UTF8=1: é
9-bit, UTF8=1:  S

From the above, we observe:

We see a difference between the results for _d(chr(0xE9)) and _u(chr(0xE9)) even though both scalars contain the same string (_d(chr(0xE9)) eq _u(chr(0xE9)) is true). This function therefore suffers from The Unicode Bug.
Based on the 8-bit tests, it appears to accept Unicode Code Points (decoded text) rather than UTF-8. This is ideal.
Based on the 9-bit test, it doesn't appear to accept Unicode Code Points. Subsequent testing shows it doesn't accept the UTF-8 encoding of chr(0x113) either.

Conclusion: There are major problems with printw.

Fixing the Problem with printw

Working around The Unicode Bug is easy, but the lack of support for characters above 0xFF is a show-stopper. Let's dig into the code.

Ok, we don't have to look far for the problem. We see that printw is defined in terms of addstr, and addstr predates wide character support. addstring is the counterpart with wide character support, so let's make printw use addstring instead of addstr.

{
   # Add wide character support to printw.
   # This only modifies the current package (main),
   # so it won't affect any code by ours.
   no warnings qw( redefine );
   sub printw { addstring(sprintf shift, @_) }
}

Output:

7-bit, UTF8=0: e
7-bit, UTF8=1: e
8-bit, UTF8=0: é
8-bit, UTF8=1: é
9-bit, UTF8=1: ē

Bingo!

From the above, we observe:

We see no differences between the results for the UTF8=0 tests and their corresponding UTF8=1 tests. Therefore, this function doesn't suffer from The Unicode Bug.
It accepts strings of Unicode Code Points (decoded text) throughout. Notably, it's not expecting UTF-8 or the locale's encoding.

This is exactly what we expect/desire.

Specifically, getchar isn't returning the iso-8859-1 encoding of the input as you believed. The confusion is understandable because Unicode is an extension of iso-8859-1.

Upvotes: 5

David Tonhofer

Reputation: 15338

Thomas Dickey correctly noted that the right data is received.

This took me some time to actually ascertain.

The confusion s down to the fact that that Perl's sprintf can't handle UTF-8 and Perl Curses printw can't handle the region 0x80 to 0x7F.

This took even longer to ascertain.

In fact, I have opened a new question about this:

Are there one (or two) solid bugs in the `curses` shim for Perl?

Upvotes: 2

Thomas Dickey

Reputation: 54573

Actually it looks correct.

Running your script with strace can help... I did this to see the system calls:

strace -fo strace.out -s 1024 ./foo

and could see the reads, messages, etc. Getting a similar trace for ncurses could be done using a debug-library, though packagers haven't been consistent about providing one with tracing enabled.

ü in UTF-8 is \303\274 (octal), and its Unicode value is 252 (decimal), or 0xfc (hexadecimal). This part of the question seems to have missed that point:

That's just a fat NOPE, when pressing ü one sees:

Obtained win_t 0x00fc

So the correct code is run, but the data is ISO-8859-1, not UTF-8. So it's wget_wch which behaves badly. So it's a curses config problem. Huh.

wget_wch returns (for practical purposes) a Unicode value (not a sequence of UTF-8 bytes). The ISO-8859-1 codes 160-255 happen to (not coincidentally) match the Unicode code-points, though the latter would certainly be encoded differently in UTF-8.

wgetch would return the UTF-8 bytes, but the Perl script would only use that as a fallback (since that would lead to having the Perl script convert UTF-8 strings to Unicode values).