ceving
ceving

Reputation: 23866

How to pass Unicode correctly to Perl's DBI layer?

I have the following simple Perl wrapper for a Sqlite database:

#! /usr/bin/perl

use strict;
use warnings;
use DBI;
use Data::Dumper;

my $sql = shift;

my $dbh = DBI->connect(
    "dbi:SQLite:dbname=data.sqlite3",
    "", # no user
    "", # no pw
    {
        RaiseError => 1,
        sqlite_unicode => 1
    },
) || die $DBI::errstr;

my $sth = $dbh->prepare($sql);
$sth->execute();

print Dumper ($sth->fetchall_arrayref({}));

$sth->finish();
$dbh->disconnect();

Although I have set sqlite_unicode flag, as explained in the documentation, I can not execute queries containing Unicode characters:

$ ./sqlite.pl "select * from person where lastname = 'Schütte'"
$VAR1 = [];

When I mask the 'ü' it seems to work, although I am not sure, if the \x{fc} means Latin 1 FC or Unicode U+00FC.

$ ./sqlite.pl "select * from person where lastname like 'Sch%tte'"
$VAR1 = [
          {
            'id' => 8,
            'firstname' => undef,
            'lastname' => "Sch\x{fc}tte"
          }
        ];

When I do the same with the Sqlite command line tool it works fine:

$ sqlite3 data.sqlite3 "select * from person where lastname = 'Schütte'"
8||Schütte

Did I forget anything to tell the DBI layer to support Unicode characters?

My local encoding is UTF-8:

$ locale
LANG=de_DE.utf8
LANGUAGE=
LC_CTYPE="de_DE.utf8"
LC_NUMERIC="de_DE.utf8"
LC_TIME="de_DE.utf8"
LC_COLLATE="de_DE.utf8"
LC_MONETARY="de_DE.utf8"
LC_MESSAGES="de_DE.utf8"
LC_PAPER="de_DE.utf8"
LC_NAME="de_DE.utf8"
LC_ADDRESS="de_DE.utf8"
LC_TELEPHONE="de_DE.utf8"
LC_MEASUREMENT="de_DE.utf8"
LC_IDENTIFICATION="de_DE.utf8"
LC_ALL=

Upvotes: 3

Views: 1186

Answers (1)

ikegami
ikegami

Reputation: 385917

The first question you need to ask yourself is:

Should the query be encoded (and if so, in which encoding), or should it be decoded (i.e. Unicode Code Points)?

This should be based on sqlite_unicode, but it's not. That only affects returned strings. I suspect that DBD::SQLite, like most/all DBDs, suffer from The Unicode Bug, which means the answer to the question is based on internal details of how the string is stored.

In a UTF-8 system, it usually "just works", and I'm surprised it doesn't here. Please try

use Encode qw( decode_utf8 );

my $sql = decode_utf8(shift);

If that doesn't work, please provide the output the following (before you shift the query out of @ARGV):

use Devel::Peek qw( Dump );
Dump($ARGV[0]);

When I mask the 'ü' it seems to work, although I am not sure, if the \x{fc} means Latin 1 FC or Unicode U+00FC.

The question doesn't make any sense. In both iso-latin-1 and Unicode, FC is "ü". I believe you are actually asking

Are the returned strings encoded using iso-latin-1, or are they decoded strings (i.e. Unicode Code Points)?

With sqlite_unicode, they are the latter. You'll need to encode your output. You can do so as follows:

use open ':std', ':encoding(UTF-8)';

Upvotes: 3

Related Questions