Reputation: 131142
On some Postgres installs I am noticing that the following happens:
sam=# select '你好 世界'::tsvector;
tsvector
---------------
'世界' '你好'
(1 row)
sam=# select to_tsvector('simple', '你好 世界');
to_tsvector
-------------

(1 row)
Even though my db is configured like so:
MBA:bin sam$ ./psql -l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+-------+----------+-------------+-------------+-------------------
postgres | sam | UTF8 | en_AU.UTF-8 | en_AU.UTF-8 |
sam | sam | UTF8 | en_AU.UTF-8 | en_AU.UTF-8 |
template0 | sam | UTF8 | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam +
| | | | | sam=CTc/sam
template1 | sam | UTF8 | en_AU.UTF-8 | en_AU.UTF-8 | =c/sam +
| | | | | sam=CTc/sam
(4 rows)
On other similar setups I am seeing select to_tsvector('simple', '你好 世界'); correctly return the tokens.
How do I diagnose the simple tokeniser to figure out why it is chucking out these letters?
The simplest repro seems to be installing Postgres via Postgres.app. It does not happen when installing Postgres on Ubuntu with a locale set.
Upvotes: 9
Views: 4283
Reputation: 6347
Unfortunately, the default parser used by text search depends heavily on how the database was initialized, in particular on lc_collate and the encoding of the current database.
This is due to the inner workings of the default text parser, which are only vaguely documented:
Note: The parser's notion of a "letter" is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them.
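Because of this, it is worth checking which ctype and collation the affected database was actually created with. A query along these lines (just a sketch against the pg_database catalog) should show it:
SELECT datname, datcollate, datctype, pg_encoding_to_char(encoding) AS encoding
FROM pg_database
WHERE datname = current_database();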
The important part is these comments in the PostgreSQL source code:
/* [...]
* Notes:
* - with multibyte encoding and C-locale isw* function may fail
* or give wrong result.
* - multibyte encoding and C-locale often are used for
* Asian languages.
* - if locale is C then we use pgwstr instead of wstr.
*/
and below:
/*
* any non-ascii symbol with multibyte encoding with C-locale is
* an alpha character
*/
Consequently, if you want to use the default parser with Chinese, make sure your database is initialized with the C locale and a multibyte encoding, so that all characters above U+007F are treated as alpha (including spaces such as IDEOGRAPHIC SPACE, U+3000!). Typically, the following initdb call will do what you expect:
initdb --locale=C -E UTF-8
Otherwise, Chinese characters will be skipped and treated as blank.
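If re-running initdb on an existing cluster is not practical, a sketch of an alternative is to create a new database with the C locale from template0 (the database name here is only a placeholder; template0 is used because template1 may carry a different collation):
CREATE DATABASE zh_search
  TEMPLATE template0
  ENCODING 'UTF8'
  LC_COLLATE 'C'
  LC_CTYPE 'C';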
You can check this with the debug function ts_debug. With a database initialized with lc_collate=en_US.UTF-8, or any other configuration where tokenization fails, you will get:
SELECT * FROM ts_debug('simple', '你好 世界');
alias | description | token | dictionaries | dictionary | lexemes
-------+---------------+-----------+--------------+------------+---------
blank | Space symbols | 你好 世界 | {} | |
Conversely, with lc_collate=C and a UTF-8 database (initialized as above), you will get the proper result:
SELECT * FROM ts_debug('simple', '你好 世界');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+--------------+------------+---------
word | Word, all letters | 你好 | {simple} | simple | {你好}
blank | Space symbols | | {} | |
word | Word, all letters | 世界 | {simple} | simple | {世界}
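As a hedged illustration of the U+3000 caveat above: in such a C-locale, UTF-8 database, IDEOGRAPHIC SPACE is itself treated as an alpha character, so it should not split words (chr(12288) produces U+3000 in a UTF-8 database):
SELECT to_tsvector('simple', '你好' || chr(12288) || '世界');
-- expected: a single lexeme containing both words joined by the U+3000 character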
It seems, however, that you mean to tokenize Chinese text in which the words are already separated by regular spaces, i.e. tokenization/segmentation does not need to happen within PostgreSQL. For this use case, I strongly suggest using a custom parser. This is especially true if you do not use other features of the PostgreSQL simple parser, such as tokenizing URLs.
A parser that tokenizes on space characters is very easy to implement. In fact, there is sample code in contrib/test_parser doing exactly that. This parser works regardless of the locale. There was a buffer overrun bug in this parser that was fixed in 2012, so make sure you use a recent version.
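If pulling in a C extension is more than you need and the words really are always separated by regular spaces, another option (a sketch, assuming PostgreSQL 9.6 or later for array_to_tsvector; positions and weights are lost) is to split the text in SQL and build the tsvector directly:
SELECT array_to_tsvector(regexp_split_to_array('你好 世界', '\s+'));
-- yields '世界' '你好' regardless of the database locale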
Upvotes: 19