user3871

Reputation: 12718

Using to_tsvector and to_tsquery to filter non-Roman characters

I want to allow multilingual search support for my app.

The PostgreSQL 9.6 documentation (Search Controls) says I need tsvector and tsquery to properly parse and normalize text. This works fine for languages written in Roman script, but not for non-Roman characters.

Consider this search snippet:

where to_tsvector(title) @@ to_tsquery('hola')

I am looking for a title with "hola mi amiga", and it is found. However, given:

where to_tsvector(title) @@ to_tsquery('你') -- language = Chinese, code = zh-CN

I am looking for a title with 你好嗎 and it is not found.

What do I need to consider so that string normalization works with non-Roman characters?

Upvotes: 1

Views: 1169

Answers (1)

Evan Carroll

Reputation: 1

Make sure you set the text search configuration correctly. From the documentation:

default_text_search_config (string) Selects the text search configuration that is used by those variants of the text search functions that do not have an explicit argument specifying the configuration. See Chapter 12 for further information. The built-in default is pg_catalog.simple, but initdb will initialize the configuration file with a setting that corresponds to the chosen lc_ctype locale, if a configuration matching that locale can be identified.

You can see the current value with

SHOW default_text_search_config;
or SELECT get_current_ts_config();

You can change it for the current session with SET default_text_search_config = newconfiguration; or, to make it the default for new sessions in one database, use ALTER DATABASE <db> SET default_text_search_config = newconfiguration;

From Chapter 12, Full Text Search:

During installation an appropriate configuration is selected and default_text_search_config is set accordingly in postgresql.conf. If you are using the same text search configuration for the entire cluster you can use the value in postgresql.conf. To use different configurations throughout the cluster but the same configuration within any one database, use ALTER DATABASE ... SET. Otherwise, you can set default_text_search_config in each session.

Each text search function that depends on a configuration has an optional regconfig argument, so that the configuration to use can be specified explicitly. default_text_search_config is used only when this argument is omitted.
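
As a rough illustration of the optional regconfig argument described above, the following sketch runs the same text through two built-in configurations. The sample sentence is made up for the example; the results in the comments are what the stock english and simple configurations produce.

```sql
-- No configuration given: default_text_search_config is used.
SELECT to_tsvector('The quick brown foxes');

-- Explicit configuration: stop words are dropped and words are stemmed.
SELECT to_tsvector('english', 'The quick brown foxes');
-- 'brown':3 'fox':4 'quick':2

-- The simple configuration only lowercases; nothing is stemmed or dropped.
SELECT to_tsvector('simple', 'The quick brown foxes');
-- 'brown':3 'foxes':4 'quick':2 'the':1
```

Passing the configuration explicitly on both sides of @@ keeps the document and the query normalized the same way, whatever the session default happens to be.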

In psql, you can use \dF to see the text search configurations you have installed.
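
Outside psql, roughly the same list can be read from the system catalog. This is a minimal sketch; \dF itself shows a bit more detail, such as the schema and description.

```sql
-- List the names of the installed text search configurations.
SELECT cfgname
FROM pg_catalog.pg_ts_config
ORDER BY cfgname;
```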

So what you want is something like this:

where to_tsvector('newconfig', title) @@ to_tsquery('newconfig', '你')

I don't know what language the query is in, or which configuration will properly stem that language, so I can't name the exact configuration to use.
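
One way to check whether a given configuration handles your text is to look at how it gets tokenized. The sketch below assumes the built-in simple configuration and the title from the question; the exact tokens depend on the database encoding and locale, so treat it as a diagnostic rather than a fix. If the whole string comes back as a single token, a query for '你' alone cannot match it, although a prefix query (the :* syntax) may.

```sql
-- Show how the parser splits the title and which lexemes result.
SELECT alias, token, lexemes
FROM ts_debug('simple', '你好嗎');

-- Prefix match: true if any lexeme in the document starts with 你.
SELECT to_tsvector('simple', '你好嗎') @@ to_tsquery('simple', '你:*');
```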

Upvotes: 1
