stach
stach

Reputation: 2171

Postgresql full text search in postgresql - japanese, chinese, arabic

I'm designing a fulltext search function in postgresql for my current project. It works ok with ispell/myspell dictionaries so far. Now I need to add support for chinese, japanese and arabic search. Where do I start? There are no templates or dictionaries available for those languages as far as I can see. Will it work with pg_catalog.simple configuration?

Upvotes: 15

Views: 7715

Answers (4)

Jacques
Jacques

Reputation: 1093

For those who are landing here for PostgreSQL fulltext search in Japanese, here is the way to do it on ubuntu:

Install the following softwares and development environments:

apt-get install libmecab-dev libmecab2 mecab-ipadic-utf8 mecab-utils libmecab-perl libtext-mecab-perl mecab mecab-jumandic-utf8

Download textsearch_ja from https://www.postgresql.org/ftp/projects/pgFoundry/textsearch-ja/textsearch_ja/9.0.0/

The following is for those using version 12 of PostgreSQL:

cd textsearch_ja-9.0.0
make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/12/bin/pg_config
sudo make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/12/bin/pg_config install

which will yield something like:

/bin/mkdir -p '/usr/lib/postgresql/12/lib'
/bin/mkdir -p '/usr/share/postgresql/12/contrib'
/usr/bin/install -c -m 755  textsearch_ja.so '/usr/lib/postgresql/12/lib/textsearch_ja.so'
/usr/bin/install -c -m 644 .//uninstall_textsearch_ja.sql textsearch_ja.sql '/usr/share/postgresql/12/contrib/'
/bin/mkdir -p '/usr/lib/postgresql/12/lib/bitcode/textsearch_ja'
/bin/mkdir -p '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/ '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/pgut/
/usr/bin/install -c -m 644 textsearch_ja.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 encoding_eucjp.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 encoding_utf8.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/./
/usr/bin/install -c -m 644 pgut/pgut-be.bc '/usr/lib/postgresql/12/lib/bitcode'/textsearch_ja/pgut/
cd '/usr/lib/postgresql/12/lib/bitcode' && /usr/lib/llvm-10/bin/llvm-lto -thinlto -thinlto-action=thinlink -o textsearch_ja.index.bc textsearch_ja/textsearch_ja.bc textsearch_ja/encoding_eucjp.bc textsearch_ja/encoding_utf8.bc textsearch_ja/pgut/pgut-be.bc

Then, in the automatically generated textsearch_ja.sql, change LANGUAGE='C' to LANGUAGE='c' (in lowercase):

perl -pi -E "s/LANGUAGE 'C'/LANGUAGE 'c'/" textsearch_ja.sql

Then, you can use the superuser to add it to PostgreSQL:

sudo -u postgres psql -f textsearch_ja.sql

and if you want to add it for an existing database:

sudo -u postgres psql -d my_database -f textsearch_ja.sql

which will yield something like:

SET
BEGIN
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE TEXT SEARCH PARSER
COMMENT
CREATE FUNCTION
CREATE TEXT SEARCH TEMPLATE
CREATE TEXT SEARCH DICTIONARY
CREATE TEXT SEARCH CONFIGURATION
COMMENT
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION
ALTER TEXT SEARCH CONFIGURATION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
COMMIT

To test it, as per https://github.com/HiraokaHyperTools/textsearch_ja

SELECT ja_wakachi('分かち書きを行います。');

would yield:

         ja_wakachi
----------------------------
 分かち書き を 行い ます 。
(1 row)
SELECT furigana('漢字の読みをカタカナで返します。');

would yield:

               furigana
--------------------------------------
 カンジノヨミヲカタカナデカエシマス。
(1 row)
SELECT * FROM ts_debug('japanese', E'日\n本\n語\n文\n字\n中\nの\n改\n行\nは\n除\n去\n');

would yield:

 alias |    description    | token  |  dictionaries   |  dictionary   | lexemes
-------+-------------------+--------+-----------------+---------------+----------
 word  | Word, all letters | 日本語 | {japanese_stem} | japanese_stem | {日本語}
 word  | Word, all letters | 文字   | {japanese_stem} | japanese_stem | {文字}
 word  | Word, all letters | 中     | {japanese_stem} | japanese_stem | {中}
 blank | Space symbols     | の     | {}              |               |
 word  | Word, all letters | 改行   | {japanese_stem} | japanese_stem | {改行}
 blank | Space symbols     | は     | {}              |               |
 word  | Word, all letters | 除去   | {japanese_stem} | japanese_stem | {除去}
(7 rows)
SELECT * FROM ts_debug('japanese', E'Line\nbreaks\nin\nEnglish\ntext\nare\nreserved.');

would yield:

   alias   |   description   |  token   |  dictionaries  |  dictionary  |  lexemes
-----------+-----------------+----------+----------------+--------------+-----------
 asciiword | Word, all ASCII | Line     | {english_stem} | english_stem | {line}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | breaks   | {english_stem} | english_stem | {break}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | in       | {english_stem} | english_stem | {}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | English  | {english_stem} | english_stem | {english}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | text     | {english_stem} | english_stem | {text}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | are      | {english_stem} | english_stem | {}
 blank     | Space symbols   |          | {}             |              |
 asciiword | Word, all ASCII | reserved | {english_stem} | english_stem | {reserv}
 blank     | Space symbols   | .        | {}             |              |
(14 rows)
SELECT * FROM ts_debug('japanese', '日本語とEnglishがmixedな文も解析OKです。');

would yield:

   alias   |    description    |  token  |  dictionaries   |  dictionary   |  lexemes
-----------+-------------------+---------+-----------------+---------------+-----------
 word      | Word, all letters | 日本語  | {japanese_stem} | japanese_stem | {日本語}
 blank     | Space symbols     | と      | {}              |               |
 asciiword | Word, all ASCII   | English | {english_stem}  | english_stem  | {english}
 blank     | Space symbols     | が      | {}              |               |
 asciiword | Word, all ASCII   | mixed   | {english_stem}  | english_stem  | {mix}
 blank     | Space symbols     | な      | {}              |               |
 word      | Word, all letters | 文      | {japanese_stem} | japanese_stem | {文}
 blank     | Space symbols     | も      | {}              |               |
 word      | Word, all letters | 解析    | {japanese_stem} | japanese_stem | {解析}
 asciiword | Word, all ASCII   | OK      | {english_stem}  | english_stem  | {ok}
 blank     | Space symbols     | です    | {}              |               |
 blank     | Space symbols     | 。      | {}              |               |
(12 rows)
SELECT s
  FROM regexp_split_to_table(to_tsvector('japanese',
 '語尾は基本形に戻されます。')::text, ' ') AS t(s)
  ORDER BY s;

would yield:

     s
------------
 'れる':4
 '基本形':2
 '戻す':3
 '語尾':1
(4 rows)
SELECT s
   FROM regexp_split_to_table(to_tsvector('japanese',
 'ユーザとユーザーは正規化されます。ミラーとミラは別扱い。')::text, ' ') AS t(s)
  ORDER BY s;

would yield:

      s
--------------
 'する':5
 'ミラー':7
 'ミラ':8
 'ユーザ':1,2
 'れる':6
 '別':9
 '化':4
 '扱い':10
 '正規':3
(9 rows)

Upvotes: 2

Howardsun
Howardsun

Reputation: 116

The similar solution of link at stackoverflow.com is How do I implement full text search in Chinese on PostgreSQL? .

Although that, I would provide a solution below in detail based on my experience and a solution on Internet. I use both tools of SCWS and zhparser as the solution of Chinese full-text search in postgres.

20160131 Update:
You must check whether you have installed postgresql-server-devel-{number version} because we will use pgxs function from it for creating extension in postgresql.

Step1: install SCWS.
It's remarkable that --prefix=/usr/local/scws follows ./configure . Not just has ./configure along in below 4th line.

wget http://www.xunsearch.com/scws/down/scws-1.2.2.tar.bz2
tar xvjf scws-1.2.2.tar.bz2
cd scws-1.2.2
./configure --prefix=/usr/local/scws 
make
make install

To check whether it installed successfully, please enter below command:

ls -al /usr/local/scws/lib/libscws.la


Step2: Install zhparser

git clone https://github.com/amutu/zhparser.git
cd zhparser
SCWS_HOME=/usr/local/scws/include make && make install

20160131 Update: If you use Mac OS X Yosemite, aboved value of SCWS_HOME is same. But if you use Ubuntu 14.04 LTS, please change value of SCWS_HOME to /usr/local/scws .

Step3: Configure a new extension using zhparser in Postres
Step3.1: Login your postgres database through terminal/commandline

psql yourdatabasename

Step3.2: Create extension in Postgres. You could specify what dictionary name you want.

CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION dictionarynameyouwant (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION dictionarynameyouwant ADD MAPPING FOR n,v,a,i,e,l WITH simple;


If you follow above steps, you can use the function of Postgres full-text searching in Chinese/Mandarin words.

Extra step(not necessary) in Rails for using pg_search gem: Step4. Configure the dictionary name at :dictionary attribute of :tsearch in app/models/yourmodel.rb

class YourOwnClass < ActiveRecord::Base
    ...
    include PgSearch
    pg_search_scope :functionnameyoulike, :against => [columnsyoulike1, columnsyoulike2, ...,etc], :using => { :tsearch => {:dictionary => "dictionary name you just specified in creating a extension in postgres", blah blah blah, ..., etc} }
end

Reference:
1. SCWS install tutorial
2. [email protected]
3. Francs' Post - Postgres full-text search in Chinese with zhparser and SCWS
4. Rails365.net's Post - Postgres full-text search in Chinese with pg_search gem with zhparser
5. My Post at xuite.net - Make Postgres support full text search in Mandarin/Chinese

Upvotes: 3

simon
simon

Reputation: 16340

Dictionaries won't help you too much with Chinese - you'll need to look in to NGRAM tokenising...

Upvotes: 3

Frank Heikens
Frank Heikens

Reputation: 127556

Just a hint from the manual: A large list of dictionaries is available on the OpenOffice Wiki.

Upvotes: 4

Related Questions