Reputation: 58
I am looking into PostgreSQL full text search and would like to know if it is possible to search for the second part of compound words.
-- Let's have a table like this:
CREATE TABLE IF NOT EXISTS table1(
id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
col1 TEXT,
col1_tsv TSVECTOR
);
CREATE INDEX IF NOT EXISTS col1_index ON table1 USING gin(col1_tsv);
-- Insert some data into it:
INSERT INTO table1 (col1, col1_tsv)
VALUES ('Blacklist', TO_TSVECTOR('english', 'Blacklist')),
       ('Cheesecake', TO_TSVECTOR('english', 'Cheesecake'));
If I search for 'cake' or 'list' I get no results.
SELECT col1 FROM table1 WHERE col1_tsv @@ to_tsquery('english', 'list');
SELECT col1 FROM table1 WHERE col1_tsv @@ to_tsquery('english', 'cake');
Check it with ts_lexize:
select ts_lexize('english_stem','Blacklist');
select ts_lexize('english_stem','Cheesecake');
The output:
ts_lexize
-------------
{blacklist}
(1 row)
ts_lexize
-------------
{cheesecak}
(1 row)
This works as designed, but is there a way to find 'Cheesecake' just by searching for 'cake'? (And I do not mean:)
select * from table1 where col1 like '%cake%';
When I select the whole table, 'Cheesecake' has likewise been stemmed to 'cheesecak'.
select * from table1;
id | col1 | col1_tsv
----+------------+---------------
1 | Blacklist | 'blacklist':1
2 | Cheesecake | 'cheesecak':1
Upvotes: 3
Views: 1201
Reputation: 1670
There is a solution for such cases: you need a Hunspell dictionary for the languages you want to support. These dictionaries must also define compound word rules. If these requirements are met, Postgres can decompose compound words into their components and index them so that they become findable.
I show here an example for the language German, where many compound words are used:
First we need a suitable Hunspell dictionary with compound word rules. After a while of researching, I found one: https://github.com/vpikulik/hunspell_de_compounds. You can tell that a Hunspell dictionary defines compound word rules if its *.aff file contains a line like compoundwords controlled _ .
Rename the file extensions to *.affix and *.dict. Postgres expects them like this.
Postgres requires Hunspell dictionaries to be UTF-8 encoded. Thus, I open the *.affix and *.dict files with Sublime Text and use File, Save with Encoding, UTF-8 for both files.
You need to copy both files to the database machine (or container, etc.). Open a terminal on the database machine and move the files to the correct location:
destination="$(pg_config --sharedir)/tsearch_data"
mv de_DE.affix $destination
mv de_DE.dict $destination
Here, pg_config --sharedir yields the shared directory of the Postgres installation. The destination for the dictionary files is its tsearch_data subdirectory.
Connect to your database (locally or remotely), e.g. locally via the psql command.
Now, we create (a) our own text search dictionary and (b) our own text search configuration inside Postgres. We call both german_hunspell. Here is the code:
We drop any previously created configuration and dictionary, in case we want to repeat the process, e.g. because we want to use another dictionary:
DROP TEXT SEARCH DICTIONARY IF EXISTS german_hunspell CASCADE;
We create the dictionary:
CREATE TEXT SEARCH DICTIONARY german_hunspell
(TEMPLATE = ispell, DictFile = de_DE, AffFile = de_DE, Stopwords = german);
Because of DictFile = de_DE, Postgres expects a file de_DE.dict; because of AffFile = de_DE, it expects a file de_DE.affix.
We create a new text search configuration by deriving it from the german configuration shipped with Postgres:
CREATE TEXT SEARCH CONFIGURATION german_hunspell (COPY = german);
Next, we modify the configuration we just created. We define that Postgres should use our new dictionary for words of all kinds. If our Hunspell dictionary does not have a rule for a particular word, the request falls through to the default stemmer for German:
ALTER TEXT SEARCH CONFIGURATION german_hunspell
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH german_hunspell, german_stem;
Done. We can test that it works by using the ts_debug function:
SELECT * FROM ts_debug('german_hunspell', 'Wettersystemsimulationssoftware');
Here, Wettersystemsimulationssoftware is a German compound word. It gets split into wetter, system, simulation, software. When a user searches e.g. for system, Postgres will find this entry.
In order to use our configuration, you have to specify it in every to_tsvector, to_tsquery, websearch_to_tsquery, etc. call. Here are a few examples:
SELECT to_tsvector('german_hunspell', 'content goes here');
SELECT to_tsquery('german_hunspell', 'query goes here');
SELECT websearch_to_tsquery('german_hunspell', 'query goes here');
...
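As a quick end-to-end check (a sketch assuming the german_hunspell configuration created in the steps above), a search for one component of the compound now matches:

```sql
-- Assumes the german_hunspell configuration from above.
-- The compound is decomposed at indexing time, so 'system' matches:
SELECT to_tsvector('german_hunspell', 'Wettersystemsimulationssoftware')
       @@ to_tsquery('german_hunspell', 'system');
```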
This works from most (any?) client languages as well, e.g. from C#, as long as the driver allows you to specify the configuration to use. Some drivers, like the C# ones, use a binary protocol to communicate with the database. In such a case, you cannot address a configuration by its name, e.g. german_hunspell. Instead, you have to query for its OID like this:
SELECT oid from pg_catalog.pg_ts_config where cfgname = 'german_hunspell';
Then, you can cache this OID and work with it.
You can repeat this process for any language you want. Unfortunately, appropriate Hunspell dictionaries are not available for all languages. I would like to have one for English. Suppose my record contains the word Spaceship; then I would like users to be able to search for Ship and find that record. Unfortunately, after several hours of research, I have not found a suitable English dictionary. This is where the open source community could get active...
Upvotes: 4
Reputation: 656291
Full text search is capable of prefix matching. But that only works for left-anchored searches. Your pattern is right-anchored.
You could build an index on reversed strings and search with reversed patterns:
CREATE INDEX table1_col1_rtsv_idx ON table1 USING gin (TO_TSVECTOR('simple', reverse(col1)));
Then this query with prefix search can use the new index:
SELECT col1 FROM table1
WHERE to_tsvector('simple', reverse(col1))
@@ to_tsquery('simple', reverse('cake') || ':*');
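For illustration, here is what the reversed prefix query looks like; the reversed lexeme with the :* marker is what gets matched against the reversed vector:

```sql
SELECT to_tsquery('simple', reverse('cake') || ':*');
-- yields the prefix query 'ekac':*
```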
But I would consider a trigram index (provided by the additional module pg_trgm) instead:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX table1_col1_gin_trgm_idx ON table1 USING gin (col1 gin_trgm_ops);
Query:
SELECT col1 FROM table1
WHERE col1 LIKE '%cake';
Notably, the pattern is '%cake', not '%cake%', if "cake" shall be at the end of the string. But the trigram index supports the latter just as well:
SELECT col1 FROM table1
WHERE col1 LIKE '%cake%';
Text search indexes are typically substantially smaller than trigram indexes - and therefore a bit faster. There are lots of other subtle differences ...
Upvotes: 1
Reputation: 246093
If you want to handle composites correctly and are not interested in substring matching, I think you need a thesaurus dictionary. For each composite you want to search for, you would have to add entries like
cheesecak : cheesecak chees cak
blacklist : blacklist black list
That way, you retain the original word and add its parts.
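A sketch of wiring such a thesaurus file into a configuration (the file and object names here are my own invention; the entries above would go into a file like $SHAREDIR/tsearch_data/compound_parts.ths):

```sql
-- Hypothetical: compound_parts.ths in tsearch_data holds the entries above.
CREATE TEXT SEARCH DICTIONARY compound_parts (
    TEMPLATE = thesaurus,
    DictFile = compound_parts,     -- reads compound_parts.ths
    Dictionary = english_stem      -- subdictionary used to normalize entries
);

CREATE TEXT SEARCH CONFIGURATION english_compound (COPY = english);

ALTER TEXT SEARCH CONFIGURATION english_compound
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH compound_parts, english_stem;
```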
Annoying, but there is no automated way of detecting composites. For example, "havelock" has nothing to do with "lock", and a "haberdasher" need not "dash".
Upvotes: 1